cisco / libsrtp

Library for SRTP (Secure Realtime Transport Protocol)
Other
1.21k stars 474 forks source link

Change stream list to arrays and optimize stream lookup with SSE2 #508

Open Lastique opened 3 years ago

Lastique commented 3 years ago

The stream list in the SRTP context is now implemented with two arrays: an array of SSRCs and an array of pointers to the streams corresponding to the SSRCs. The streams no longer form a single linked list.

Stream lookup by SSRC is now performed over the array of SSRCs, which is considerably faster because it is more cache-friendly. Additionally, the lookup is optimized for SSE2, which provides an additional massive speedup with many streams in the list. Although the lookup still has linear complexity, its absolute times are reduced and with tens to hundreds elements are lower or comparable with a typical rb-tree equivalent.

Expected speedup of SSE2 version over the previous implementation:

SSRCs    speedup (scalar)   speedup (SSE2)

1        0.39x              0.22x
3        0.57x              0.23x
5        0.69x              0.62x
10       0.77x              1.43x
20       0.86x              2.38x
30       0.87x              3.44x
50       1.13x              6.21x
100      1.25x              8.51x
200      1.30x              9.83x

At small numbers of SSRCs the new algorithm is somewhat slower, but given that the absolute and relative times of the lookup are very small, that slowdown is not very significant.

The performance benchmark is written in C++ and is attached below, along with test results on a Core i7 2600K (running Ubuntu 20.04, gcc 9.3). The test emulates the current implementation by using std::list, and estimates a potential RB-tree implementation performance by using std::set.

This PR is somewhat related to https://github.com/cisco/libsrtp/pull/507. It relies on the change to config_in_cmake.h made in that PR so that the SSE2 code is enabled for MSVC. Without it MSVC will use scalar code. In all other respects this PR is self-contained (i.e. non-MSVC compilers will use SSE2 intrinsics code).

This PR is related to https://github.com/cisco/libsrtp/issues/417. Although it doesn't change the formal O(N) complexity, it does reduce lookup times significantly and shows comparable or better performance than std::set (see the attached test results).

ssrc_map_test.tar.gz

pabuhler commented 3 years ago

Hi, improving the performance of this list and associated lookup has been our list of things to do for a while. May I ask what is the typical size of the list that you use, ie what where optimizing for? In my most common use case the size of the list is usually less than 10 with a max of around 30.

Lastique commented 3 years ago

10-15 is a good estimate for our current common case, where we use 1 audio section (1 local and 1 remote SSRC) and up to 2 video sections with 2 or 3 simulcast substreams (1 media and 1 RTX SSRC per substream per side) in a bundle. We are planning to create more media sections, ideally bound by the number of participants in a conference, which is where the number of SSRCs may grow N-fold. We also plan to add support for FlexFEC, which will add +1 SSRC per video substream per side.

murillo128 commented 3 years ago

I am doing media server cascading, and I can have easily more than 100 ssrcs in the same bundle. This would be a great boost for my use case.

Lastique commented 2 years ago

Rebased on top of the current master.

nils-ohlmeier commented 2 years ago

Awesome contribution! Definitely super helpful contribution for conference servers.

I'm wondering what people think if this change should be a compile time option only. Because I assume that client side folks, e.g. the browsers and others, are not keen on a performance hit, even if it's not a big one. So the idea would be that at compile time you could choose between optimizations for low amount of RTP streams vs high amount (which hopefully should translate to client side vs server side).

murillo128 commented 2 years ago

how about splitting the PR in two:

Even if the second one is not merged, server devs could easily provide their own implementation of the lookup functions

Lastique commented 2 years ago

I don't think a compile-time option is worth the complication considering that the lookup among a few SSRCs is super fast however you implement it. Browsers have other, much more fruitful areas of optimizations.

Consider also that libraries such as libsrtp are usually built in one variant for system packaging, and that library may be used by any kind of application, server or client. So I would prefer the code base to remain unified and optimal.

However, if there are code reorganising or other requests from libsrtp maintainers that will help getting this merged, please let me know.

eakraly commented 2 years ago

Does not look like this question was asked: what is the change (speedup, if any) for non-SSE2 implementation? x86-64 is not the only platform out there - I run most of the fleet on arm64 (and thus SSE2 does not apply)

I am not sure if @murillo128 intent was that but being able to add additional optimized implementations does make sense

In any case - looking forward to this change merging!

Lastique commented 2 years ago

@eakraly The original description shows scalar performance on x86-64 platform. I don't have an ARM test system, but there is also a benchmark attached that you can run on your system.

As for optimizations for ARM/aarch64, those would be nice. I suppose, it should be doable, but you will probably have a problem with implementing PMOVMSKB in NEON or SVE, as there isn't a direct equivalent instruction. In any case, I don't have enough experience with NEON and can't test it anyway, so someone else will have to implement it in a separate PR.

murillo128 commented 2 years ago

that's why I think splitting the PR in two, one for creating an interface for ssrc lookup and another for adding the sse2 logic would be helpful.

Lastique commented 2 years ago

that's why I think splitting the PR in two, one for creating an interface for ssrc lookup and another for adding the sse2 logic would be helpful.

I can't separate the interface change from the implementation, as the interface implies the array-based implementation (i.e. indexes are used everywhere instead of pointers). IOW, I can't wrap the old implementation in the new interface. The SSE2 part of the PR is actually minor, most of it is plain C implementing the new container and interacting with it.

And the maintainers haven't expressed opinion if this approach is acceptable at all, SSE2 or not, or whether there are any organizational or technical changes needed. It looks like they just aren't interested.

pabuhler commented 2 years ago

@Lastique I wouldn´t say uninterested but maybe unresponsive. We did discuss this and the other PR at some point and had some concerns around having a "splattering" of SSE2 in the code base. I understand the current solution is not optimal for a larger set of SSRC´s so we need a solution here. I will try to follow up this PR in the coming weeks. Would be good though if there is someone else that can help review the SSE2 part of the PR. I do like the idea of abstracting the list in to a new type and as @murillo128 suggested placing it in a separate file so it can easily be tested and support different implementations if needed. Also keeps the srtp.c file free of non SRTP details.

Lastique commented 2 years ago

I've rebased onto the current master and:

I'm still publishing this as a single PR because there is not much incentive to merge the generic array-based implementation alone. The main performance winnings come from SSE2. Also, SSE2 code depends on the previous commit that implements the generic version. However, separate commits make it easier to review the generic implementation and SSE2 separately.

Lastique commented 2 years ago

We need to add an implementation of the stream list that is non-SSE based (i.e. the current linked list one).

The old linked list implementation is incompatible with the interface required by the SSE2 implementation and therefore is removed. The only generic (non-SSE) implementation that is proposed in this PR is the array-based one.

nils-ohlmeier commented 2 years ago

Consider also that libraries such as libsrtp are usually built in one variant for system packaging, and that library may be used by any kind of application, server or client. So I would prefer the code base to remain unified and optimal.

Sure. I'm not arguing against the system packagers choose what ever they think is best for their systems. None of the browsers uses system level libraries though. They all compile their own versions into the browsers. That is why I was asking for a compile time option, and I would have no problem if the new default is to compile with the SSE2 on by default, as long as vendors who can't ship with a dependency on SSE2 support have the ability to opt out of it. Browsers also compile their binaries for the lowest common nominator. I would have to check, but I would not be surprised if that still rules out SSE2 optimizations for them. The only things which works for them is at run time detection of CPU features like SSE2.

Lastique commented 2 years ago

Browsers also compile their binaries for the lowest common nominator.

Chrome and Firefox require SSE2 for many years now (since 2014 and 2016, respectively). As does Windows 8 and later. Really, x86 processors without SSE2 belong to a museum now.

Lastique commented 2 years ago

I should also note that SSE2 is not mandatory with the proposed patch. If you disable SSE2 through the compiler switches (e.g. -mno-sse2 -mno-sse in gcc and clang) or that mode is the default on your platform, the code will compile and will use the scalar implementation.

murillo128 commented 2 years ago

As discussed in this thread, we have extracted the current stream list implementation and create an internal api + default implementation in https://github.com/cisco/libsrtp/pull/612

Lastique commented 1 year ago

Rebased on top of libsrtp 2.5.0.

The implementation replaces the default linked list based implementation. As I noted before, the generic interface is not suitable for efficient implementation of this array-based list. In particular, position and list size need to be exposed to the calling code to avoid double stream lookup in srtp_remove_stream.

This patch does not attempt to fix the generic list interface, as I find this rather pointless. Instead, it optimizes the case when the internal (now - array-based) list implementation is used, which will be the majority of use cases, I think.

If the generic list interface is to be dropped (which, IMHO, is the right thing to do), I can send another patch that removes prev/next pointers from srtp_stream_ctx_t_, as well as reorders its members to remove some of the internal padding, thus reducing the size of the structure.

Lastique commented 1 year ago

BTW, the patch now includes https://github.com/cisco/libsrtp/pull/643 fix.

jmillan commented 8 months ago

main branch does now store streams in an array rather than in a linked list. Hence, I think this PR is as least partially outdated.

paulej commented 8 months ago

I wonder how much faster your code can be using intrinsics functions?

Lastique commented 7 months ago

We're currently using (patched) libsrtp 2.5.0, and I haven't updated the PR to main yet. Not sure when I'll have time to update the PR, most likely when a new libsrtp with the new array-based stream list is released. I'll see if I can do it earlier.

From the brief look at #674, it looks quite similar to the scalar version of the stream list proposed here. The main difference is that this PR stores SSRCs in a separate array, which allows for better cache locality during lookup and, of course, is necessary for vectorized lookup. Another, admittedly minor difference is that this PR uses nomenclature that is standard in C++, i.e. size and capacity (where size means the number of elements in the container and capacity is the size of the allocated storage) instead of size and available (where size is the size of the allocated storage and available is the amount of unused storage). I'd say, the standard way is more optimal since you want the number of elements more often than the number of unused elements. This is especially so with the vectorized implementation in this PR, as the lookup result is an index, which is then compared with the number of elements to test if the stream is found.

I'll update the performance test when I update the PR.

jmillan commented 7 months ago

I'd say, the standard way is more optimal since you want the number of elements more often than the number of unused elements

That makes total sense. I've made the corresponding PR