catid / wirehair

Wirehair : O(N) Fountain Code for Large Data
http://wirehairfec.com
BSD 3-Clause "New" or "Revised" License

API Edge Cases Undocumented #2

Closed TheBlueMatt closed 6 years ago

TheBlueMatt commented 7 years ago

Some fun things callers need to be aware of:

  • You may not call wirehair_read twice with the same packet ID; doing so causes decode failure.
  • Memory lifetimes generally aren't documented. During encode, the memory you pass into wirehair_encode must remain valid until the last time you call wirehair_write (and maybe later? probably not), whereas the same is not true for writing.

In other news, for my application I hardcoded the chunk size, assumed 16/32-byte alignment (depending on where it's used) for the XOR, replaced the different-compile-unit XOR code with simple AVX-/SSE-based intrinsics, and saw something like a 3x speedup :).

catid commented 7 years ago

Nice speedup! I have a newer version of the codec, which needs some more tuning work, over in the wh256 repo; it should be similarly fast. Maybe I can learn something from your memxor() code.


TheBlueMatt commented 7 years ago

Oh, also: if you set ALL_ORIGINAL and receive all blocks from non-FEC chunks, then even though wirehair has copied each chunk into its memory, it will only give you back 0s when you ask for chunk reconstructions.

TheBlueMatt commented 7 years ago

The optimizations in question are at https://github.com/bitcoinfibre/bitcoinfibre/commit/eafd1ca367cec2e6308fb0fb5d584b845fcc873f though note that I really only benchmarked in my benchmark tool, which works on relatively large data (~1MB @ 1024-byte chunks). Sadly, I think the initial setup/EncodeFeed adds too much latency for my application (since I'm bound by, essentially, setup/encode/first packet->setup/decode with low loss latency). Any pointers on reducing the initial setup latency, even if it comes at the cost of a handful of additional packets?

catid commented 7 years ago

One thing you can do is have the setup run in a background thread and, when it's complete, start generating output symbols. Since the first ~1000 symbols are going to be the originals, you have some time to let the CPU catch up.

cm256 is faster for a small number of symbols (up to about 32), but wirehair with the speedups is much faster past that point. More details here: https://github.com/catid/wh256/blob/master/docs/wh256_performance.pdf

TheBlueMatt commented 7 years ago

Sadly for my application the receiving end has the vast majority of source symbols before it even receives the first packet (just needs a few packets of metadata to figure out how to place the data in FEC chunks). The FEC is just to communicate some random missing data, essentially, and since it realistically normally only ever needs a few packets to fill in the gaps the latency-to-first-FEC-packet is critical.

catid commented 7 years ago

Maybe something like this is more what you need?

https://github.com/catid/shorthair

That project could be improved a lot by (1) using cm256 instead of longhair and (2) using an approach like Tetrys instead: https://www.ietf.org/proceedings/86/slides/slides-86-nwcrg-1.pdf

(2) requires a bunch more work that I've started but haven't gotten into a working state yet

TheBlueMatt commented 7 years ago

I looked at shorthair, though it looks like a full network implementation and everything above it... I've already got my own networking layer with lots of random crap that I need for my specific use-case.

I'm not sure that cm256 and sliding window are really ideal...let me describe a bit more:

I'm sending Bitcoin blocks around a global network of servers at incredibly low latency (10-20 milliseconds slower than the RTT/2 between the servers, usually, including through the GFW). So: every 10 minutes I get a new block (1-4MB) which contains almost entirely transactions that both sides have already seen, and I need to get it to the other side fast. I send a small set of data to encode which transactions go where in the ultimate block, send that (1-2 ms, including FEC, though that should be easily optimized by using cm256, etc), and then send FEC data for the whole block. Currently this works OK with some pretty shitty LDPC FEC code, though the code isn't well-optimized and is slightly frightening in its memory usage. I was hoping to switch to wirehair, but it ends up wasting too much time in the initial encode by about 2x (I haven't benchmarked, but I assume it has to do with calculating the GF(2^8) RS code data) to be a strict gain.

catid commented 7 years ago

How big is the data to send? Sounds like less than 1 MB diffs that would encode faster than full data


TheBlueMatt commented 7 years ago

It ranges widely...There's some data at http://bitcoinfibre.org/stats.html but, essentially, 90-something% of the time you get the full block reconstructed with ~40k of data, but a non-trivial amount of time it can take 1MB+ of data before the current LDPC stuff completes a reconstruction.

TheBlueMatt commented 7 years ago

Oh, and I really would prefer not to introduce a knowledge-of-receiver-side-data assumption, because speed-of-light makes that a bitch :(.

catid commented 7 years ago

You could give wh256 a shot. It switches to cm256 for small data, when that is faster. Tuning is not done yet, but I can get that done this week. I could also build you a codec that just does XORs and no GF256 stuff; it would require more overhead but encode faster.



TheBlueMatt commented 7 years ago

Yea, I think I was hoping a no-GF256 codec would be a simple constant-change that I could hack up... happy to get you a beer if it's more involved :).

gmaxwell commented 7 years ago

Hi -- in addition to patching over packet loss, this system ab(uses) the forward error correction to create a diff when you have no idea what the remote side knows -- which we don't, because it's a distributed system with data coming in every which direction from multiple sources... and 100ms+ one-way delays.

Two potential source nodes might even receive the same data at once, and both can usefully contribute to all the receivers recovering as fast as possible (as they can send different ranges of recovery data).

The currently live system uses an LDPC erasure code which decodes exclusively via peeling (and has a fairly low-quality implementation that Matt has somewhat fixed up). It's fast -- but very high overhead (e.g. 20% more packets than an RS code would have needed). I encountered wirehair a while back and suggested Matt give it a shot -- the performance is really impressive overall. I think what is frustrating him now is the time to the first recovery packet (as mentioned, he doesn't send the source packets because they're reconstructed on the far end).

As an aside, if you want some number crunching done for coming up with except-seeds for wh256, I have a good hundred cores sitting idle for the moment and wouldn't mind running something for you. I was suggesting wh256 to Matt last night, with the hope that the faster GF256 parts might help some of his issues. (Prior to that I tried turning the heavy row count down, but got lost trying to figure out why it was crashing.)

catid commented 7 years ago

Two other random thoughts: (1) An MTU size increase would improve latency in the network, if you can push it above 1024 bytes. (2) The decoder recreates the recovery set during decoding, so it would not be necessary to reinitialize the encoder afterwards if the codec could transition into an encoder mode with the same source data.


TheBlueMatt commented 7 years ago

"The decoder recreates the recovery set during decoding" huh? I'm assuming you're referring to the GF2^8 RS stuff? Is that strictly neccessary? Seems like it should be able to get away with only doing that when needed?

catid commented 7 years ago

The recovery set is a set of symbols, slightly larger than the original data, that are XORed together after the encoder is initialized. The decoder recovers these in the case of lost data and then recovers any missing parts.

If a decoder has recovered lost data, then it can be reused for encoding, because the recovery set does not need to be recalculated.


gmaxwell commented 7 years ago

That is useful for the application here (once a participant receives enough to recover, they'll encode and start generating new symbols to help nodes beyond simply forwarding along already received symbols), but unfortunately it's not on the critical path (which it sounds like Matt is saying is time to first encoded symbols on nodes that received the data from outside.)

Pardon my ignorance, but is it the case that some non-trivial number of IDs would happen to reference nothing but original packets? (I'm familiar with the basic relevant FEC tech, but ignorant of your construction except in a vague hand-waving sense :).) If so, extracting and sending a wad of those first might resolve Matt's concern, though I suppose it would seriously degrade the overhead.

catid commented 7 years ago

So I see two ways to make the encoder super-fast without writing a whole other codec:

(1) Add a feature to work in non-systematic mode, where it multiplies by the matrix to generate the recovery set and then you'd send across the network only linear xor combinations of those. Instead of sending the original data you'd send the output of the encoder.

(2) Add a larger feature to work in a mixed mode, where it again multiplies the matrix to generate the recovery set, but then it allows you to send the original data followed by some random combinations of the original data as in the systematic code. It then requires a more complicated decoder to solve the recovery set in two stages.

I've not fully thought through (2), so maybe the decoder would be much slower for 50-50 original data and recovery data? (1) is the safer way forward.

I tried out wh256 for your data sizes and it takes about 1.3-1.6 milliseconds to initialize the encoder. Is that really significant?

TheBlueMatt commented 7 years ago

So I hacked up some code to cache the matrix generation, which seems to be a nontrivial improvement in synthetic and real-world benchmarks, though it barely shows up in callgrind at all (and I haven't hooked a real profiler up to it yet). See the code at https://github.com/bitcoinfibre/bitcoinfibre/commit/d32a9a2379444d4735b089a206ef9fc2cd7d06d9, though I'm pretty sure the copy stuff is busted since I didn't bother to audit anything.

I also swapped out for wh256 and it seemed to improve a bit more (getting within spitting distance of the naive LDPC stuff I had previously -- so probably worth swapping out at this point). Do you have the code to generate seeds handy? I could spin it up on some beefy servers.

catid commented 7 years ago

Would you guys mind also evaluating https://github.com/catid/siamese ?

gmaxwell commented 7 years ago

Neat! For Fibre the targeting for few recovery symbols will likely get in the way, but we have some related things where the ability to have a rolling window of changing data would be very useful. (... once I figure out how that actually works with Siamese :) ).

catid commented 7 years ago

Oh, I thought the files were like 1 MB, so 1000 pieces, and the 256 limit on output would be 25% -- that should be higher than the loss rate, right?

catid commented 7 years ago

Actually for that use case it would be nicer if Siamese had an API that supported fixed-sized input blocks that are maintained outside of the library to avoid some copies. It's probably still within 20% of current performance though

TheBlueMatt commented 7 years ago

Well FIBRE has a few strange properties - because we (usually) have almost the entire 1MB of data at the other end before we start sending, we really want to get the data in just a handful of packets. However, because it occasionally has much lower hitrate, we also need to be able to get the data at the other end after having just sent a bunch of recovery symbols. I suppose I could do some hack where it first sends 256 recovery symbols and then starts sending data packets, but then you'd waste a bunch of packets on data that is useless to the other side :/.

catid commented 7 years ago

If I increased the number of recovery packets to some larger number, would that work for you? It's somewhat arbitrary.

gmaxwell commented 7 years ago

For FIBRE we really want recovery packets to be at least equal to the input packet count, twice the input size is even better.

Here is a brief and somewhat simplified outline of what we do w/ FIBRE today:

There is a small collection of nodes spread around the world.

Our task will be to get a block from a random source to all the targets with as little latency as possible. The nodes have gigabit+ connectivity, and the distance between them is such that a single round trip anywhere will dwarf the transmission time.

Each has a many-megabyte queue of transactions that will potentially end up in blocks, learned via a gossip protocol and prioritized based on their odds of showing up soon. These queues are roughly but not precisely consistent among the participating nodes.

At one of these nodes, a block shows up (a Poisson process with an expectation of ~10 minutes). Most of the time, 99.9% of the transactions in it will already be known. The block is about 1MB in size and has about 3000 txn in it.

The receiving node makes a sketch of the block -- a collection of short hashes and sizes for each of the transactions; this is on the order of 15 kbytes in size. It's streamed out to peers over UDP with a bit of FEC.

The targets get this data and use the hashes and sizes to mock up the block in memory. There is a size-based permutation applied to the transaction order that also adds a bit of padding, so that when the block gets packetized, missing transactions will cause fewer packet erasures.

The source performs the same permutation+packetization process and runs a FEC encoder on it (currently wirehair, originally a more boring LDPC code). It then starts streaming out recovery packets, interleaving between peers (you can think of it as sending half east around the world, half west around the world).

There is no congestion control, rather each node has a preconfigured transmission rate that is adjusted to send as fast as possible without causing queuing delays/losses.

The targets feed their locally generated packets (from their queues + sketches) into the FEC decoder.

When a peer receives a recovery packet it may pass it on to another peer. (There is a precomputed forwarding topology that minimizes latencies and avoids peers getting duplicates of the same data-- it's not sufficient to simply have the source send directly, getting the latency minimizing path can require passing through other nodes).

Once a target has enough data, it reconstructs the whole block. It sends a packet to each of its peers telling them to not bother transmitting any more about the block to it. It will then FEC encode it and begin sending out novel chunks which it never received to whatever peers haven't told it that they're done. (this last point isn't that important except when there are outages, though it's easy to do with LDPC-like codes)

Sometimes a block will show up which isn't very consistent with nodes queues and may have a much smaller hitrate. (e.g. a few percent) or nodes might have been recently restarted and not have a useful queue yet. In those cases, their reconstruction will take more data, but we still must avoid a round-trip.

Also, a block might show up at multiple sources about the same time-- so it is useful if both can operate independently sending non-overlapping packet sets, and both contribute usefully to the reconstruction... rather than having a bunch of duplicates show up due to sources which can't know about each other's transmissions (outside each other's light cone).

As you can see in this model the source never sends any of the input packets, because any of them potentially duplicate what the target already knows.

Often recovery only needs a couple packets past the sketch... maybe two dozen (I asked Matt to get a histogram of that). But sometimes the hitrate is low and it needs an amount equal to the input. Because the source sends packets in multiple directions around the world, it's also good if the recovery count is 2x what's needed, so non-duplicates can be sent in different directions.

The source can more or less detect in advance of sending if the hitrate is going to be very low: because the block will also not coincide well with the source's queue. We could potentially switch coding schemes based on that. We just can't wait for replies from the targets e.g. to tell us what they were missing so that we can just send original data-- because the whole process should be done before that reply would be received.

I think it would be fine for us if the encoder could produce more recovery packets but decode was somewhat slower/less efficient when many were needed... otherwise we could potentially switch from WH to Siamese based on the expected queue hitrate, and if Siamese is used, after sending all the recovery packets, start streaming out the input until we're told to stop. Whether that would be attractive depends on the performance. Right now, WH encode+decode performance is a major factor in the controllable part of our total latency (which 95% of the time is less than 16ms over the one-packet latency).

catid commented 7 years ago

I appreciate the extra detail; let me think about other methods that may help.

Okay I'll put together a specialized block codec that can produce more output symbols (with higher perf) so you can evaluate that as an alternative.

TheBlueMatt commented 7 years ago

Ideally, of course, the decode would be iterative on packets, to avoid taking too much CPU time if we do need a lot of packets. Even at 1Gbps, packets aren't coming in that fast, so you've got a bit of time to process things, and I've already got infrastructure to fill the block with transactions on a background thread (which used to be an expensive process with lots of mallocs, but is now largely free) which could easily be converted into a "do a step of something useful in the FEC decoder" thread.

catid commented 7 years ago

Please evaluate this library for your purposes: https://github.com/catid/fecal

TheBlueMatt commented 7 years ago

Hmm, I hacked a naive version into my FEC wrapper and it turned out quite a bit slower (2x on my few-missing-chunks test, which has around 14 missing chunks out of 880).

You can see the FEC wrapper changes at https://github.com/bitcoinfibre/bitcoinfibre/commit/6f698938516e44f68462e0c82c37dec51cc06aed#diff-208ddd15cda5e81d824f18d7daffdde1; it has some extra memcpys that add a bit of overhead, but at least callgrind's cycle estimation thinks fecal_decode is where nearly all the time ended up.

I'm happy to provide callgrind outputs for your analysis (or you can run the bench_bitcoin binary in that repo using ./autogen.sh && CXXFLAGS="-march=native -mtune=native -O2 -g" ./configure --disable-wallet && make && ./src/bench/bench_bitcoin)

catid commented 7 years ago

Disappointment =(

catid commented 7 years ago

Well, I'll get back to you guys next time I have some new software to share. Feel free to reach out to me at mrcatid@gmail.com for more support.

catid commented 7 years ago

If you're interested in evaluating another codec, I just put up this one:

https://github.com/catid/leopard

It's limited in that the recovery data cannot be streamed out like a fountain code. Also the recovery set cannot be larger than the original data size. Hope it's useful.

TheBlueMatt commented 7 years ago

Damn, super close, but didn't beat wh256 :(. Indeed, the lack of streaming output/incremental decode really hurts this one for FIBRE. In my synthetic benchmark on my Haswell-EP workstation, wh256 has ~1.5ms of setup time, then streams out enough chunks to get a decode super fast, with the decode run taking around 1.5-1.7ms ignoring encode time. leopard is closer to 1.5ms on the encode, 4ms on the decode. It got a bit faster after I hardcoded the block size like I do in wh256, but only by a ms or 2 in total, which wasn't quite enough to get over the hump.

(So, in conclusion, one thing to consider for future FECs is an option to hardcode the FEC chunk block size: in FIBRE I use fixed-size UDP packets, which makes tons of things lots simpler, and hardcoding the block size all the way down to the XOR functions can be a decent gain.)

catid commented 7 years ago

Haha damn. Okay I will keep trying

The encoder and decoder should get about 50% faster as I finish the DIT FFT optimization for the 16-bit field, and I'm thinking about 12 bits for speed for apps like yours that operate around the 1-4 MB file size.


TheBlueMatt commented 7 years ago

Oh wow, alright, that might just do it to beat wh256. I can try to PR a decent version of the optionally-hardcode-fec-size, if you'd like.

catid commented 7 years ago

Cool will let ya know when it is ready to test


catid commented 7 years ago

Okay I just finished the biggest optimizations for 16-bit finite fields. Should be about 50% faster to encode and 50% faster to decode.


TheBlueMatt commented 7 years ago

Nice: for a single test block (~800 1152-byte chunks) it went from ~1.3ms on the encode to ~1.05ms. On the decode side, ~4.5ms to ~3.6ms. Still, compared to wh256's ~1.2ms to start encoding, ~1.56ms to do the full encode, and ~1.1ms to do a decode, wh256 still wins by a healthy margin. I'd be curious to see the 12-bit field.

(fecal numbers, since I went and did things a bit more formally and went back and benchmarked it: ~0.2ms to start streaming, ~0.5ms to do a full encode, ~1ms to decode in my low-loss bench, which is realistic for some reasonably-high percentage of cases, but the ~6ms to decode in my medium-loss bench makes the lack of consistency hard to swallow :( ).

catid commented 7 years ago

If you allow for multithreading it can decode quite a lot faster. I am still working on gaining another 15% or so performance by moving copies/xors into the inner loops. Then MT after that's done, which could cut the decode time in half or more...

12-bit fields are going to be a little faster in that there would be less register pressure in the inner loops. The FFT_m, IFFT_m, and FormalDerivative_m steps would not be working on smaller data sizes, though. Also, instead of a simple "multiple of 64 bytes" requirement it would be some weird number like 96. I'm more interested in doing an XOR-only version for platforms that do not have PSHUFB instructions, like quadcopter processors.

catid commented 7 years ago

So Sunday and tonight I managed to implement a worker thread pool and got the work spread across the threads somewhat tuned. In practice it's not quite a 2x performance boost to add multi-threading:

Parameters: [original count=1000] [recovery count=1000] [buffer bytes=1344] [loss count=1000] [random seed=2] (multi-threading OFF)
Leopard Encoder(1.344 MB in 1000 pieces, 1000 losses): Input=1745.45 MB/s, Output=1745.45 MB/s
Leopard Decoder(1.344 MB in 1000 pieces, 1000 losses): Input=491.947 MB/s, Output=491.947 MB/s

Parameters: [original count=1000] [recovery count=1000] [buffer bytes=1344] [loss count=1000] [random seed=2] (multi-threading ON)
Leopard Encoder(1.344 MB in 1000 pieces, 1000 losses): Input=1669.57 MB/s, Output=1669.57 MB/s
Leopard Decoder(1.344 MB in 1000 pieces, 1000 losses): Input=691.358 MB/s, Output=691.358 MB/s

I'm not 100% sure that I'm spreading the work across the workers in the most optimal way, but it looks like about a 40% performance improvement for the case you care about. I don't think this is enough to make you prefer Leopard over Wirehair. Also, my thread pool only works on Windows.

Bulat-Ziganshin commented 7 years ago

In practice it's not quite a 2x performance boost to add multi-threading:

Your CPU runs at 4 GHz in single-threaded mode and 2.8 GHz in multi-threaded mode.

catid commented 7 years ago

Switched from my own threadpool to OpenMP and tuned more:

Leopard Encoder(1.344 MB in 1000 pieces, 1000 losses): Input=3162.35 MB/s, Output=3162.35 MB/s
Leopard Decoder(1.344 MB in 1000 pieces, 1000 losses): Input=714.514 MB/s, Output=714.514 MB/s

Maybe this will be good enough?

catid commented 6 years ago

New version of the codec is up

TheBlueMatt commented 6 years ago

Hey! Sorry, somehow missed the first message and just got around to testing the new wirehair updates. Even with the OpenMP threadpool, the leopard updates weren't a winner. They still did better than wirehair/wh256 on most/all-chunks-missing benchmarks, but on tests where most chunks were locally available before sending starts, they failed to quite reach the wh256/wirehair speed (but...)

The wirehair updates did, indeed, provide a pretty healthy boost, though I went ahead and implemented the behavior of wh256 where it uses cm256 if the chunk count is below some threshold, which appeared to help my small-chunk-count reconstruction runs (recall, my use-case is two-pass: the first is around 5-10 chunks of 1152 bytes and never has any pre-knowledge; the second is usually around 800 chunks of 1152 bytes, but usually upwards of 780 chunks are available on the receiving end before any packets are even sent, as they can be calculated from the first pass). I did not, however, test backporting this behavior to leopard, but it looked like leopard lost to wirehair normally anyway, so I'd guess it won't save it.

catid commented 6 years ago

Cool glad to know that helped


catid commented 5 years ago

Not sure if it's interesting to you, but I recently uploaded a new project here: tonk.io

TheBlueMatt commented 5 years ago

Hmm, sadly the stuff I've got is all super-context-specific so a more general non-FEC-only framework probably doesn't apply. Thanks for the update, though!