animetosho / node-yencode

SIMD accelerated yEnc encoder/decoder and CRC32 calculator for node.js

yEnc SSE decode ideas #4

Closed Safihre closed 7 years ago

Safihre commented 7 years ago

I was amazed to find this repo, I have been thinking of some way to do yEnc-decoding (as a Python-C-extension) using SSE instructions but my knowledge of C is just too rudimentary for now.

Do you think SSE can help compared to regular char-by-char decoding of the yEnc body? How would you go about the decoding-escaping problem? I can imagine finding the escape chars, but how do you remove them later on when building the output string? I tried to grasp your encoding code, but I'm probably missing the main idea due to the included edge cases and optimizations.

Thanks!

EDIT: I think I am getting more and more of the code and how you handle the encoding-escaping here: https://github.com/animetosho/node-yencode/blob/master/yencode.cc#L718-L752 I don't completely understand the shuffle operations just yet and how they handle the extra chars. What are shufMixLUT and shufLUT?

animetosho commented 7 years ago

SIMD isn't really designed for text processing tasks like yEnc, so an implementation isn't so straightforward unfortunately. Because of this, a scalar (byte by byte) implementation usually isn't so bad, but doing a SIMD implementation is intellectually interesting and works for the epeen factor (as for Nyuu, I intend to integrate PAR2 generation, so reducing CPU usage may help a bit on fast connections).

Having said that, a SIMD implementation of yEnc encoding can usually beat the naive implementation by several factors, due to corner cases that an encoder has to deal with.

I've been thinking about writing up how the algorithm works. Unfortunately the code isn't the easiest to read (I'm a little lazy there, admittedly), but here's a rough diagram showing an overview of the main loop:

[diagram: overview of the encoder's main loop]

Note that the byte shuffle instruction requires a CPU with SSSE3 support (which most people should have these days). If the CPU doesn't have SSSE3 support, the algorithm just switches to doing the escaping using standard non-SIMD methods if special characters are found (the 16-bit mask != 0). Despite this, it's still a bit faster than a pure scalar implementation because most bytes aren't escaped in yEnc, and hence end up hitting the fast route: load 16 bytes, +42 to all bytes, check for special characters, store to memory.
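To make that fast route concrete, here's a minimal sketch of the encoder fast path using SSE2 intrinsics. This is illustrative only - the real code also tracks line position and handles the escape expansion when the mask is non-zero:

#include <emmintrin.h> // SSE2

// Returns 1 if the 16-byte block needed no escaping and was stored to dest.
static int encode_block_fast(const unsigned char* src, unsigned char* dest) {
    __m128i data = _mm_loadu_si128((const __m128i*)src); // load 16 bytes
    data = _mm_add_epi8(data, _mm_set1_epi8(42));        // +42 to all bytes
    // after the +42, the characters yEnc must escape are NUL, \r, \n and =
    __m128i cmp = _mm_or_si128(
        _mm_or_si128(_mm_cmpeq_epi8(data, _mm_setzero_si128()),
                     _mm_cmpeq_epi8(data, _mm_set1_epi8('\r'))),
        _mm_or_si128(_mm_cmpeq_epi8(data, _mm_set1_epi8('\n')),
                     _mm_cmpeq_epi8(data, _mm_set1_epi8('='))));
    unsigned mask = (unsigned)_mm_movemask_epi8(cmp);    // the 16-bit mask
    if(mask == 0) {
        _mm_storeu_si128((__m128i*)dest, data);          // store as-is
        return 1;
    }
    return 0; // caller handles escaping (shuffle expansion or scalar path)
}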

Line endings are a bit complicated, since they have different rules to the rest of the line (more characters to escape). When the algorithm crosses a line boundary, it needs to 'revert' (back-track) until the last character of the line, process the line boundary, then go back to the main loop. A large portion of the code is there just to deal with this case.

Generating vectors for the shuffle and add operation isn't exactly fast, so we pre-generate these instead and just do a lookup when encoding data. shufLUT is a 256 x 16-byte table for the shuffle vectors, whilst shufMixLUT is the corresponding lookup table for add vectors.
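To illustrate the idea (a sketch of the concept, not the repo's exact layout): each 16-byte block can be processed as two 8-byte halves, so an 8-bit escape mask indexes a 256-entry table of pshufb controls that expand the half, leaving a zeroed slot before each escaped byte. The corresponding shufMixLUT entry would then hold '=' at those zeroed slots and +64 at the byte following each one.

#include <stdint.h>

static uint8_t shufLUT[256][16];

static void build_shuf_lut(void) {
    for(int mask = 0; mask < 256; mask++) {
        int pos = 0;
        for(int bit = 0; bit < 8; bit++) {
            if(mask & (1 << bit))
                shufLUT[mask][pos++] = 0xFF; // pshufb writes 0 here; the add vector turns it into '='
            shufLUT[mask][pos++] = (uint8_t)bit; // copy source byte `bit` into this slot
        }
        while(pos < 16)
            shufLUT[mask][pos++] = 0xFF; // unused tail
    }
}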

Hope that all makes sense.


As for decoding, I've actually thought about it, and have tried a few things. A naive implementation is surprisingly fast, since decoding is somewhat simpler than encoding, though you could still probably beat it by a few factors with SIMD.

Here's my scalar implementation:

static size_t do_decode_scalar(const unsigned char* src, unsigned char* dest, size_t len) {
    unsigned char *p = dest; // destination pointer
    size_t i; // input position
    unsigned char c; // input character

    for(i = 0; i + 1 < len; i++) { // i+1 avoids size_t underflow when len == 0
        c = src[i];
        switch(c) {
            case '\n': case '\r': continue;
            case '=':
                i++;
                c = src[i] - 64;
                // fall through: the unescaped byte still gets the -42 below
        }
        *p++ = c - 42;
    }
    // do final char; we process this separately to prevent an overflow if the final char is '='
    if(i < len) {
        c = src[i];
        if(c != '\n' && c != '\r' && c != '=') {
            *p++ = c - 42;
        }
    }

    return p - dest;
}

I see that in your implementation you handle two dots after a newline, which I don't. yEnc doesn't actually allow that to happen (the first dot must be escaped), but I suppose a valid decoder should still deal with it.

A general workflow for a SIMD version may be (a rough code sketch follows the list):

  1. load 16 bytes
  2. check for special characters, = \r \n similar to the encoding algorithm
  3. subtract 42 from all bytes
  4. if there's no special characters, simply store the vector to memory
  5. otherwise take all matches to =, shift one byte across and put the number 64 in each place. All other elements of this vector should be 0
  6. perform a vector subtract (data - vector_partially_filled_with_64)
  7. use shuffles to remove occurrences of = \r \n (similar idea to the encoding algorithm, just instead of expanding data, it compacts it)
  8. store to memory
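A rough sketch of steps 1-6 with SSE2 intrinsics (only the = handling is shown; the \r \n removal and the compaction shuffle of step 7, as well as an = sitting at a block boundary, are omitted; this is not the repo's exact code):

#include <emmintrin.h> // SSE2

static void decode_block(const unsigned char* src, unsigned char* dest) {
    __m128i data = _mm_loadu_si128((const __m128i*)src);     // 1. load 16 bytes
    __m128i eq   = _mm_cmpeq_epi8(data, _mm_set1_epi8('=')); // 2. find = chars
    // 5. shift the = matches one byte across, with 64 in each shifted place
    __m128i sub64 = _mm_slli_si128(_mm_and_si128(eq, _mm_set1_epi8(64)), 1);
    data = _mm_sub_epi8(data, _mm_set1_epi8(42));            // 3. subtract 42
    data = _mm_sub_epi8(data, sub64);                        // 6. extra -64 after each =
    _mm_storeu_si128((__m128i*)dest, data);                  // 8. store
}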

I actually have a somewhat working implementation of the above.
Against output from a valid yEnc encoder this works; however, if you want to be fully spec compliant, you'll need to handle invalid yEnc representations, such as a sequence like ====. A spec compliant decoder must treat that as two characters, but the above algorithm won't, even though no good yEnc encoder should ever create such a sequence. There's also the \n.. sequence mentioned above.

I haven't yet thought of a good way to deal with such invalid sequences, but it may be doable. A compromise may be to detect such cases instead (rather than actually handle them), and if encountered, fall back to scalar processing. As I expect most yEnc implementations to be reasonably valid, it's unlikely the slower scalar fallback will ever be invoked, but it remains there for spec compliance.

Hope that's useful. I may eventually add decoding support to this library. Welcome any ideas.
Otherwise, if you wish to implement it yourself, hope that this gives you some ideas on how.

Safihre commented 7 years ago

That's a great overview, thank you! 👍

What would you do with the case where the = is the last character of the 16 byte block? So the character-to-unescape is actually in the next 16 byte block.

The double-dot thing is unrelated to yEnc but inherent to NNTP; the servers will add this extra dot when transmitting the message. While I initially ignored it as an edge case, it actually happens quite often (almost once per file).

I was experimenting a bit with implementing this yesterday, but I am sort of stuck in the "finally understanding what pointers do and how to use them" stage... so this stuff is a bit above my head. You say you already have a somewhat working implementation of it? That would make me and the SABnzbd users very happy 😃

animetosho commented 7 years ago

What would you do with the case where the = is the last character of the 16 byte block? So the character-to-unescape is actually in the next 16 byte block.

That's not too hard to deal with. Once you have the bit mask indicating the location of = characters:

escape_first_char = mask >> 15;

// on next loop
data = read_next_16_bytes();
// check for special characters
if(escape_first_char)
  data = vector_subtract(data, make_vector(42, 42, 42... 42+64)); // note, first byte is at the end
else
  data = vector_subtract(data, make_vector(42, 42, 42... 42));

// ...or, you could optimize the above four lines with a small LUT based on escape_first_char
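That LUT could look something like this (a sketch; note that _mm_set_epi8 lists lanes from highest to lowest, so the block's first byte is the last argument):

#include <emmintrin.h>

static __m128i sub_lut[2];

static void init_sub_lut(void) {
    sub_lut[0] = _mm_set1_epi8(42);
    sub_lut[1] = _mm_set_epi8(42, 42, 42, 42, 42, 42, 42, 42,
                              42, 42, 42, 42, 42, 42, 42, 42+64);
}

// in the decode loop:
//   data = _mm_sub_epi8(data, sub_lut[escape_first_char]);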

While I initially ignored it as an edge case, it actually happens quite often (once per file almost).

That's interesting, because a correctly yEnc encoded message should never have a dot as the first character of a line.
Perhaps these messages were encoded incorrectly?

but I am sort of stuck in the "finally understanding what pointers do and how to use them"-stage

I'm sure you'll get it. It's likely a bit different to what you're used to, but you seem to be a very experienced programmer, so it won't take you long to understand.
Programming in SIMD may be more challenging as it requires you to think a bit differently to how you'd usually write code, and stuff like yEnc can require a little creativity. But it boils down to yet another programming problem; use it enough and you'll be fine too.

Safihre commented 7 years ago

Perhaps these messages were encoded incorrectly?

No, it's not yEnc et al. It's the NNTP server that has to add a dot to every line that starts with a dot, per the NNTP spec, during transmission - a leftover from ancient times I think.

animetosho commented 7 years ago

It's the NNTP server that has to add a dot to every line that starts with a dot

But there should never be a line that starts with a dot, no?

Safihre commented 7 years ago

Why not? It's not a special character in the already-encoded data, just a char from which 42 can be subtracted.

animetosho commented 7 years ago

Looks like I made a mistake - I had thought that yEnc requires dots to be escaped if they're the first character in a line, but re-reading the spec, it's actually optional. Sorry about that.
So it looks like that will need to be handled by most decoders.

animetosho commented 7 years ago

So I've been thinking about this, and I think that dealing with invalid sequences can be solved. But the double-dot issue looks challenging.

Looking at RFC3977:

When a multi-line block is interpreted, the "dot-stuffing" MUST be undone; i.e., the recipient MUST ensure that, in any line beginning with the termination octet followed by octets other than a CRLF pair, that initial termination octet is disregarded

This seems to define how invalid sequences such as \n.a are handled - the dot at the start of the line needs to be discarded, regardless of what comes after it. I can't tell if the document covers how sequences like a\n.. should be handled, other than saying they're invalid. Presumably, from the language, a line must be terminated by \r\n, hence no special treatment should be given to a\n..; however, \r\n.. should have its first dot removed.
In other words, dots should be removed if preceded by \r\n (and not a valid = before that), or the dot is the first character in the article/yEnc body (since it's implicitly at the beginning of a line).
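In scalar code, that rule might look something like this (a sketch under the interpretation above, not the library's implementation):

#include <stddef.h>

static size_t unstuff_dots(const unsigned char* src, unsigned char* dest, size_t len) {
    size_t out = 0;
    for(size_t i = 0; i < len; i++) {
        int at_line_start = (i == 0) ||
            (i >= 2 && src[i-1] == '\n' && src[i-2] == '\r');
        if(src[i] == '.' && at_line_start)
            continue; // dot-stuffed: disregard the initial dot
        dest[out++] = src[i];
    }
    return out;
}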

I had a quick look at NZBGet's code and it doesn't appear to handle lines starting with a dot. Presumably SABnzbd didn't do this for a long time either? All yEnc decoders I've seen, apart from SABYenc, don't handle this case (which isn't an issue if they go through some NNTP layer, but I have doubts that typically happens).
So is it really the case that most downloaders don't handle this, or am I missing something?

Interestingly, I haven't yet seen a yEnc encoder that doesn't escape dots which start a line. But there are obviously posts out there that have them.

I'll probably take the approach of stripping out all instances of \r\n., even though this is likely different to all existing implementations. It also seems difficult to do from a SIMD perspective.

This does also make incremental processing a little more difficult, as it may need to know the preceding two characters to be correct.

Safihre commented 7 years ago

NZBGet does it while receiving the data in the NNTP layer: https://github.com/nzbget/nzbget/blob/develop/daemon/nntp/ArticleDownloader.cpp#L388-L392 NZBGet can have a worker thread per connection that decodes the stream as it comes in and does the check then; however, due to the Python GIL, the overhead of thread switching in Python is terrible and we cannot do this. Maybe when we switch to Python 3 and can use async/greenlet/etc. approaches, but switching to Python 3 is still a long road for us.

I just checked some test articles, and it seems that usually less than 1% of all 16-byte blocks contain a dot. So as a first 'solution' I was thinking of checking whether a dot is present in the 16-byte block, and if so, reverting to char-by-char processing for that whole block and also the blocks before/after it (in case the sequence straddles the start or end of the current block). It's a bit convoluted, but it would still allow the vast majority to be done via SIMD. I tried to implement something of the sort, but so far without much luck.

EDIT: All usenet downloaders must correct for this; 60% of the posts I tried have \r\n.. at least once.

animetosho commented 7 years ago

I see, thanks for pointing that out!

I think a SIMD version should be fast enough to make threading not that necessary. I mean, this yEnc library is single threaded and synchronous, as I feel it's fast enough to not really warrant the need (I'll probably eventually add async support though).

I've committed a very rough implementation. From my quick testing, it seems to work and handles all edge cases I've tried so far. It still needs work though: more testing, tidying up, optimizations etc. I've just pushed it up in case you're interested. Unfortunately the algorithm is a bit convoluted, but maybe it can be streamlined.

The algorithm supports incremental processing by supporting 4 possible previous states (the preceding character is a =, \r, \r\n or something else).
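For illustration, those states might be represented like this (the names are made up; see the repo for the actual definitions):

typedef enum {
    STATE_CRLF, // previous characters were \r\n (at a line start)
    STATE_EQ,   // previous character was = (next byte is escaped)
    STATE_CR,   // previous character was \r
    STATE_NONE  // anything else
} decoder_prev_state;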

Quick speed test of the current code on a 3.8GHz Haswell CPU, using random input data:
Scalar version: ~400MB/s
SIMD version: ~1700MB/s

Once this is tidied up, maybe I can write up something to help others port the code across (though it's mostly copying the functions across, a few defines, and the lookup-table generation code).

Safihre commented 7 years ago

Coool! 💯 I will need to take a closer look and dissect it a bit so I understand it well!

sanderjo commented 7 years ago

Wow, very impressive & cool.

I wanted to ask a noob question: is _mm_movemask_epi8() generic Intel SIMD / SSE, or is it Microsoft specific (because most Google hits are to microsoft.com)? ... But I can answer that question myself:

$ cat /usr/lib/gcc/x86_64-linux-gnu/6/include/emmintrin.h | grep _mm_or_si128
_mm_or_si128 (__m128i __A, __m128i __B)

So: also on Linux, so generic Intel SIMD/SSE. :+1:

animetosho commented 7 years ago

The Intel Intrinsics Guide (which obviously lacks AMD extensions) is a handy reference. The movemask instruction is one useful instruction SSE has over ARM NEON.

I just realized a mistake: I've been treating =\r\n as a non-newline sequence, but thinking about it, this is incorrect, because the NNTP layer doesn't care about yEnc's escaping semantics. So a sequence like =\r\n. should have its dot removed (NNTP protocol), and then yEnc kicks in and unescapes the \r, so it should be converted to \xA3 rather than \xA3. (with the dot kept). I'll need to fix that up.

hugbug commented 7 years ago

Super interesting topic 👍

I've tried the SSE decoder but for me it produces incorrect output. Am I doing something wrong, or is there a bug in the code?

I've made a test program to illustrate the issue:

curl https://gist.githubusercontent.com/hugbug/fd7d95d53e3a2ca4aafb0f811d929bfc/raw/457d22331669372bd6c92bab812730e2d18338be/decoder_test.cpp > decoder_test.cpp

g++ -std=c++11 decoder_test.cpp

./a.out

Testing scalar

Test 1: OK

Test 2: OK

Testing sse

Test 1: OK

Test 2: FAILURE
Source:
0f 1a b1 a4 0c d4 15 2a 61 47 c8 bf bf e4 d3 e9 e2 b2 1f e0 99 1d 79 9a 38 26 c0 8b 3d 40 42 cf c6 f9 85 34 8d 9c f2 55 ce 16 ec 4d 38 29 3d 4d 22 d8 bb cc ce 2a 91 c9 93 87 6f 0f fb 5b 2c d7 90 3c 22 4a ac a7 1a 57 1a bb 6b 64 23 e0 87 8f b2 3d 7d 94 30 c5 eb 2f cb e5 78 35 8e bc d0 0b 57 15 58 69 e3 9d fc f3 da 6b c1 07 3d 4d d2 6a 60 6f 43 a4 3d 4a 81 dc b7 ca 04 8a c1 f6 8d b8 
Expected:
e5 f0 87 7a e2 aa eb 00 37 1d 9e 95 95 ba a9 bf b8 88 f5 b6 6f f3 4f 70 0e fc 96 61 d6 18 a5 9c cf 5b 0a 63 72 c8 2b a4 ec c2 23 0e ff e3 f8 ae 91 a2 a4 00 67 9f 69 5d 45 e5 d1 31 02 ad 66 12 f8 20 82 7d f0 2d f0 91 41 3a f9 b6 5d 65 88 13 6a 06 9b c1 05 a1 bb 4e 0b 64 92 a6 e1 2d eb 2e 3f b9 73 d2 c9 b0 41 97 dd e3 a8 40 36 45 19 7a e0 57 b2 8d a0 da 60 97 cc 63 8e 
Result:
e5 f0 87 7a e2 aa eb 00 37 1d 9e 95 95 ba a9 bf b8 b8 b8 b8 b8 b8 b8 b8 0e 0e 0e 0e 0e 0e 0e 9c 9c 9c 9c 9c 9c 9c 9c a4 a4 a4 a4 a4 a4 a4 f8 ae 91 a2 a4 00 67 9f 69 5d 45 e5 d1 31 02 ad 66 12 f8 20 82 7d f0 2d f0 91 41 3a f9 b6 5d 65 88 88 88 88 88 88 88 a1 a1 a1 a1 a1 a1 a1 a1 2d 2d 2d 2d 2d 2d 2d 2d b0 b0 b0 b0 b0 b0 b0 36 45 19 7a e0 57 b2 8d a0 da 60 97 cc 63 8e 

NOTE: My system doesn't have __POPCNT__.

Another issue is that the code fails to compile if __SSSE3__ is undefined.

animetosho commented 7 years ago

Thanks for the comment! Yeah, the SSE2 path was broken. I've pushed up my changes that I have been working on, which should have that fixed.

As for your test failures, the SSE code requires some lookup tables to be generated before running the main function. In the old code this was done in the node initialization function; in the new code there's a decoder_init function which does that. (Alternatively, you could hard-code the lookup tables; I just prefer dynamic generation because it's easier to maintain.)

I've modified your program to use the new code. You'll need to copy common.h, decoder.cc and decoder.h over from the src directory as well. Compiles via g++ decoder_test.cpp decoder.cc -O2 -march=native

sanderjo commented 7 years ago

Cool, that works for me:

git clone https://github.com/animetosho/node-yencode.git
cd node-yencode/
git pull
cd src
curl https://gist.githubusercontent.com/animetosho/5012484c08204bd744be475e78131247/raw/af29e30a3f243f8ce46621322903b69aa6d2f7ad/decoder_test.cpp > decoder_test.cpp
g++ decoder_test.cpp decoder.cc -O2 -march=native -o decoder_test
./decoder_test 

with result:

$ ./decoder_test 
Testing scalar

Test 1: OK

Test 2: OK

Testing sse

Test 1: OK

Test 2: OK
hugbug commented 7 years ago

I've tested the SSE decoder in NZBGet.

Conditions

Test case 1: decoding off

Article decoding was disabled via the compiler define SKIP_ARTICLE_DECODING. This test case represents the ideal goal: it shows how fast the program could download if yEnc decoding took no time at all.

Test case 2: classic decoder

That test case shows how fast the program can download using the current yEnc decoder.

Test case 3: SSE decoder

Now we use the SSE decoder, with SSSE3 but without __POPCNT__.

Test case 4: improved classic decoder

During the experiments I realised that I can improve the current (classic) decoder. It checks for \n and \r although these characters can never occur in the input, because NZBGet processes downloaded data line by line. Therefore I modified the decoder by removing the checks, which allowed transforming the switch statement into a single if.

Test case 5: CRC check off

This test is for reference: both decoding and CRC check disabled.

Test results

The numbers are download speed (MB/s) and download time (s). The download speed is calculated from the download time in seconds, which is an integer value; fractions of seconds unfortunately cannot be measured here.

Test case | Intel 10GB file | ARM 2GB file
--- | --- | ---
decoding off | 306MB/s (34s) | 69MB/s (30s)
classic decoder | 193MB/s (54s) | 59MB/s (35s)
SSE decoder | 281MB/s (37s) | -
improved classic | 260MB/s (40s) | 65MB/s (32s)
decoding off, crc check off | 335MB/s (31s) | 74MB/s (28s)
hugbug commented 7 years ago

Update: I've checked the CPU capabilities using an Intel tool and it showed that the CPU supports POPCNT. After recompiling with "-march=native" the define __POPCNT__ was active. Interestingly, the speed hasn't improved; in fact, after multiple test runs it seems the version without POPCNT performs even a little bit faster.

As a side note: Now, knowing the CPU also supports PCLMULQDQ I can try the fast CRC routine, which didn't compile before.

Safihre commented 7 years ago

@hugbug Did you use @animetosho's version or a modified one for SSE? Since in NZBGet you don't have to deal with \r\n or the \r\n.. problem in the decoder, could that part and those checks of the code be left out?

hugbug commented 7 years ago

The SSE decoder can be parametrized to work in raw mode (where it processes dots) or in clean mode. I used the latter. It still filters out \r and \n, but that happens at the same time as processing = and doesn't add much overhead. I've tried removing that part but it didn't change the result.

I think the SSE decoder can't show its best due to line-by-line processing in NZBGet. With a typical line length of 128 bytes, that's the portion the decoder deals with. The lookup tables are probably never in the CPU cache.

Although I must say I also tested with a post encoded with 2048 bytes per line. All decoders speed up considerably but the difference remains similar.

I suppose the SSE decoder can perform much better on larger inputs. I guess in SAB you can feed it the whole article body (~500KB) and it should show its strengths.

My aim was to replace the decoder without much rework in NZBGet, and I wanted to share the results.

Also take into account that I tested on one CPU only. The results may be different on other (newer) CPUs.

animetosho commented 7 years ago

I'm guessing your x86 CPU there is a Core i5 520M. I can't tell what the ARM CPU is from the details given.

The decoder definitely should work better on larger inputs, as the overhead from start/end alignment gets amortized more, so it isn't really designed for ~128 byte inputs. The fact that it can handle newlines as well as dot unstuffing means you don't have to do it elsewhere, which is another speed benefit.
Note that ignoring newlines isn't exactly the same behaviour if given badly formed data such as foo\nbar\rbaz (unless your newline splitting logic handles that), and the SSE algorithm is designed to give identical output to typical decoders by stripping all \r and \n characters.

I haven't done enough investigation into whether using the POPCNT instruction is worth it or not. As we only need to count bits in 8-bit integers, a lookup table is quite fast for that purpose. My intuition is that it should be about even, possibly slightly slower, on Intel CPUs (1 POPCNT/clock vs 2 loads/clock), likely faster on AMD Jaguar/Ryzen, probably slower on AMD Bulldozer family. Removing one instance of POPCNT may be beneficial (only running it across the 16-bit mask instead), and/or maybe expanding the pshufb_combine_table lookup to take in part of the mask instead of a count may have some positive effect.
Thanks to your info though, I've decided to disable that path for most CPUs. Also, thanks for posting the results!


I think I've got most of the optimizations in now, and can't think of many more ideas on how to change the algorithm. So, some quick speed tests:

Tested by encoding 750KB of fixed random data into yEnc, then decoding that data and timing the decode. There'll obviously be some variation depending on how much escaping the random data generates. Measured speed is relative to the decoded output produced (not input to the decoder).
Compiled using GCC version 5/6.

Cleaned decoder:

Intel Ivy Bridge @3.8GHz (i5 3570S): 2007 - 2118 MB/s
Intel Haswell @3.8GHz (E3 1231v3): 2235 - 2398 MB/s
Intel Silvermont @2.6GHz (Atom C2750): 697 - 700 MB/s
AMD K10 @2.9GHz (Athlon II 245): 547 - 619 MB/s

Raw decoder:

Intel Ivy Bridge @3.8GHz (i5 3570S): 1943 - 1999 MB/s
Intel Haswell @3.8GHz (E3 1231v3): 2003 - 2114 MB/s
Intel Silvermont @2.6GHz (Atom C2750): 587 - 589 MB/s

animetosho commented 7 years ago

I've ported the algorithm to ARM NEON. I've been wondering whether it'd actually be faster or not, considering the limitations of most ARM CPUs, but on my Cortex A53 running in armv7 mode, decoding does seem to be about 2x faster than my scalar code. Encoding actually gets more of a speed boost, most likely because there's less scalar <-> NEON interaction.

There are a number of issues to consider with ARM/NEON though.

The code is otherwise basically identical to the SSE code, just uses NEON instructions instead.
I don't know whether NEON is worthwhile across all ARM CPUs. It should be good on more powerful ARM cores, but older/weaker cores may not benefit as much, or at all.

hugbug commented 7 years ago

Thanks a lot!

In the meantime I've done many tests of performance optimisations in NZBGet, including the SSE decoder and the SSE CRC routine. I'm currently in the process of integrating them for good.

Producing binaries that work on all systems was the difficult part. I cannot compile the whole program with -march=native as it would not work on older systems. The solution I'm currently implementing is compiling the decoder unit with the additional flags -msse3 -mssse3 and the crc_fold unit with -msse4.1 -mpclmul, whereas all other units are compiled for the default architecture (i686 or x86_64).

I was just about to ask you about ARM support, but you have that already. Just in time! 🥇

I absolutely need to do a runtime feature check on ARM too. I'm planning to achieve this by parsing /proc/cpuinfo on Linux. I don't support ARM on other platforms anyway.

My ARM device doesn't have crc but it has neon. I'll run tests for the neon decoder there.

sanderjo commented 7 years ago

@hugbug

I'm planning to achieve this by parsing /proc/cpuinfo on Linux.

Please note: ARM64 / Aarch64 always has NEON, but does not mention that in /proc/cpuinfo:

sander@nanopineo2:~$ cat /proc/cpuinfo
processor   : 0
Processor   : AArch64 Processor rev 4 (aarch64)
Hardware    : sun50iw1p1
BogoMIPS    : 48.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part    : 0xd03
CPU revision    : 4

I believe that when compiling, you should NOT mention "neon". I'll check.

... about compiling (#1)

I tried https://github.com/animetosho/node-yencode/issues/4#issuecomment-331716855 on my ARM64, but that didn't work: see errors below.

The GCC on my ARM64 (5.4.0 20160609) is older than on my x86 Ubuntu (6.3.0 20170406) ... is that the cause?

sander@nanopineo2:~/git/node-yencode/src$ g++ decoder_test.cpp decoder.cc -O2 -march=native -o decoder_test
decoder_test.cpp:10:7: error: expected nested-name-specifier before ‘decode_func’
 using decode_func = size_t(*)(const unsigned char* src, unsigned char* dest, size_t len, char* state);
       ^
decoder_test.cpp:24:1: error: in C++98 ‘tests’ must be initialized by constructor, not by ‘{...}’
 };
 ^
decoder_test.cpp:24:1: error: could not convert ‘{{"7c 8b 9c 4b 44 31 2a 84 98 9d 3b 2b 37 2a 2a 2a 2a 2a 2a 2a dd b5 9e 4c bb 78 2a b4 29 69 30 2a 2a 8a ed 2c 8a 04 14 98 1e 8c 2e 73 3e 5a 4b 2a 4a 4a 2a 2a 2a 2a 2a 2a 2b 2a 2a 2a 79 9a 8f 98 6b 7e 80 57 8c 9f 93 96 8e 57 8c 99 a2 57 76 93 98 9f a2 57 8e 93 9d 95 5b 58 a0 8e 93 2a 1a 7d e6 a2 66 66 66 4a 79 9c 8b 8d 96 8f 4a 80 77 4a 80 93 9c 9e 9f 8b 96 6c 99 a2 4a 6e 93 9d 95 4a ", "52 61 72 21 1a 07 00 5a 6e 73 11 01 0d 00 00 00 00 00 00 00 b3 8b 74 22 91 4e 00 8a ff 3f 06 00 00 60 c3 02 60 da ea 6e f4 62 04 49 14 30 21 00 20 20 00 00 00 00 00 00 01 00 00 00 4f 70 65 6e 41 54 56 2d 62 75 69 6c 64 2d 62 6f 78 2d 4c 69 6e 75 78 2d 64 69 73 6b 31 2e 76 64 69 00 f0 53 bc 78 3c 3c 3c 20 4f 72 61 63 6c 65 20 56 4d 20 56 69 72 74 75 61 6c 42 6f 78 20 44 69 73 6b 20 "}, {"0f 1a b1 a4 0c d4 15 2a 61 47 c8 bf bf e4 d3 e9 e2 b2 1f e0 99 1d 79 9a 38 26 c0 8b 3d 40 42 cf c6 f9 85 34 8d 9c f2 55 ce 16 ec 4d 38 29 3d 4d 22 d8 bb cc ce 2a 91 c9 93 87 6f 0f fb 5b 2c d7 90 3c 22 4a ac a7 1a 57 1a bb 6b 64 23 e0 87 8f b2 3d 7d 94 30 c5 eb 2f cb e5 78 35 8e bc d0 0b 57 15 58 69 e3 9d fc f3 da 6b c1 07 3d 4d d2 6a 60 6f 43 a4 3d 4a 81 dc b7 ca 04 8a c1 f6 8d b8 ", "e5 f0 87 7a e2 aa eb 00 37 1d 9e 95 95 ba a9 bf b8 88 f5 b6 6f f3 4f 70 0e fc 96 61 d6 18 a5 9c cf 5b 0a 63 72 c8 2b a4 ec c2 23 0e ff e3 f8 ae 91 a2 a4 00 67 9f 69 5d 45 e5 d1 31 02 ad 66 12 f8 20 82 7d f0 2d f0 91 41 3a f9 b6 5d 65 88 13 6a 06 9b c1 05 a1 bb 4e 0b 64 92 a6 e1 2d eb 2e 3f b9 73 d2 c9 b0 41 97 dd e3 a8 40 36 45 19 7a e0 57 b2 8d a0 da 60 97 cc 63 8e "}}’ from ‘<brace-enclosed initializer list>’ to ‘std::vector<test_pair>’
decoder_test.cpp: In function ‘std::vector<unsigned char> unhex(const char*)’:
decoder_test.cpp:32:35: error: ‘strtol’ was not declared in this scope
   long val = strtol(p, &endptr, 16);
                                   ^
decoder_test.cpp: In function ‘void print_vector(const char*, std::vector<unsigned char>)’:
decoder_test.cpp:40:23: error: ‘printf’ was not declared in this scope
  printf("%s:\n", title);
                       ^
decoder_test.cpp:41:26: warning: range-based ‘for’ loops only available with -std=c++11 or -std=gnu++11
  for (unsigned char ch : vec)
                          ^
decoder_test.cpp: At global scope:
decoder_test.cpp:48:11: error: variable or field ‘test’ declared void
 void test(decode_func func)
           ^
decoder_test.cpp:48:11: error: ‘decode_func’ was not declared in this scope
sander@nanopineo2:~/git/node-yencode/src$

Am I doing something wrong?

UPDATE More compiling (#2):

After upgrading to gcc 6.3.0 20170519, fewer error messages, but still two fatal ones:

sander@nanopineo2:~/git/node-yencode/src$ g++-6 decoder_test.cpp decoder.cc -O2  -o decoder_test
decoder_test.cpp:8:15: warning: ‘size_t do_decode_sse(const unsigned char*, unsigned char*, size_t, char*) [with bool isRaw = false; bool use_ssse3 = false]’ used but never defined
 static size_t do_decode_sse(const unsigned char* src, unsigned char* dest, size_t len, char* state);
               ^~~~~~~~~~~~~
decoder_test.cpp:6:15: warning: ‘size_t do_decode_scalar(const unsigned char*, unsigned char*, size_t, char*) [with bool isRaw = false]’ used but never defined
 static size_t do_decode_scalar(const unsigned char* src, unsigned char* dest, size_t len, char* state);
               ^~~~~~~~~~~~~~~~
/tmp/ccRoCy9t.o: In function `main':
decoder_test.cpp:(.text.startup+0x38): undefined reference to `unsigned long do_decode_sse<false, false>(unsigned char const*, unsigned char*, unsigned long, char*)'
decoder_test.cpp:(.text.startup+0x3c): undefined reference to `unsigned long do_decode_sse<false, false>(unsigned char const*, unsigned char*, unsigned long, char*)'
collect2: error: ld returned 1 exit status
sander@nanopineo2:~/git/node-yencode/src$ 
hugbug commented 7 years ago

In the test app change do_decode_sse to do_decode_neon.

animetosho commented 7 years ago

@hugbug sounds interesting! A note that I assume all x86 CPUs support SSE2, as I don't bother supporting dynamic dispatch for CPUs that lack it. This isn't an issue with x86-64, as all 64-bit x86 CPUs are required to support SSE2. So unless you plan to implement dynamic dispatch for non-SSE2 code, you may as well compile your entire application with -msse2.
I think assuming SSE2 availability (introduced in Intel Pentium 4 and AMD K8/Athlon64) is reasonable and a fair amount of modern software requires it. From memory, Windows 8, Firefox 49 and Chrome 35 won't run on CPUs without SSE2 support.
Oh, and -msse3 isn't required (we don't use any instructions from SSE3, and SSSE3 implies SSE3 anyway).

I suspect /proc/cpuinfo on ARM won't tell you the CPU uArch unfortunately, which can be a problem if the NEON routine ends up being slower on specific ARM CPUs.

but does not mention that in /proc/cpuinfo:

I believe it's named asimd there, but AArch64 always supports it, so there's no need to attempt detection for an AArch64 build (you still may wish to check for crc32 however).


Testing on a Cavium ThunderX ARMv8 from Scaleway, I get roughly a 2x speed increase with the NEON decoder over scalar. It seems that SIMD is generally weaker than the Cortex A53, relative to scalar code, but I'm struggling to find much info on the CPU.

Safihre commented 7 years ago

I think me and my limited C knowledge will just wait and see what you @hugbug make of it; maybe I can then start thinking about how to integrate it into SABnzbd decoding. Should have taken that C++ course at university 🤔

hugbug commented 7 years ago

@Safihre, I'm afraid you won't profit much from the nzbget implementation, as I have stripped down the original version by removing parts which nzbget doesn't need but SAB probably does (raw mode).

Although the original version is a C++ module, you should be able to convert it to plain C relatively easily. The only obvious C++ specific feature I see there is the usage of templates, for example:

template<bool isRaw>
size_t do_decode_scalar(const unsigned char* src, unsigned char* dest, size_t len, char* state) {

In C++ that creates two versions of the routine do_decode_scalar at compile time (the names below are fictional, for illustration purposes):

  1. do_decode_scalar_raw_true
  2. do_decode_scalar_raw_false

In C you can manually copy and paste the function under two different names. In the first routine (for raw=false) you remove the code inside if(isRaw) altogether, and in the second you replace if(isRaw) with if(true).

If you don't need the non-raw function in SAB, just keep the raw version of the function.
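As a toy illustration with a hypothetical one-line body (not the real decoder), the two hand-written C functions would look like:

#include <stddef.h>

static size_t do_decode_scalar_raw_true(const unsigned char* src,
                                        unsigned char* dest, size_t len) {
    size_t out = 0;
    for(size_t i = 0; i < len; i++) {
        if(src[i] == '.' && i > 0 && src[i-1] == '\n')
            continue; // the former if(isRaw) branch, kept
        dest[out++] = src[i];
    }
    return out;
}

static size_t do_decode_scalar_raw_false(const unsigned char* src,
                                         unsigned char* dest, size_t len) {
    size_t out = 0;
    for(size_t i = 0; i < len; i++)
        dest[out++] = src[i]; // the if(isRaw) code removed altogether
    return out;
}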

Similarly for

template<bool isRaw, bool use_ssse3>
size_t do_decode_sse(const unsigned char* src, unsigned char* dest, size_t len, char* state) {

You would manually create four functions. Again, if you don't need the raw version then create only two (use_ssse3=false and use_ssse3=true).

Then in the init function, instead of

void decoder_init() {
#ifdef __SSE2__
    _do_decode = &do_decode_sse<false, false>;
    _do_decode_raw = &do_decode_sse<true, false>;
...
#ifdef __SSSE3__
    _do_decode = &do_decode_sse<false, true>;
    _do_decode_raw = &do_decode_sse<true, true>;

you can write

void decoder_init() {
#ifdef __SSE2__
    _do_decode = &do_decode_sse_clean_nossse3;
    _do_decode_raw = &do_decode_sse_raw_nossse3;
...
#ifdef __SSSE3__
    _do_decode = &do_decode_sse_clean_ssse3;
    _do_decode_raw = &do_decode_sse_raw_ssse3;
hugbug commented 7 years ago

@animetosho I provide binaries for i686 and x86_64. Most current systems use x86_64, whereas i686 is mainly for older systems. I don't know if all of them have SSE2. I can imagine NAS devices built with AMD or Transmeta or more exotic chips which don't have SSE2. I'm going to add an SSE2 runtime check.


Unrelated: Is there a way to make crc_fold accept an initial CRC as an extra parameter, like most other routines can?

animetosho commented 7 years ago

I had a bit of a look at SABYenc and it appears to also process data line by line - I think you pass in an array of lines which represents the entire article, if my quick skimming of the code is correct.
If so, then the approach NZBGet uses should be similar. Since the data is handled line by line, you don't really need the raw mode decoder, since just checking whether the first character is a dot is enough.

Does it have to be in plain C? (I don't really get C++ myself, and prefer to write in C style, but templates just happen to be very useful here)


Is there a way to make crc_fold accept an initial CRC as extra parameter like most other routines can?

In theory, it should be possible to calculate state from an existing CRC32, but I'm not confident with the mathematics behind it. crc_fold can be done incrementally though, as it holds its state in 4 XMM values - if you need incremental support, it's probably better (more efficient) to go that route rather than converting a CRC32 into the internal state. It may take some effort to do though.
Alternatively, you can take my approach and just combine two CRC32s into one.
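For illustration, zlib exposes exactly this operation as crc32_combine, if you'd rather not derive the GF(2) mathematics yourself:

#include <zlib.h>

// crc1 covers the first block, crc2 the second; len2 is the second block's length
unsigned long combine_crc32(unsigned long crc1, unsigned long crc2, long len2) {
    return crc32_combine(crc1, crc2, (z_off_t)len2);
}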

Safihre commented 7 years ago

The input for SABYenc is chunks of data from the socket, not lines. Whenever the socket reports that data can be read, we read it and put it in a Python list until we reach the end of the article. We then give this list of chunks to sabyenc. This is so we can do all the work inside the C module (faster), where we can release Python's GIL, so we have actual multithreading. Usually the list has around 30 chunks for each 700K article. On each chunk (except the last 1 or 2) we can then run the SIMD decoder, which also takes care of the \r\n.. thing. So it should be feasible. Hope to also see how much it improved things in NZBGet, or are these already the final results @hugbug? https://github.com/nzbget/nzbget/issues/448#issuecomment-332311380

hugbug commented 7 years ago

I've added support for a memory cache to NServ, which slightly improved overall performance in the tests (since NServ runs on the same computer and consumes CPU), but the proportions remained.

To estimate the possible performance improvement, I suggest you disable decoding in SABYEnc and measure the speed. You can't get more speed with the SSE decoder than with no decoder at all.

animetosho commented 7 years ago

The input for SABYenc is chunks of data from the socket

Oh I see.
I see that it tries to search for the =yend sequence - if you simply parse the last few chunks for this marker (you probably only need about 1KB of data to search through), I presume you don't need the check in the main loop?

If you can simplify the main loop down, removing the end check from it, you could then replace your main loop with one that just loops through chunks and feeds them to the decoder.
Note that I don't search for null terminators in strings - the code relies on lengths (both faster and more reliable). If the Python API gives you a length without needing to scan the string (a la strlen), this shouldn't be much of an issue.

I've wondered whether I should add support for detecting terminator sequences (\r\n.\r\n and \r\n=yend), as it'd make sense for cases when you directly feed socket data through, rather than having to buffer it up and scan for those elsewhere. It just makes the state machine of an incremental decoder a little trickier.

hugbug commented 7 years ago

@animetosho Is it safe to assume that compiler will not emit any SIMD code on its own?

If I compile the decoder unit with -mssse3 (which is necessary to use SSSE3 intrinsics), can I be sure that the compiler will not emit any SSE3/SSSE3 code into routines which are meant for use when SSSE3 isn't available? It would be fatal if do_decode_scalar or decoder_init ended up containing SSSE3 opcodes because the compiler was smart enough to use certain optimisations.

The situation is even more problematic in the case of ARM. The CRC unit for ARM must be compiled with -march=armv8-a+crc. If I put the CPU detection routine into the same unit, the compiler may use ARMv8 opcodes in that routine, although the rest of the program is compiled for ARMv7 and the CRC init routine must work on ARMv7.

To solve that issue I'm putting CPU detection code into a separate unit compiled with default settings.

I wonder if I should put the SSSE3 decoder into a separate unit to make sure that the SSE2 decoder is compiled with -msse2 only, without -mssse3.

And by the way, is the SSSE3 decoder faster in your tests? In my limited tests I see no difference between the SSE2 and SSSE3 decoders. Maybe I should keep only the SSE2 version; that would make things easier.

animetosho commented 7 years ago

Technically, -mssse3 tells the compiler it can use any feature up to SSSE3 anywhere it likes. This means that the proper way to deal with CPU targeting is to split everything into separate files and dynamically dispatch to the correct function.

It's a pain to do this though. In practice, (S)SSE3 additions are quite purpose specific, and intrinsics are just wrappers around single assembly instructions, so compilers generally follow them. As such, I can't see the compiler using SSSE3 instructions in the SSE2 version, even if you compile with -mssse3, though one could argue that a really clever compiler could eventually do so.
So yeah, if you want to be truly safe, you'll have to split it up. For me, I only provide Windows binaries (I can't get static nodejs builds working for other platforms). MSVC is interesting in that it only goes up to SSE2 (before going into AVX territory) but you can still write SSSE3 intrinsics. This means the compiler itself will never use more than SSE2, but it will of course use any SSSE3 code you specifically write. I don't know whether GCC/Clang has a similar feature - it would be nice if it did.

I'm not sure whether the same can be said with SSE2 over i686 as SSE2 is somewhat more general purpose.

I get significantly faster speeds for SSSE3 (like 1380MB/s SSE2 vs 2200MB/s SSSE3 on a modern Intel CPU). It only differs when removing characters though, so in your case, since you'll never get \r or \n characters, the only characters to remove are the = ones, which won't happen as often. Hence, you probably won't see as much of a speed difference, because the SSSE3 code won't be invoked that frequently.

To my understanding, 32-bit ARMv8 is largely the same as ARMv7, but the documentation lists some differences so it can't really be relied on unfortunately. Though I'm running in armv7 mode, and code compiled using armv8-a+crc seems to work... maybe it's because the CPU can understand the ARMv8 stuff.

I have to say that compiler tooling/support for dynamic dispatch is rather primitive, and a bit of a pain to deal with.

hugbug commented 7 years ago

Found the relevant GCC doc:

If you specify command-line switches such as -msse, the compiler could use the extended instruction sets even if the built-ins are not used explicitly in the program. For this reason, applications that perform run-time CPU detection must compile separate files for each supported architecture, using the appropriate flags. In particular, the file containing the CPU detection code should be compiled without these options.

So, yeah, I have to split up SSE2 and SSSE3 into separate units. Not a big deal; I've already put the initialization part into a separate unit. I guess I have to eliminate the extra optimizations with the tune- and AVX-ifdefs though.
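A minimal shape for that split might look like the following, assuming GCC/Clang's __builtin_cpu_supports for the x86 runtime check; the dispatch unit itself is compiled without -mssse3, and the two decoder functions (names illustrative) live in units compiled with -msse2 and -mssse3 respectively:

#include <stddef.h>

typedef size_t (*decode_func)(const unsigned char*, unsigned char*, size_t, char*);

// defined in separately-compiled units
extern size_t do_decode_sse2(const unsigned char*, unsigned char*, size_t, char*);
extern size_t do_decode_ssse3(const unsigned char*, unsigned char*, size_t, char*);

static decode_func _do_decode;

void decoder_init(void) {
    // safe on any x86 CPU: this unit contains no SSSE3 code itself
    if(__builtin_cpu_supports("ssse3"))
        _do_decode = &do_decode_ssse3;
    else
        _do_decode = &do_decode_sse2;
}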


I'm currently in the process of reworking the code which fetches data from the news server and feeds it into the decoder. My plan is to eliminate our own line detection and use the raw decoder instead (on large blocks, around 10KB). That saves data copying and allows decoding data directly from the buffer which receives from the socket. Everything is done in one pass instead of two (line detection, then decoding).

There is however one thing that stops me at the moment. The decoder must detect the end of the yEnc data or the end of the article. Currently I need to tell the decoder where to stop using len, but that means I have to scan the data first, which is an additional pass.

In yEnc, the end of data is marked with \r\n=yend. More generally, yEnc specifies =y as a reserved sequence; if that's easier to detect (I guess so), it would be sufficient. In addition, the decoder must detect the end of the NNTP article, which is marked with \r\n.\r\n; this is necessary for malformed posts which don't have proper yEnc end markers.

Once the end of data is detected (yend or article end), the decoder should stop processing and report that (for example via a new state value). It would also be helpful if it could report back the end position, since we need to process the yEnc trailing data (parse the CRC). It's not necessary for the decoder to deal with the =ybegin or =ypart markers, as this part can and should be done by the caller before the decoding process starts.

if you simply parse the last few chunks for this marker (you probably only need about 1KB of data to search through), I presume you don't need the check in the main loop?

This doesn't work unfortunately, because I can't detect the last chunk at the transmission level. Once the news server has sent the last chunk, my attempts to receive more data result in hanging, as no data is available (and the server doesn't close the connection, waiting for further commands instead). That's why I need to scan the received data for the end-of-article marker before attempting to receive more data.

Do you think you could extend the decoder with end of stream detection? That'd be great and highly appreciated.

If implemented, we would be able to do all the heavy number crunching using SIMD in one pass. Well, in two passes actually, as we need to calculate the CRC of the decoded data separately. Theoretically, decoding and CRC calculation could be done in one pass, but that would be too much effort I guess, especially as the Intel routine isn't designed for such use. So I'm not asking for this, just thinking out loud.

Thanks again. This discussion topic is a great pleasure to participate in.

animetosho commented 7 years ago

Yeah, detecting terminator sequences is something I've been wondering about. I didn't realize that =y was considered reserved - reading the yEnc specification, it appears that \r\n=y is considered a 'keyword line', in which case it would make sense to stop there. I previously thought that \r\n=yend would need to be detected, which is a little trickier for maintaining state in an incremental decoder (and potentially means dealing with sequences like \r\n=yenb), but \r\n=y will be easier.

So there'll need to be a second version of the function, with a different signature (since input may only be partially consumed), which stops at the end point. The non-raw decoder will only look for \r\n=y whereas the raw one will also look for \r\n.\r\n (or maybe just \r\n.\r?).

This doesn't work unfortunately because I can't detect the last chunk on transmission level

Yes, this only works if the end sequence has been scanned elsewhere. SABYenc seems to take that approach.

Theoretically decoding and CRC calculation could be made in one pass but that would be too much of efforts I guess, especially the Intel routine isn't designed for such use.

I've been thinking about the possibility of such optimizations via function stitching. It's a little awkward because yEnc is variable length, and may require the algorithm to go back to memory, though I think it could still yield a net benefit. It's a little complicated to implement/maintain, and I don't know how much it'd yield over running it separately (perhaps with loop tiling). It's something I may look into eventually nonetheless.

This discussion topic is a great pleasure to participate in.

Same here! I'm glad to be a part of it!

hugbug commented 7 years ago

Non-raw decoder will only look for \r\n=y whereas the raw will also look for \r\n.\r\n (or maybe just \r\n.\r?).

Indeed, NNTP mandates \r\n as the line terminator sequence. I think we can safely assume that neither of these characters appears separately in article content. Then looking for \n=y and \n.\r would be sufficient, if that makes decoding any easier.

perhaps with loop tiling

Regarding better CPU cache usage: if we choose a block size large enough for efficient SIMD processing but small enough to remain in cache between the decoder and CRC passes, we can process the data in two passes without unnecessarily complicating the decoder, right? I mean the caller executes the CRC function right after decoding, and you don't have to deal with integrating CRC calculation into the decoder. That's actually how it is currently processed in nzbget (although in very small one-line chunks).
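A sketch of that two-pass-per-block idea, where do_decode and crc32_update stand in for the real routines (both names hypothetical):

#include <stddef.h>

// hypothetical externs for the decoder and an incremental CRC routine
extern size_t do_decode(const unsigned char* src, unsigned char* dest, size_t len);
extern unsigned int crc32_update(unsigned int crc, const unsigned char* buf, size_t len);

#define BLOCK 16384 // large enough for SIMD efficiency, small enough to stay in cache

size_t decode_and_crc(const unsigned char* src, unsigned char* dest,
                      size_t len, unsigned int* crc) {
    size_t total = 0;
    while(len) {
        size_t chunk = len < BLOCK ? len : BLOCK;
        size_t out = do_decode(src, dest + total, chunk); // pass 1: decode the block
        *crc = crc32_update(*crc, dest + total, out);     // pass 2: CRC while still cached
        src += chunk; len -= chunk; total += out;
    }
    return total;
}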

The only thing we need is an incremental CRC function. The ARM function is already incremental; crc_fold isn't yet. The Linux kernel has a CRC function which also uses PCLMULQDQ but supports an initial CRC parameter. Have you evaluated it yet? It's pure asm, which is scary and probably not easily portable. From the amount of code it seems to be much smaller than crc_fold, so probably not as sophisticated and not as fast.

animetosho commented 7 years ago

My current plan is to bail out of the SIMD decoder when it sees \r\n=y or \r\n.\r. Two-character sequences are slightly easier to handle: since we already search for \r\n, we can just shift the data across 2 bytes and match against the next two bytes.
When it sees the sequence, it'll break out into scalar code. The scalar code requires a \r\n.\r\n sequence to end though, so if you get a sequence like \r\n.\ra, it won't end, but the rest of the message will be processed more slowly. This stays strictly spec compliant, and I think that should be acceptable.
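One way to express a multi-byte sequence match like \r\n=y in SIMD (a sketch assuming SSSE3's _mm_alignr_epi8 to see across the block boundary; not the repo's exact code):

#include <tmmintrin.h> // SSSE3

// prev = previous 16-byte block, data = current one; bit i of the result is
// set if the 'y' of a "\r\n=y" sequence lands at position i of `data`
static unsigned match_crlf_eq_y(__m128i prev, __m128i data) {
    __m128i b1 = _mm_alignr_epi8(data, prev, 15); // stream byte at i-1
    __m128i b2 = _mm_alignr_epi8(data, prev, 14); // stream byte at i-2
    __m128i b3 = _mm_alignr_epi8(data, prev, 13); // stream byte at i-3
    __m128i match = _mm_and_si128(
        _mm_and_si128(_mm_cmpeq_epi8(b3, _mm_set1_epi8('\r')),
                      _mm_cmpeq_epi8(b2, _mm_set1_epi8('\n'))),
        _mm_and_si128(_mm_cmpeq_epi8(b1, _mm_set1_epi8('=')),
                      _mm_cmpeq_epi8(data, _mm_set1_epi8('y'))));
    return (unsigned)_mm_movemask_epi8(match);
}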

Limiting the processing size is useful for the cache if multiple passes are necessary. Function stitching is still beneficial for a number of reasons though.

I'm not intending this decoder to be the fastest it can possibly be; I'm mostly looking for easy wins (and stitching CRC into the decoder is a little complex). A modern CPU can already run this at around 2GB/s per core, and PCLMUL CRC32 runs in excess of 10GB/s on one core of a Haswell CPU, which is a fair bit faster than what a 10Gbps connection would give you. I can't see exceeding 10Gbps being an issue for a while, and one can always multi-thread.

The CRC folding approach is incremental, if you look at the original code I took it from. The state isn't stored as a CRC32 though; rather, it's stored in 512 bits which can be reduced to a 32-bit CRC.

I haven't tested the Linux code yet, thanks for linking to that. It looks like it largely uses the same idea. One could port this to C intrinsics if assembly is not desired.

hugbug commented 7 years ago

I'm currently in the process of reworking the code which fetches data from the news server and feeds it into the decoder. My plan is to eliminate our own line detection and use the raw decoder instead (on large blocks, around 10KB). That saves data copying and allows decoding data directly from the buffer which receives from the socket. Everything is done in one pass instead of two (line detection, then decoding).

Done!

Since the current version of the raw decoder doesn't support end-of-stream detection yet, an extra scan of the incoming data before feeding the decoder is implemented to properly process the data. Therefore it's not one pass at the moment.

Nonetheless the overall efficiency is greatly improved: 268 MB/s -> 350 MB/s.

Detailed benchmark results

Decoder | CrcCheck Off | CrcCheck On
--- | --- | ---
new-decoder (per-line) | 282 MB/s | 245 MB/s
sse-decoder (per-line) | 304 MB/s | 268 MB/s
sse-decoder (raw mode) | 404 MB/s | 350 MB/s
scalar-decoder (raw mode) | 329 MB/s | 292 MB/s
no-decoder (raw mode) | 449 MB/s | 399 MB/s

Please note that these speeds represent overall download speed in NZBGet, not just decoding speed (the program of course has to do way more work in addition to decoding).

For test conditions see this topic.


I'm still using a non-SIMD CRC32 calculation routine (slice-by-4). Improvements on that front are the next item on my list.

The CRC folding approach is incremental, if you see the original code I took it from. The state isn't stored as a CRC32 though, rather it's stored in 512 bits which can be reduced to a 32-bit CRC.

That's great news. I'll adopt it.

Safihre commented 7 years ago

Cool, nice to see that my random thought bubble of opening this issue resulted in something useful. If it's possible for NZBGet, SABnzbd will also be able to take advantage of it :+1: All thanks to @animetosho!

@hugbug Do you possibly also have test results on ARM? Would be very interested to see how much NEON adds!

I've shifted my focus now to first converting SABnzbd to Python 3, so that it can also use VS2017 for compiling the C extensions (in 2.7 you're locked to VS2008) and use async approaches to start decoding in chunks, instead of the current convoluted approach of a list of chunks. SABnzbd 2.3.1 maxes out a gigabit connection using my old i5, so I need to buy a slower device to even start testing the real-world impact of these things 😄

hugbug commented 7 years ago

Reposting test results from NZBGet topic here as this is very much related to the discussion.

When running tests on two ARM devices I discovered and fixed a performance bottleneck in NServ. The CPU usage of NServ has been greatly reduced, which in turn gives more CPU time to NZBGet and increases speed. All tests were rerun with the improved NServ (including the tests on Mac).

Test devices

  1. MacBook 2010 with Intel Core i5-520 (2 cores @ 2.4 GHz, 4 threads), macOS 64 bit
  2. PVR with ARMv7 Cortex A-15 (2 cores @ 1.7GHz), Broadcom BCM 7251s, Linux 32 bit
  3. NanoPi NEO2 with ARMv8 Cortex A-53 (4 cores @ 1.5GHz), Allwinner H5, Linux 64 bit; NZBGet runs in 32 bit

Test results

All numbers are in MB/s. For each decoder two test cases were measured - with and without CRC calculation; the latter is shown in parentheses. The overhead of CRC calculation shows how much improvement potential is still there - the CRC routine isn't optimised for SIMD yet.

Once again a reminder that the speeds below represent overall download speed in NZBGet, not just decoding speed.

Decoder | MacBook 2010 | PVR ARMv7 | NEO2 ARMv8
--- | --- | --- | ---
new-decoder per-line mode | 307 (354) | 89 (99) | 102 (118)
simd-decoder per-line mode | 336 (404) | 89 (100) | 101 (119)
simd-decoder raw mode | 467 (570) | 99 (109) | 121 (141)
scalar-decoder raw mode | 369 (421) | 93 (105) | 107 (124)
no-decoder raw mode | 544 (693) | 111 (128) | 145 (168)

Observations

animetosho commented 7 years ago

Thanks for those benchmarks - very interesting to know!

The ARM results are interesting. I don't really know ARM uArchs anywhere near as well as I do x86, but I suspect lower performance on ARM cores which have a separate NEON unit (I believe older designs do this). But interestingly, from these slides, it seems that the Cortex A15 fully integrates NEON into the regular pipeline, so I'm a bit surprised by the lack of speed gain there. Perhaps this could mean that the SIMD decoder is slower than the scalar decoder on something like a Cortex A7.

I can't find much information on the SIMD width of ARM chips. I believe the Cortex A57 and later use 128-bit units, so I'm guessing the A53 and A15 possibly use 64-bit units. That in itself will reduce the benefit of SIMD (half the throughput).

hugbug commented 7 years ago

On the Wikipedia page Comparison of ARMv7-A cores, the Cortex A-12, A-15 and A-17 are listed as having 128-bit NEON.

What puzzles me in the benchmark results is that the performance difference between the slowest (new-decoder per-line mode, crc on) and the fastest (no-decoder raw mode, crc off) is only 29% (68 vs 88) on ARMv7, whereas on Intel it's 125% (307 vs 693). As if the CPU spends far more time doing other things and therefore doesn't respond so well to optimisations in the decoder. Maybe it's not the CPU itself but other system components such as slow RAM, I don't know.

Safihre commented 7 years ago

@hugbug do you run NServ on the ARM devices themselves too? What if you run it on your Mac/Windows machine and then connect to the ARMv7 via gigabit? Just to see where the bottleneck is. In the end it's of course real-life performance, in a fully working setup, that counts.

animetosho commented 7 years ago

On Wikipedia page Comparison of ARMv7-A cores Cortex A-12, A-15 and A-17 are listed as having 128 bit NEON.

That's an interesting and useful comparison table.

Somewhat interestingly though, I just found this article which seems to imply that A15 is not 128b:

Architecturally, the Cortex A57 is much like a tweaked Cortex A15 with 64-bit support. The CPU is still a 3-wide/3-issue machine with a 15+ stage pipeline. ARM has increased the width of NEON execution units in the Cortex A57 (128-bits wide now?)

Similarly, the same site mentions that the A7 and A53 have single-issue 64-bit units. Of course, the site could be wrong, but if it's true, it'd make a little bit more sense.

Your PVR benchmarks do still seem a little odd - I'd expect a 1.7GHz 3-wide out-of-order core to easily beat a 1.5GHz 2-wide in-order CPU at almost everything, but it seems to be significantly weaker. ARM claims that the A53 should be about as performant as an A9, but the A15 should be more powerful than an A9.

The NEO2 benchmarks make more sense to me, though your point about there being much less gain with everything disabled is certainly on the mark still.

connect to the ARMv7 via Gigabit?

Gigabit maxes out at 125MB/s, which I suspect could be a bit of a bottleneck (perhaps not for the PVR). I'm not exactly knowledgeable about the test server, but if it runs on a different core, hopefully it doesn't have much of an effect. Unless there's a memory bottleneck or similar - might be worth a try nonetheless (to get an idea of how network transfers affect the device).

hugbug commented 7 years ago

I'm testing in NZBGet, which is multithreaded, and 10 connections were configured for the tests, meaning 10 threads all doing similar jobs (on different articles): downloading, decoding, computing CRC, and normally also writing to disk, but that last part was disabled during the tests.

Therefore 4 cores @ 1.5GHz is better than 2 cores @ 1.7GHz.

What if you run it on your Mac/Windows and then connect to the ARMv7 via Gigabit?

Tried that too and got similar results. Now the process ksoftirqd takes roughly the same amount of CPU as NServ was taking (40% of a total 200%). My guess is that it has something to do with the ethernet chip.

Decoder | PVR ARMv7 via Ethernet
--- | ---
new-decoder per-line mode | 67 (74)
simd-decoder per-line mode | 68 (74)
simd-decoder raw mode | 73 (81)
scalar-decoder raw mode | 71 (78)
no-decoder raw mode | 82 (91)

When redoing the tests on the PVR I noticed that I'm now getting lower CPU usage in NServ and better speeds in the benchmark (97 MB/s in simd-decoder raw mode with crc on, vs 72 MB/s in the previous test). It seems in my previous test on the PVR I was using an older version of NServ which didn't include the latest optimisation.

I'm redoing tests on PVR and will post updated results.

hugbug commented 7 years ago

Results for ARMv7 (NServ) updated directly in the benchmark post.

ARMv7 and ARMv8 can now be better compared because the same nzbget binaries and same NServ were used in all tests.

hugbug commented 7 years ago

If you aren't tired of benchmarks yet, here are the numbers for SIMD CRC, which I've just integrated (reposting from the NZBGet issue tracker).

SIMD CRC

Integrated SIMD CRC routines for Intel and ARMv8 into NZBGet.

All numbers are in MB/s. For each decoder two test cases were measured - with and without CRC calculation; the latter is shown in parentheses. All tests were performed 4 times; the worst result was discarded and the average of the remaining three results was taken.

For convenience the table also includes all previous measurements with scalar-crc routine.

Decoder | MacBook 2010 | PVR ARMv7 | NEO2 ARMv8
--- | --- | --- | ---
new-decoder per-line mode, scalar-crc | 307 (354) | 89 (99) | 102 (118)
simd-decoder per-line mode, scalar-crc | 336 (404) | 89 (100) | 101 (119)
simd-decoder raw mode, scalar-crc | 467 (570) | 99 (109) | 121 (141)
scalar-decoder raw mode, scalar-crc | 369 (421) | 93 (105) | 107 (124)
no-decoder raw mode, scalar-crc | 544 (693) | 111 (128) | 145 (168)
simd-decoder raw mode, simd-crc | 520 (563) | n/a | 136 (141)
scalar-decoder raw mode, simd-crc | 397 (420) | n/a | 118 (125)
no-decoder raw mode, simd-crc | 621 (684) | n/a | 165 (169)

Conclusion

animetosho commented 7 years ago

Thanks again for the benchmarks - they're quite interesting.

which is multithreaded

Oops, forgot about that, thanks for the correction!


I've gotten around to implementing the end-scanning version of the decoder. I've been a little busy with other stuff, so it took longer than expected.

Function signatures have changed to accommodate signaling how much input/output is consumed. I've also moved code around, which may be rather annoying, but the interfaces have changed anyway. The return value signals which end sequence was hit. If it returns 0, then all input was consumed; otherwise *src is updated to point to the byte following \r\n=y or \r\n.\r\n.

It turns out that \r\n.=y also needs to be identified as a valid end sequence, because the NNTP layer removes the dot and the yEnc layer then sees \r\n=y. As such, I've decided to handle \r\n.\r\n accurately in the SIMD portion as well.

Searching for the end sequence is sometimes noticeably slower than not doing it, but hopefully faster than a second scan of the data.