enable avx512 support for base64 encoding. Reuse WojciechMula/base64-…

lucshi commented 2 years ago

Add AVX512 support for base64 encoding based on WojciechMula's code.

aklomp code base has been borrowed by Node.js upstream and I would like to enable avx512 support which can improve base64 encoding performance by additional +12% for node.js base64 on. The measurement of aklomp benchmark result also showed performance improvement for <100KB data, compared with AVX2. Improvements are around 36%-86%.

Testing with buffer size 100 KB, fastest of 10 * 100
AVX2    encode  17238.22 MB/sec
AVX2    decode  16882.58 MB/sec
AVX512  encode  23613.74 MB/sec
AVX512  decode  16260.03 MB/sec
Testing with buffer size 10 KB, fastest of 100 * 100
AVX2    encode  16852.97 MB/sec
AVX2    decode  16605.95 MB/sec
AVX512  encode  31398.70 MB/sec
AVX512  decode  16316.29 MB/sec
Testing with buffer size 1 KB, fastest of 100 * 1000
AVX2    encode  11344.30 MB/sec
AVX2    decode  10261.57 MB/sec
AVX512  encode  19500.83 MB/sec
AVX512  decode  10841.66 MB/sec

Above results were collected on Icelake cpu which is widely used in Amazon AWS EC2 cloud instances.

htot commented 2 years ago

Ordinarily, you would not benefit from the memory cache. That means that depending on your L3 cache size 10MB or 100MB would be the buffer size you need for the benchmark.

aklomp commented 2 years ago

Thanks for the pull request. I'll try to comment on it in the next few days. I like the idea of adding an AVX512 codec if @WojciechMula's code is under a compatible license. But there are a few things that I don't like in this pull request, such as copypasting code from Chromium and not updating the user documentation. I'll try to respond more substantially in the coming days.

lucshi commented 2 years ago

Hi htot, thank you for your prompt comments. I did not get your point exactly. Could you please kindly give more hints? Thanks!

htot commented 2 years ago

Hi htot, thank you for your prompt comments. I did not get your point exactly. Could you please kindly give more hints? Thanks!

The benchmark repeatedly encodes the same array of data. If the array fits in your (L3) cache you measure very high speeds that you would normally not see when encoding a single image once. I think most of our benchmarks were done with 10MB, however I now have a system with 16MB cache.

lucshi commented 2 years ago

Hi htot, thank you for your prompt comments. I did not get your point exactly. Could you please kindly give more hints? Thanks!

The benchmark repeatedly encodes the same array of data. If the array fits in your (L3) cache you measure very high speeds that you would normally not see when encoding a single image once. I think most of our benchmarks were done with 10MB, however I now have a system with 16MB cache.

L3 cache of each of CPU socket is 48MB (2 sockets are 96MB in total). Are you suggesting I should enlarge the benchmark data size bigger than 48MB/96MB, or just measure the first round performance to avoid cache benefit? I did not get your point why it would be a problem when benchmark data size is smaller than cache size. I thought L3 cache cannot make two algorithem with the same performance if the algorithms have performance gaps in nature.

htot commented 2 years ago

That's a lot of cache :-) I was lazy and just added a benchmark with 100MB. I'm note sure how the split L3 cache works. But in principle I would say that caching code but not the data is fair. Caching data can be useful to compare the speed of the algorithms excluding time to access memory.

lucshi commented 2 years ago

Hi @aklomp , I updated my patch and accepted your review comments of "adding readme and removing chromium code". Could you please have a review? Thanks! After it is landed it will be merged back into Node.js repo.

lucshi commented 2 years ago

Hi @aklomp , I'm wondering if you have bandwidth to review this PR. Thanks!

aklomp commented 2 years ago

@lucshi Sorry for the delay, I'm still here! Your pull request is near the top of my mental to-do list of things to process (along with #105), but I've been really busy the past weeks and haven't found a moment to deeply engage with it. I do plan on giving it a closer look soon. If you feel that I'm taking too long, don't hesitate to reach out.

lucshi commented 2 years ago

@lucshi Sorry for the delay, I'm still here! Your pull request is near the top of my mental to-do list of things to process (along with #105), but I've been really busy the past weeks and haven't found a moment to deeply engage with it. I do plan on giving it a closer look soon. If you feel that I'm taking too long, don't hesitate to reach out.

Thank you @aklomp for the response. I would like to make it landed if possible very soon to catch up with the Node.js upstream optimization. Do you require me to write some notes about the PR to make the review faster?

aklomp commented 2 years ago

@lucshi The main question I had was whether WojciechMula's code was under a compatible licence, but apparently it is (3-clause BSD).

Before merging, I also want to do a thorough code review to ensure that the code conforms to the library's general structure. It's this code review that I'm blocked on, but I'll try to put it on the fast path. I do feel bad for making you wait this long for a proper PR review. I really do appreciate the effort that you put in this, and it's definitely a very nice enhancement.

lucshi commented 2 years ago

@lucshi The main question I had was whether WojciechMula's code was under a compatible licence, but apparently it is (3-clause BSD).

Before merging, I also want to do a thorough code review to ensure that the code conforms to the library's general structure. It's this code review that I'm blocked on, but I'll try to put it on the fast path. I do feel bad for making you wait this long for a proper PR review. I really do appreciate the effort that you put in this, and it's definitely a very nice enhancement.

I'm totally OK for a careful code review. Per my understanding, BSD-3 is a most popular license nowadays and the code can be used freely in BSD-2 project as long as including the copyright text and the authors and projects names not be used in market promotion.

I'll also start the IP scan for the code base along with my PR by Protex IP Scanning tool to see if any other IP issues. Will update the results soon.

lucshi commented 2 years ago

There may be other IP related revise, but the IP scanning tool Protex did not found IP conflictions.

aklomp commented 2 years ago

Hey, I've informally reviewed the updated PR and I think it looks good on the whole. It's something I think can be merged, but it needs some work to clean it up. I'm volunteering to do that; more on that below. The main sticking points regarding the code I have are:

The commit history is not nice, clean, composable and atomic. The first commit makes all kinds of changes, the next commit undoes a whole bunch of stuff. The commits are also either way too large or way too small. IMO, a patchset should tell a story and introduce (or remove) bits of functionality in a way that's atomic. A good split here would be "add codec", then "add cmake and makefile support for avx512", then "update the test suite", then "update the readme", or something along those lines.
There are a lot of code style errors: trailing whitespace, indents with spaces instead of tabs, overlong lines, overlong commit messages, inconsistent code formatting, and so on.
There's code duplication: the AVX2 decoder is copypasted. I understand the reason why, but it should be changed before merging. The "nicer" way of doing the same thing is to #include "../avx2/dec_loop.c" in lib/arch/avx512/codec.c.
There are some minor things missing, such as adding Daniel Lemire and Wojciech Muła to the copyright section of the LICENSE file.

I'm giving you this short list instead of a full review because I don't want to come across as mean by pointing out boring details like trailing whitespace in overlong detail. I'm saving us the effort. But if you want though, I can certainly do a full review.

For the way forward, what I had in mind was that I would take your PR and rework it to fix the issues above, retaining you as the author, and then come back to you for permission to merge that new branch. I think it's faster and less frustrating (and less work for you!) to do it that way. Of course, if you prefer, we can also do it classically and I'll be the reviewer/oracle while you push corrections. What do you say?

Then there are two ephemeral issues that should also be mentioned:

Muła's code is under BSD3, my code is under BSD2. Are they compatible? I'm not a license expert, but BSD3 is less permissive than BSD2, so I would guess that they might not be compatible. Any thoughts on this?
CI is currently failing in a number of places, but that's due to the issue mentioned in #77. It's not something that I can fix now without rearchitecting the library, so I suggest we leave it as-is.

(Sorry for accidentally closing this issue, I clicked the wrong button!)

aklomp commented 2 years ago

I pushed a branch called avx512 which squashes your work on top of master and makes some cleanups/fixes. I didn't break this up into smaller atomic commits yet, but it gives you an idea of the direction I'm moving in.

I also set up Intel SDE locally to test the AVX512 code.. very good for building confidence in the code! :sweat_smile:

lucshi commented 2 years ago

Hi @aklomp , Thank you for the quick review comments. I'm OK to go with the fastest way that you do the fixes and made me one of the authors.

When everything is done, I will submit a PR on node.js and merge updated code.

lucshi commented 2 years ago

For the license issue. I have consulted the license expert in my company for the process of adding a BSD3 source file avx512.c into a BSD2 project.

The answer I got as below:

My understanding is that the bottom line is that you make sure the bsd-3 license text is in the avx512.c file.

aklomp commented 2 years ago

Again sorry for the delay. I intend to refactor the avx512 branch soon and merge it, so that we can both move on :)

Thanks for checking about the license. I agree with the proposed solution of putting the BSD-3 license text in the file with the imported code; that should indeed satisfy the license requirements.

I also created an issue, #106, to discuss moving this project from BSD-2 to BSD-3. It's a very small change that I think is for the better, and it will allow this project to import code from more locations. The one remaining question is if it would impact downstream users of this library, but I don't really think so because this library is generally included as a source bundle complete with license file.

aklomp commented 2 years ago

Regarding the license issue, I noticed that @WojciechMula has also published his AVX512 encoder in his base64simd project, specificially here. That library is under the BSD-2 license. That simplifies things, all we need to do is update the copyright year in Muła's mention in the LICENSE.

lucshi commented 2 years ago

Regarding the license issue, I noticed that @WojciechMula has also published his AVX512 encoder in his base64simd project, specificially here. That library is under the BSD-2 license. That simplifies things, all we need to do is update the copyright year in Muła's mention in the LICENSE.

Great！

aklomp commented 2 years ago

Alright, I pushed an avx512 branch that contains four commits, crediting you as the author, that implement the AVX512 encoder. The branch contains a number of relatively minor fixes I made to your PR; you can review them with a git diff.

Since your name is on these commits, I have to ask you: are you OK if I merge this?

lucshi commented 2 years ago

I'm OK to merge. Thank you!

aklomp / base64

enable avx512 support for base64 encoding. Reuse WojciechMula/base64-… #102