ende76 / brotli-rs

A Brotli implementation in pure and safe Rust
Apache License 2.0
64 stars 19 forks source link

Performance 10x worse than C implementation #24

Open danielrh opened 8 years ago

danielrh commented 8 years ago

I've been timing the C implementation versus the rust implementation and I generally notice about a factor of 8-10x difference.

Do you know offhand any obvious performance tradeoffs that were made in the design of this version?

Do you have any ideas about various strategies we could employ to bring it within a factor of two, or ideally to the same speed as the C version especially for multi-megabyte files?

I noticed no inline assembly in the C version, so I'm hoping it is possible to bring the rust version to parity.

Have you done any profiling of the existing code or compared it with the C code?

danielrh commented 8 years ago

perf record shows

  48.54%  main  main              [.] _ZN21Decompressor$LT$R$GT$10decompress19h833558899976614058E 
  32.45%  main  main              [.] _ZN7huffman4tree4Tree13lookup_symbol20h3239110865486455736E  
   8.84%  main  main              [.] _ZN9bitreader18BitReader$LT$R$GT$10read_exact20h9307349387484
   2.72%  main  libc-2.15.so      [.] __memcpy_ssse3_back                                          
   1.69%  main  main              [.] _ZN4main20h236efd01b33cc363UdaE                              
   1.13%  main  main              [.] _ZN13brotli..State9drop.582217h2d226c43866a0306E             
   1.05%  main  [unknown]         [k] 0xffffffff81391b17                                           
   0.53%  main  main              [.] _ZN3vec12Vec$LT$T$GT$19extend_with_element20h4407735263573563

As compared with the C version

  90.61%  secbrot  secbrot           [.] ProcessCommands
   4.63%  secbrot  [unknown]         [k] 0xffffffff81391b17
   1.24%  secbrot  secbrot           [.] __intel_avx_rep_memcpy
   1.24%  secbrot  [unknown]         [k] 0xffffffff81198b61
   1.23%  secbrot  [unknown]         [k] 0xffffffff81391ef5
   1.02%  secbrot  secbrot           [.] _intel_fast_memcmp
ende76 commented 8 years ago

Yes, performance has not been a focus point at all so far. I've started this project as a a way to get familiar with Rust, and also with the Brotli spec. For the implementation, I have made it a point to work only from the spec, while avoiding looking at the reference implementation, because another goal was helping to improve the Brotli spec itself. I do believe that there are a number of places where copies could be avoided, and other places where I used clone() simply because at the time I was unable otherwise to appease the borrow checker.

For a performance-oriented implementation, I would probably take the test cases and experiences so far, and start over with a new implementation.

danielrh commented 8 years ago

So I decided to work on a performance-focused version of the decompressor here: https://github.com/dropbox/rust-brotli

It passes the first few unit tests..but still has a few correctness issues I'm working through. I decided to do a very apples-to-apples port, avoiding even the rust stdlib, so that we can get a really solid performance comparison between C and Rust down to the same custom memory allocator.

I'd love comments if you have any--though of course it's still early since only the very basic tests pass

ende76 commented 8 years ago

Wow, that is a very impressive effort! Was rust-alloc-no-stdlib a direct consequence of the as-close-as-possible port? No helpful comments yet, other than that it looks like very organized code. It will certainly be a superior implementation once you've ironed out the last few correctness issues.