Cyan4973 / xxHash

Extremely fast non-cryptographic hash algorithm
http://www.xxhash.com/

Clang 9.0.0 x86_64 performance regression in non-dispatched code #398

Closed easyaspi314 closed 4 years ago

easyaspi314 commented 4 years ago

There appears to be a large performance regression with Clang on x86_64 in the non-dispatched code path. GCC is not affected; only Clang seems to be.

MacBookPro8,2 (15-inch, Early 2011) 2.0GHz Intel Core i7-2635QM (Sandy Bridge)

Clang 9.0.0 (macOS uses -msse4.1 by default)

dev (dispatch disabled)

 5#XXH3_64b                      :     102400 ->    87853 it/s ( 8579.4 MB/s)   
 6#XXH3_64b unaligned            :     102400 ->    82803 it/s ( 8086.2 MB/s)   
 7#XXH3_64b w/seed               :     102400 ->   134703 it/s (13154.6 MB/s)   
 8#XXH3_64b w/seed unaligned     :     102400 ->   126697 it/s (12372.8 MB/s)   
 9#XXH3_64b w/secret             :     102400 ->   123520 it/s (12062.5 MB/s)   
10#XXH3_64b w/secret unaligned   :     102400 ->   114818 it/s (11212.7 MB/s)   
11#XXH128                        :     102400 ->    77291 it/s ( 7548.0 MB/s)   
12#XXH128 unaligned              :     102400 ->    70927 it/s ( 6926.5 MB/s)   
13#XXH128 w/seed                 :     102400 ->   125048 it/s (12211.7 MB/s)   
14#XXH128 w/seed unaligned       :     102400 ->   118428 it/s (11565.2 MB/s)   
15#XXH128 w/secret               :     102400 ->    84696 it/s ( 8271.1 MB/s)   
16#XXH128 w/secret unaligned     :     102400 ->    74465 it/s ( 7272.0 MB/s) 

1b14f648d63ddbf66a99e043c472789575c3673e

XXH3_64b                      :     102400 ->   135562 it/s (13238.5 MB/s)      
XXH3_64b unaligned            :     102400 ->   125060 it/s (12212.9 MB/s)      
XXH3_64b w/seed               :     102400 ->   135817 it/s (13263.4 MB/s)      
XXH3_64b w/seed unaligned     :     102400 ->   128615 it/s (12560.1 MB/s)      
XXH3_64b w/secret             :     102400 ->   124632 it/s (12171.0 MB/s)      
XXH3_64b w/secret unaligned   :     102400 ->   115404 it/s (11269.9 MB/s)      
XXH128                        :     102400 ->   124097 it/s (12118.9 MB/s)      
XXH128 unaligned              :     102400 ->   117373 it/s (11462.3 MB/s)      
XXH128 w/seed                 :     102400 ->   122675 it/s (11980.0 MB/s)      
XXH128 w/seed unaligned       :     102400 ->   115714 it/s (11300.2 MB/s)      
XXH128 w/secret               :     102400 ->   104453 it/s (10200.5 MB/s)      
XXH128 w/secret unaligned     :     102400 ->   101271 it/s ( 9889.8 MB/s)    

Currently investigating.
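To put a number on the regression, the MB/s figures above can be compared directly. The helper below is only an illustrative sketch (the function name is made up for this comment, it is not part of xxHash or its benchmark tool); it computes the relative change of a current throughput against a baseline.

```c
/* Hypothetical helper, not part of xxHash: relative throughput change
 * in percent. A negative result means the current build is slower
 * than the baseline. The baseline must be non-zero. */
double throughput_change_percent(double baseline_mb_s, double current_mb_s)
{
    return (current_mb_s - baseline_mb_s) / baseline_mb_s * 100.0;
}
```

Called with the XXH3_64b figures (baseline 13238.5 MB/s at 1b14f648, 8579.4 MB/s on dev) it returns roughly -35%; with the XXH128 figures (12118.9 vs 7548.0 MB/s) roughly -38%. The w/seed variants are essentially unchanged between the two runs.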

easyaspi314 commented 4 years ago

Re-adding the specialization for defaultSecret brings back the performance, but I am a little confused about why it is not running at exactly the same speed as withSecret.

Additionally, just tail calling XXH3_64bits_withSecret has the correct performance.
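For readers following along, the two options mentioned above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the actual xxHash source: every `toy_*` name is hypothetical, the secret bytes are arbitrary, and the mixing loop is a simple FNV-1a-style stand-in rather than the XXH3 algorithm.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a library's built-in default secret. */
static const uint8_t toy_default_secret[8] = {
    0xb8, 0xfe, 0x6c, 0x39, 0x23, 0xa4, 0x4b, 0xbe
};

/* Shared core. Being static inline, the compiler may instantiate it
 * separately in each caller. secret_len must be non-zero. */
static inline uint64_t toy_hash_internal(const void *data, size_t len,
                                         const uint8_t *secret,
                                         size_t secret_len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint64_t acc = 0xcbf29ce484222325ULL;       /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        acc ^= (uint64_t)(p[i] ^ secret[i % secret_len]);
        acc *= 0x100000001b3ULL;                /* FNV prime */
    }
    return acc;
}

/* Generic entry point: the caller supplies the secret at run time. */
uint64_t toy_hash_with_secret(const void *data, size_t len,
                              const uint8_t *secret, size_t secret_len)
{
    return toy_hash_internal(data, len, secret, secret_len);
}

/* Option 1: a dedicated specialization for the default secret. The
 * secret and its length are compile-time constants here, so the
 * optimizer is free to generate a separate copy of the core. */
uint64_t toy_hash_default_specialized(const void *data, size_t len)
{
    return toy_hash_internal(data, len, toy_default_secret,
                             sizeof toy_default_secret);
}

/* Option 2: forward to the generic entry point as a tail call, so the
 * default path runs the same code as the with-secret path. */
uint64_t toy_hash_default_tailcall(const void *data, size_t len)
{
    return toy_hash_with_secret(data, len, toy_default_secret,
                                sizeof toy_default_secret);
}
```

Both default variants return the same value as `toy_hash_with_secret` called with `toy_default_secret`; they differ only in what code the compiler ends up generating for the default path, which is the part that matters for the timings discussed here.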

Cyan4973 commented 4 years ago

Thanks for investigating, @easyaspi314. Indeed, this issue doesn't show up with gcc. Anyway, the proposed fix seems simple enough.

edit: Strange, I don't see the impact, either with my own version of clang on macOS or with clang v10.0 on Ubuntu 20.04.

edit 2: Also tried clang v9.0.1 on Ubuntu 20.04, no impact either.

edit 3: I can see a ~10% impact with clang v8.0.1, which is small enough to be attributed to other causes, such as instruction alignment.

edit 4: Switched to -O2 (instead of -O3) in the hope of reproducing the issue, without success. The performance issue is still not observed.

Cyan4973 commented 4 years ago

I was unable to reproduce the issue on my platforms, but went ahead and produced a fix nonetheless (#398).

It's a logical fix, so I presume it should resolve this performance issue on the platforms that suffer from it.

Cyan4973 commented 4 years ago

398 merged.

I'm still interested in knowing if it solves the reported issue on your system.