cockroachdb / swiss

Go port of Google's Swiss Table hash table
Apache License 2.0
317 stars 12 forks source link

add ctrlGroup and ctrlBytes types #12

Closed RaduBerinde closed 9 months ago

RaduBerinde commented 9 months ago

This is what I had in mind.

add ctrlGroup and ctrlBytes types

This commit adds ctrlGroup which represents an 8-byte group and ctrlBytes which adds more functionality on top of the unsafeSlice.

We now retrieve a ctrlGroup from ctrlBytes and the matching functions are methods on ctrlGroup. Previously, these methods were casting from the *ctrl receiver.

cockroach-teamcity commented 9 months ago

This change is Reviewable

RaduBerinde commented 9 months ago

Added two more runs to the data:

StringMap/avgLoad,n=10/swissMap/Get-10         8.35ns ± 1%      8.46ns ± 1%  +1.34%  (p=0.000 n=15+15)
StringMap/avgLoad,n=83/swissMap/Get-10         8.77ns ± 9%      8.99ns ±11%    ~     (p=0.140 n=15+16)
StringMap/avgLoad,n=671/swissMap/Get-10        8.91ns ± 5%      9.06ns ± 3%  +1.66%  (p=0.024 n=16+15)
StringMap/avgLoad,n=5375/swissMap/Get-10       9.25ns ± 2%      9.41ns ± 1%  +1.70%  (p=0.000 n=16+16)
StringMap/avgLoad,n=86015/swissMap/Get-10      11.1ns ± 3%      11.1ns ± 3%    ~     (p=0.323 n=15+16)
RaduBerinde commented 9 months ago

I think making the ctrlGroup methods have pointer receivers (and using GroupAt) fixed the difference. I assume we were doing an extra unnecessary copy to a register first.

StringMap/avgLoad,n=10/swissMap/Get-4         8.38ns ± 1%      8.37ns ± 1%   ~     (p=0.723 n=18+20)
StringMap/avgLoad,n=83/swissMap/Get-4         8.87ns ±11%      8.81ns ± 9%   ~     (p=0.829 n=20+24)
StringMap/avgLoad,n=671/swissMap/Get-4        8.96ns ± 3%      8.93ns ± 3%   ~     (p=0.368 n=20+24)
StringMap/avgLoad,n=5375/swissMap/Get-4       9.33ns ± 1%      9.31ns ± 2%   ~     (p=0.465 n=20+24)
StringMap/avgLoad,n=86015/swissMap/Get-4      11.2ns ± 3%      11.1ns ± 3%   ~     (p=0.823 n=18+20)
Int64Map/avgLoad,n=10/swissMap/Get-4          4.86ns ± 1%      4.86ns ± 1%   ~     (p=0.872 n=17+17)
Int64Map/avgLoad,n=83/swissMap/Get-4          5.32ns ±12%      5.36ns ±13%   ~     (p=0.542 n=20+20)
Int64Map/avgLoad,n=671/swissMap/Get-4         5.42ns ± 7%      5.41ns ± 7%   ~     (p=0.815 n=20+20)
Int64Map/avgLoad,n=5375/swissMap/Get-4        5.85ns ± 1%      5.82ns ± 3%   ~     (p=0.189 n=20+20)
Int64Map/avgLoad,n=86015/swissMap/Get-4       7.02ns ± 1%      7.01ns ± 1%   ~     (p=0.230 n=19+18)
petermattis commented 9 months ago

Hmm, I would not have guessed that a pointer vs value receiver would have made a difference given both are 64-bit. Feels like a compiler deficiency if it is reproducible. The change LGTM.

RaduBerinde commented 9 months ago

Yeah no longer sure that was it. The assembly looks almost the same between the two. It might have been measurement error (the last numbers were obtained by switching back and forth 10 times (with -count 5 -benchtime 100000000x each time).

Assembly for match := g.matchH2(h2(h)) in Get. Pointer receiver:

  0x10016bb70       f9400045        MOVD (R2), R5       
    return h & 0x7f
  0x10016bb74       92401806        AND $127, R0, R6    
    v := uint64(*g) ^ (bitsetLSB * uint64(h))
  0x10016bb78       f86468a5        MOVD (R5)(R4), R5       
  0x10016bb7c       b200c3e7        MOVD $72340172838076673, R7 
  0x10016bb80       9b067ce6        MUL R6, R7, R6          
  0x10016bb84       ca0600a8        EOR R6, R5, R8          
    return bitset(((v - bitsetLSB) &^ v) & bitsetMSB)
  0x10016bb88       cb070108        SUB R7, R8, R8      
  0x10016bb8c       ca2600a6        EON R6, R5, R6      
  0x10016bb90       8a060106        AND R6, R8, R6      
        match := g.matchH2(h2(h))
  0x10016bb94       d503201f        NOOP            
  0x10016bb98       d503201f        NOOP            
    return bitset(((v - bitsetLSB) &^ v) & bitsetMSB)
  0x10016bb9c       9201c0c6        AND $-9187201950435737472, R6, R6   

Value receiver:

  0x10016bb50       f9400045        MOVD (R2), R5       
        match := g.matchH2(h2(h))
  0x10016bb54       f86468a5        MOVD (R5)(R4), R5   
    return h & 0x7f
  0x10016bb58       92401806        AND $127, R0, R6    
    v := uint64(g) ^ (bitsetLSB * uint64(h))
  0x10016bb5c       b200c3e7        MOVD $72340172838076673, R7 
  0x10016bb60       9b067ce6        MUL R6, R7, R6          
  0x10016bb64       ca0600a8        EOR R6, R5, R8          
    return bitset(((v - bitsetLSB) &^ v) & bitsetMSB)
  0x10016bb68       cb070108        SUB R7, R8, R8              
  0x10016bb6c       ca2600a6        EON R6, R5, R6              
  0x10016bb70       8a060106        AND R6, R8, R6              
  0x10016bb74       9201c0c6        AND $-9187201950435737472, R6, R6   
petermattis commented 9 months ago

The ordering of the first 3 instructions differ. With the value receiver we're doing 2 adjacent loads where the second depends on the first. Mildly hard to believe that is the difference. 🤷🏼 The pointer receiver seems just as understandable. I'd just go with that unless you have a desire to understand this further.

RaduBerinde commented 9 months ago

I'd just go with that unless you have a desire to understand this further.

Agreed, and thanks for stopping me from nerd-sniping myself :)