ContinuumIO / cyberpandas

IP Address dtype and block for pandas
BSD 3-Clause "New" or "Revised" License

Factorize fix #11

Closed TomAugspurger closed 6 years ago

TomAugspurger commented 6 years ago

This fixes an issue in the old factorization method, which didn't properly account for missing values. For example, the input

[B, B, NA, NA, A, B]

should factorize as [0, 0, -1, -1, 1, 0], with NA mapped to the sentinel code -1. Previously, NA was not handled and was assigned a code of its own, so the result was [0, 0, 1, 1, 2, 0].
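
To make the expected behavior concrete, here is a minimal pure-Python sketch of factorization with a missing-value sentinel. It is not the code from this PR: the function name `factorize_with_na` and the use of `None` to stand in for NA are assumptions made for illustration only.

```python
def factorize_with_na(values, na_value=None):
    """Return (codes, uniques); missing entries get the sentinel code -1.

    Hypothetical helper for illustration, not part of cyberpandas.
    """
    codes = []
    uniques = []
    seen = {}  # value -> integer code, in order of first appearance
    for value in values:
        if value is na_value:
            # Missing values never enter `uniques`; they only get the sentinel.
            codes.append(-1)
            continue
        if value not in seen:
            seen[value] = len(uniques)
            uniques.append(value)
        codes.append(seen[value])
    return codes, uniques


codes, uniques = factorize_with_na(["B", "B", None, None, "A", "B"])
print(codes)    # [0, 0, -1, -1, 1, 0]
print(uniques)  # ['B', 'A']
```

Because NA is skipped rather than stored, the list of uniques contains only real values, and codes for later values (here A) are not shifted by the presence of missing entries.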

Numba gave a 285x speedup (after JIT warmup) on a benchmark with 10,000 values.