fstpackage / fstlib

A C++ library for lightning fast multi-threaded serialization of tabular data. Home to the `fst` file format.
Mozilla Public License 2.0
37 stars 9 forks source link

Can the small factors levels limits be increased from <128? #4

Open xiaodaigh opened 6 years ago

xiaodaigh commented 6 years ago

I was looking into lib/factor/factor_v7.cpp and see code like if (*nrOfLevels < 128). In the comment it says

// use 1 byte per int (Na encoding takes 1 bit)

which seems to be "wasting" the other 7 bits once that one bit is used, technically can support up to 256 distinct values (including NA and NaN).

Without much background, I assume it's to do with how R encodes the values, so it's always stored as int instead of unsigned int. I know if it's too expensive in terms of performance to relax this to 256 by converting to unsigned int. I know Julia supports unsigned with its UInt8 type.

MarcusKlik commented 6 years ago

Hi @xiaodaigh, thanks for your question. In an R factor, the value NA is encoded as a NA value in the value vector:

# some factor
x <- factor(sample(LETTERS, 10), levels = LETTERS)

# set factor value to NA
x[5] <- NA

# underlying value is set to NA
as.integer(x)
#>  [1] 23 22 16 18 NA  5  8  4  2 26

This could have been done better in R I guess, for example by coding the value 0 as the NA, but with the current implementation that leads to an error:

# create factor manually
y <- c(1L, 2L, 3L, 0L, 4L, 5L)
attr(y, "levels") <- LETTERS
attr(y, "class") <- "factor"

print(y)
#> Error in as.character.factor(x): malformed factor

For performance reasons, fst takes bit 32 from the factor values and adds bit 0-7 to that to get a single byte. So these 7 bits can only be used for < 128 levels. I could also re-code value 0 as an NA, but that would require more processing and would reduce the speed of the filter...

Hope that answers your question!