Can the small factors levels limits be increased from <128?

fstpackage / fstlib

A C++ library for lightning fast multi-threaded serialization of tabular data. Home to the `fst` file format.

Mozilla Public License 2.0

Hi @xiaodaigh, thanks for your question. In an R factor, a missing value is stored as NA in the underlying integer vector:

# some factor
x <- factor(sample(LETTERS, 10), levels = LETTERS)

# set factor value to NA
x[5] <- NA

# underlying value is set to NA
as.integer(x)
#>  [1] 23 22 16 18 NA  5  8  4  2 26

R could arguably have handled this better, for example by using the value 0 to encode NA, but in the current implementation a 0 in the value vector leads to an error:

# create factor manually
y <- c(1L, 2L, 3L, 0L, 4L, 5L)
attr(y, "levels") <- LETTERS
attr(y, "class") <- "factor"

print(y)
#> Error in as.character.factor(x): malformed factor

For performance reasons, fst takes the highest bit (bit 32) of each factor value, which is set when the value is NA, and combines it with the lowest 7 bits to form a single byte. Those 7 bits can therefore only represent factors with fewer than 128 levels. I could also recode the value 0 as NA, but that would require more processing and would reduce the speed of the filter...
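To make the byte packing described above concrete, here is a minimal C++ sketch. It is not taken from the fstlib source: the names `kNaInteger`, `pack_small_factor` and `unpack_small_factor` are hypothetical, and the only fact assumed about R is that `NA_integer_` is stored as the smallest 32-bit integer (bit pattern `0x80000000`), so its highest bit is set and its low bits are zero.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// R stores NA_integer_ as the smallest 32-bit integer (bit pattern 0x80000000).
constexpr std::int32_t kNaInteger = std::numeric_limits<std::int32_t>::min();

// Hypothetical helper: pack one factor value into a single byte.
// The highest bit of the 32-bit value (the NA marker) becomes bit 7 of the
// byte, and the lowest 7 bits carry the level index.
inline std::uint8_t pack_small_factor(std::int32_t value) {
    // Only NA or level indices 1..127 fit in this scheme.
    assert(value == kNaInteger || (value >= 1 && value <= 127));
    const std::uint32_t bits = static_cast<std::uint32_t>(value);
    const std::uint32_t na_flag = (bits >> 24) & 0x80u;  // top bit moved to bit 7
    const std::uint32_t low7 = bits & 0x7Fu;             // level index
    return static_cast<std::uint8_t>(na_flag | low7);
}

// Hypothetical helper: restore the 32-bit factor value from the packed byte.
inline std::int32_t unpack_small_factor(std::uint8_t packed) {
    return (packed & 0x80u) ? kNaInteger
                            : static_cast<std::int32_t>(packed & 0x7Fu);
}
```

With this layout, level indices 1 to 127 map to the bytes `0x01` to `0x7F` and NA maps to `0x80`. A level index of 128 would have all of its low 7 bits equal to zero and could not be told apart from an invalid 0, which is why the scheme stops below 128 levels.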

Hope that answers your question!

Can the small factors levels limits be increased from <128? #4