Open xiaodaigh opened 6 years ago
Hi @xiaodaigh, thanks for your question. In an R
factor, the value NA
is encoded as a NA
value in the value vector:
# some factor
x <- factor(sample(LETTERS, 10), levels = LETTERS)
# set factor value to NA
x[5] <- NA
# underlying value is set to NA
as.integer(x)
#> [1] 23 22 16 18 NA 5 8 4 2 26
This could have been done better in R
I guess, for example by coding the value 0 as the NA
, but with the current implementation that leads to an error:
# create factor manually
y <- c(1L, 2L, 3L, 0L, 4L, 5L)
attr(y, "levels") <- LETTERS
attr(y, "class") <- "factor"
print(y)
#> Error in as.character.factor(x): malformed factor
For performance reasons, fst
takes bit 32 from the factor values and adds bit 0-7 to that to get a single byte. So these 7 bits can only be used for < 128 levels. I could also re-code value 0 as an NA
, but that would require more processing and would reduce the speed of the filter...
Hope that answers your question!
I was looking into lib/factor/factor_v7.cpp and see code like
if (*nrOfLevels < 128)
. In the comment it sayswhich seems to be "wasting" the other 7 bits once that one bit is used, technically can support up to 256 distinct values (including NA and NaN).
Without much background, I assume it's to do with how R encodes the values, so it's always stored as
int
instead ofunsigned int
. I know if it's too expensive in terms of performance to relax this to 256 by converting tounsigned int
. I know Julia supports unsigned with its UInt8 type.