RevolutionAnalytics / ravro

Handling of NAs #1

Open piccolbo opened 10 years ago

piccolbo commented 10 years ago

Hi, I have a data frame containing some NAs in one column. When I write it out and read it back in, that column contains only NAs. What gives?

test case

df = 
structure(list(x = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), y = structure(1:10, .Label = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10"), class = "factor"), 
    fac = structure(c(1L, NA, NA, 2L, 2L, 1L, NA, 2L, 1L, 1L), .Label = c("b", 
    "c"), class = "factor")), .Names = c("x", "y", "fac"), row.names = c(NA, 
-10L), class = "data.frame")

write.avro(df, "/tmp/testxx.avro")
read.avro("/tmp/testxx.avro")
   x  y  fac
1  1  1 <NA>
2  1  2 <NA>
3  1  3 <NA>
4  1  4 <NA>
5  1  5 <NA>
6  1  6 <NA>
7  1  7 <NA>
8  1  8 <NA>
9  1  9 <NA>
10 1 10 <NA>
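
To pinpoint which columns lose values in a round trip like this one, the NA pattern of each column can be compared before and after. The following is a minimal base-R sketch: `na_mismatch` is a hypothetical helper, not part of ravro, and `after` is built by hand to mimic the output shown above rather than produced by `read.avro`.

```r
# Hypothetical helper (not part of ravro): names of the columns whose
# NA pattern differs between two data frames.
na_mismatch <- function(before, after) {
  cols <- intersect(names(before), names(after))
  changed <- vapply(cols, function(col) {
    !identical(is.na(before[[col]]), is.na(after[[col]]))
  }, logical(1))
  cols[changed]
}

# Small stand-in for the test case: `after` imitates a column that came
# back as all NA.
before <- data.frame(x = c(1, 1, 1), fac = factor(c("b", NA, "c")))
after <- data.frame(x = c(1, 1, 1),
                    fac = factor(c(NA, NA, NA), levels = c("b", "c")))

na_mismatch(before, after)  # returns "fac"
```

With the real data, `before` would be `df` and `after` the result of `read.avro("/tmp/testxx.avro")`.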

Thanks

jamiefolson commented 10 years ago

That is a very good question. A factor containing NAs likely needs to be serialized as a union of an enum with null, which may currently be either not serialized at all or serialized incorrectly. I made some simplifying assumptions that were sufficient at the time, but we may need to extend the logic around unions.

(Sent by email in reply to Antonio Piccolboni's post of Oct 20, 2014, 7:24 PM.)
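
For illustration, the "union of an enum with null" mentioned above could look roughly like the following. This is a minimal base-R sketch under stated assumptions: it does not call ravro or reflect its internals, and `fac_schema`, `encode_nullable_enum`, and `decode_nullable_enum` are hypothetical names. The schema is written as a nested R list mirroring the Avro JSON form `["null", {"type": "enum", ...}]`.

```r
# The factor column from the test case, with NAs in positions 2, 3 and 7.
fac <- factor(c("b", NA, NA, "c", "c", "b", NA, "c", "b", "b"),
              levels = c("b", "c"))

# Hypothetical field schema: a union of "null" and an enum whose symbols
# are the factor levels.
fac_schema <- list(
  name = "fac",
  type = list(
    "null",
    list(type = "enum", name = "fac_levels", symbols = levels(fac))
  )
)

# Encode: NA takes the null branch (NULL); any other value takes the enum
# branch and is stored as its level label.
encode_nullable_enum <- function(f) {
  lapply(as.character(f), function(v) if (is.na(v)) NULL else v)
}

# Decode: NULL becomes NA again, labels are mapped back onto the symbols.
decode_nullable_enum <- function(values, symbols) {
  labels <- vapply(values,
                   function(v) if (is.null(v)) NA_character_ else v,
                   character(1))
  factor(labels, levels = symbols)
}

encoded <- encode_nullable_enum(fac)
roundtrip <- decode_nullable_enum(encoded, fac_schema$type[[2]]$symbols)
stopifnot(identical(roundtrip, fac))
```

The point of the sketch is that a correct round trip has to choose a union branch per value; treating the whole column as one branch would lose either the NAs or the labels.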