RevolutionAnalytics / ravro


write.avro fails on data frame #3

Open piccolbo opened 9 years ago

piccolbo commented 9 years ago

Error is

ravro:::write.avro(df, tf1)
Exception in thread "main" org.apache.avro.SchemaParseException: Enum has no symbols: {"name":"col_2","type":"enum","symbols":"d"}
    at org.apache.avro.Schema.parse(Schema.java:1121)
    at org.apache.avro.Schema.parse(Schema.java:1094)
    at org.apache.avro.Schema$Parser.parse(Schema.java:927)
    at org.apache.avro.Schema$Parser.parse(Schema.java:917)
    at org.apache.avro.Schema.parse(Schema.java:966)
    at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:91)
    at org.apache.avro.tool.Main.run(Main.java:80)
    at org.apache.avro.tool.Main.main(Main.java:69)

dump of df

df <- structure(
  list(
    col_1 = 139.084976531123,
    col_2 = structure(1L, .Label = "d", class = "factor"),
    col_3 = TRUE,
    col_4 = FALSE,
    col_5 = -11.3948273417181,
    col_6 = 90.2836501356233,
    col_7 = structure(1L, .Label = "", class = "factor"),
    col_8 = structure(1L, .Label = "57be", class = "factor")),
  .Names = c("col_1", "col_2", "col_3", "col_4", "col_5", "col_6", "col_7", "col_8"),
  row.names = c(NA, -1L),
  class = "data.frame")

Another instance

Exception in thread "main" org.apache.avro.SchemaParseException: Enum has no symbols: {"name":"col_1","type":"enum","symbols":"_6f7a4bc347_ravro"}
    at org.apache.avro.Schema.parse(Schema.java:1121)
    at org.apache.avro.Schema.parse(Schema.java:1094)
    at org.apache.avro.Schema$Parser.parse(Schema.java:927)
    at org.apache.avro.Schema$Parser.parse(Schema.java:917)
    at org.apache.avro.Schema.parse(Schema.java:966)
    at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:91)
    at org.apache.avro.tool.Main.run(Main.java:80)
    at org.apache.avro.tool.Main.main(Main.java:69)

Dump

df <- structure(
  list(
    col_1 = structure(1L, .Label = "6f7a4bc347", class = "factor"),
    col_2 = structure(1L, .Label = "46f315f9", class = "factor"),
    col_3 = -158.916518470489,
    col_4 = -72.4716823839384,
    col_5 = 34L,
    col_6 = structure(1L, .Label = "6f7a", class = "factor"),
    col_7 = -10L,
    col_8 = 10L),
  .Names = c("col_1", "col_2", "col_3", "col_4", "col_5", "col_6", "col_7", "col_8"),
  row.names = c(NA, -1L),
  class = "data.frame")

My theory, from several examples, is that the failure occurs iff the input is a data frame with a single row and at least one factor column.
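One way to probe this theory before calling write.avro is to look for factor columns with exactly one level, which is the shape both dumps share and which a one-row data frame with freshly built factors always has. The sketch below is a minimal check in base R. The helper name single_level_factors is hypothetical and not part of ravro, and the two data frames are made-up illustrations rather than the dumps above.

```r
# Hypothetical helper (not part of ravro): names of the factor columns
# that have exactly one level, the shape seen in both dumps.
single_level_factors <- function(df) {
  is_single <- vapply(
    df,
    function(col) is.factor(col) && nlevels(col) == 1L,
    logical(1)
  )
  names(df)[is_single]
}

# Illustrative inputs only: one row with a factor vs. two rows with two levels.
one_row  <- data.frame(col_1 = 139.08, col_2 = factor("d"), col_3 = TRUE)
two_rows <- data.frame(col_1 = c(1, 2), col_2 = factor(c("d", "e")))

single_level_factors(one_row)   # "col_2"
single_level_factors(two_rows)  # character(0)
```

If the theory holds, any data frame for which this returns a non-empty result would be expected to trigger the "Enum has no symbols" error.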

jamiefolson commented 9 years ago

My two thoughts:

1) Does Avro allow an enum with only one level?
2) If an enum is allowed to have a single level, we might need to change the enum levels from a character vector to a list, so that toJSON will produce ["d"] instead of "d".
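The second point can be sketched as follows. This is only an illustration of the idea, not ravro's actual schema-building code: enum_symbols is a hypothetical helper, and the use of rjson as the JSON encoder is an assumption (the call is skipped if the package is not installed).

```r
# Hypothetical helper (not ravro's actual code): always return the enum
# symbols as a list, so that a JSON encoder emits an array even when the
# factor has a single level.
enum_symbols <- function(f) as.list(levels(f))

f <- factor("d")
is.character(levels(f))   # TRUE: a length-one character vector
is.list(enum_symbols(f))  # TRUE: a list holding one string

# Assumption: rjson is the JSON encoder in use; skipped if not installed.
if (requireNamespace("rjson", quietly = TRUE)) {
  cat(rjson::toJSON(levels(f)), "\n")        # scalar form: "d"
  cat(rjson::toJSON(enum_symbols(f)), "\n")  # array form: ["d"]
}
```

The list form costs nothing for factors with several levels, since those already serialize as arrays, so the conversion could be applied unconditionally.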

Jamie Olson

(In reply to Antonio Piccolboni's report of Tue, Mar 3, 2015, 3:59 PM.)

piccolbo commented 9 years ago

From reading the spec, I think a single-level enum is admissible, but I am not sure fixing this should be very high on our priority list. How useful are single-level enums in real life? I modified my tests to generate at least two levels. I think we can reasonably delay this until there is a second request.
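For reference, one way to generate test factors that always carry at least two levels is to fix the level set explicitly instead of letting it be inferred from the sampled values. This is a sketch in base R under that assumption, not the actual test change. The name random_factor and the default pool are hypothetical.

```r
# Hypothetical test-data generator: fixing `levels` guarantees at least two
# levels even when only one value is drawn (for example n = 1).
random_factor <- function(n, pool = c("a", "b", "c")) {
  stopifnot(length(pool) >= 2L)
  factor(sample(pool, n, replace = TRUE), levels = pool)
}

set.seed(1)
f <- random_factor(1)
length(f)   # 1
nlevels(f)  # 3, although only one value is present
```

Whether this avoids the error depends on the writer deriving the enum symbols from levels(f) rather than from the values actually present, which is an assumption here.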

piccolbo commented 9 years ago

I mean you can close this as "won't fix", AFAIK.