SciRuby / daru-io

daru-io is a plugin gem for the existing daru gem that adds support for importing DataFrames from, and exporting DataFrames to, multiple formats.
http://www.rubydoc.info/github/athityakumar/daru-io/master/
MIT License

Modules from_avro and to_avro #11

athityakumar closed 7 years ago

athityakumar commented 7 years ago

@zverok - I'm planning to work on the Avro Importer (and Exporter) next week and would like some clarity on both. (I haven't used Avro much, so please pardon the n00b questions 😉)

zverok commented 7 years ago

No idea either :) Never used Avro myself. Let's investigate the matter in parallel and write here what we find?

athityakumar commented 7 years ago

Sure. From the examples mentioned here and here, I understand that an Avro file contains the schema (i.e., the datatype class, such as String / Integer / ...) of the columns.

df = Daru::DataFrame.new(name: %w[Dany Jon Tyrion], age: %w[35 30 40])
df[:age].to_a
#=> ["35", "30", "40"]

df.use_avro('path/to/avro/file') #! Avro schema contains name: String, age: Integer
df[:age].to_a
#=> [35, 30, 40]

df = Daru::DataFrame.new(name: %w[Dany Jon Tyrion], age: [35, nil, 40]) #! nil, because data isn't available (say)
df.to_avro('path/to/avro/file')
#=> TypeError: Column 'age' contains values of different classes - Fixnum & NilClass.

zverok commented 7 years ago

From the examples mentioned here and here, I understand that an Avro file contains the schema (i.e., the datatype class, such as String / Integer / ...) of the columns.

I am not sure this is right. Judging by the definition and examples on Wikipedia, I believe .avro files contain schema AND data.

And this page contains some multi-megabyte example datasets, so I doubt each one is just a schema ;)

athityakumar commented 7 years ago

My bad, really sorry. I went through the above links and YES - an Avro file does indeed contain both schema & data. Previously, I was unable to find any examples that contain data. But I recently had a look at this gem of a link, and you're quite right. Thanks a lot! I'll start working on this soon. 😄

P.S. - It wasn't about not finding fixture files that contain data. In fact, all Avro files do contain data. It was just the methods that reveal the data that I wasn't able to find in the avro gem until just recently.

athityakumar commented 7 years ago

Avro Importer is quite sorted out now. 😄

Regarding the Avro Exporter, I think the schema details should be provided by the user. But can we attempt (now, or maybe later?) to 'guess' the schema details (like :type, :name and :fields) from the Daru::DataFrame? Or would this be too unreliable / unnecessarily hacky?
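
To make the 'guessing' idea concrete, here is a rough sketch of what it could look like. It is only an illustration under some assumptions: the names guess_avro_type, guess_avro_schema and AVRO_PRIMITIVES are hypothetical (they are not part of daru-io or the avro gem), the columns are passed as a plain Hash instead of a Daru::DataFrame so the snippet stays self-contained, and the class-to-type mapping is one possible choice. A column containing nil is guessed as a nullable union, and a column mixing other classes is rejected.

```ruby
require 'json'

# Hypothetical mapping from Ruby classes to Avro primitive type names.
# Integer is mapped to 'long' (64-bit); values outside that range would
# need separate handling, which this sketch does not attempt.
AVRO_PRIMITIVES = {
  Integer    => 'long',
  Float      => 'double',
  String     => 'string',
  TrueClass  => 'boolean',
  FalseClass => 'boolean'
}.freeze

# Guess the Avro type of a single column from its values.
# Returns a type name, or ['null', type] when the column contains nil.
def guess_avro_type(column_name, values)
  present = values.compact

  types = present.map do |value|
    match = AVRO_PRIMITIVES.find { |klass, _| value.is_a?(klass) }
    unless match
      raise TypeError,
            "Column '#{column_name}' has unsupported value #{value.inspect}"
    end
    match.last
  end.uniq

  if types.size > 1
    raise TypeError,
          "Column '#{column_name}' mixes types: #{types.join(', ')}"
  end

  base     = types.first || 'null' # a column with no non-nil values
  nullable = present.size < values.size
  nullable && base != 'null' ? ['null', base] : base
end

# Build a record schema (as a Hash) with :type, :name and :fields guessed
# from a Hash of column name => Array of values.
def guess_avro_schema(columns, name: 'dataframe')
  {
    'type'   => 'record',
    'name'   => name,
    'fields' => columns.map do |column_name, values|
      { 'name' => column_name.to_s,
        'type' => guess_avro_type(column_name, values) }
    end
  }
end

columns = { name: %w[Dany Jon Tyrion], age: [35, nil, 40] }
schema  = guess_avro_schema(columns, name: 'characters')
puts JSON.pretty_generate(schema)
```

The resulting Hash could presumably be serialized with to_json and handed to Avro::Schema.parse from the avro gem. A user-provided schema would still need to take precedence, since a guess cannot tell apart cases like int vs long or a string column that is really an enum.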