SciRuby / daru-io

daru-io is a plugin gem for the existing daru gem, which aims to add support for importing DataFrames from, and exporting DataFrames to, multiple formats.
http://www.rubydoc.info/github/athityakumar/daru-io/master/
MIT License

Module to_json #10

Closed: athityakumar closed this issue 7 years ago

athityakumar commented 7 years ago

Followed from this issue tracker.

athityakumar commented 7 years ago

@zverok - This is planned to be done next week, and it could be really helpful to discuss the various to_json output options. These are some options I have in mind, but please feel free to share any use-case(s) you might prefer (anything with blocks / json-paths?). 😄

df = Daru::DataFrame.new [[1,2],[3,4]], order: [:x, :y], index: [:a, :b]

df.to_json
#=> {:data=>[[1,2],[3,4]], :order=>[:x, :y], :index=>[:a, :b]}

df.to_json :order_first
#=> {:x=>{:a=>1, :b=>2}, :y=>{:a=>3, :b=>4}}

df.to_json :index_first
#=> {:a=>{:x=>1, :y=>3}, :b=>{:x=>2, :y=>4}}
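
For illustration, the two keyed orientations could be derived roughly like this (a sketch assuming daru's DataFrame#to_h maps column names to Vectors and Vector#to_h maps index labels to values):

# :order_first - one hash per column, keyed by index label
order_first = df.to_h.map { |col, vec| [col, vec.to_h] }.to_h
#=> {:x=>{:a=>1, :b=>2}, :y=>{:a=>3, :b=>4}}

# :index_first - invert the nesting to get one hash per row
index_first = order_first.each_with_object({}) do |(col, rows), acc|
  rows.each { |idx, val| (acc[idx] ||= {})[col] = val }
end
#=> {:a=>{:x=>1, :y=>3}, :b=>{:x=>2, :y=>4}}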
abinashmeher999 commented 7 years ago

Hello @athityakumar and @zverok, if you don't mind me chiming in, I would like to give my 2 cents on this issue. This is just a suggestion from someone who has come across exporting to JSON before but isn't very experienced with daru.

Usually in a dataframe (or a spreadsheet, for that matter), every row is an entry and the columns are attributes. So semantically, every row/entry from the leftmost column to the rightmost column is an 'object'. Each cell is then a 'property' of that object, with the column name as the 'name'/key and its corresponding cell value as the 'value'. One would expect the default behavior of df.to_json to translate in that way.

For example, the sales-funnel DataFrame here would translate to

[
 {
   "Account": 714466,
   "Name": "Trantow-Barrows",
   "Rep": "Craig Booker",
   "Manager": "Debra Henley",
   "Product": "CPU",
   "Quantity": 1,
   "Price": 30000,
   "Status": "presented"
 },
 {
   "Account": 714466,
   "Name": "Trantow-Barrows",
   "Rep": "Craig Booker",
   "Manager": "Debra Henley",
   "Product": "Software",
   "Quantity": 1,
   "Price": 10000,
   "Status": "presented"
 },
...
]

P.S. I used the tool http://www.convertcsv.com/csv-to-json.htm for the example output above. But yeah, CSVs and dataframes are quite different, and the above logic might not suffice for all cases, since dataframes carry index information too.
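
A rough sketch of that default row-wise mapping, assuming daru's map_rows yields row Vectors and Vector#to_h maps column labels to values (sales_funnel_df is a placeholder name for the DataFrame above):

require 'daru'
require 'json'

# One JSON object per row, keyed by column name
sales_funnel_df.map_rows(&:to_h).to_json
#=> '[{"Account":714466,"Name":"Trantow-Barrows",...},...]'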

EDIT: I hadn't seen the other discussion 😅 apologies. As @athityakumar has suggested, the default of df.to_json might be able to handle all cases, but again, the output of df.to_json :index_first seems more intuitive. IMHO it would be better if the output of the current default could be obtained through something like df.to_json :data_first (bad name, I know 😛), with df.to_json and df.to_json :index_first giving the same output.

zverok commented 7 years ago

My opinion(s):

  1. What @abinashmeher999 says seems absolutely reasonable to me, so I believe the default output should follow his suggestion.
  2. For other forms of output, I'd say that flexibility should be considered first, and convenience last (because on top of a flexible solution, it is pretty easy to build several convenience shortcuts). So, let's try to design some generic approach (JsonPath should be OK, probably) that allows at least all three forms mentioned in the initial message of this thread; as soon as a generic approach is found, several shortcuts will be really easy to define.

OK?..

athityakumar commented 7 years ago

@zverok - Sadly, the JsonPath gem doesn't support creating nested hashes from a corresponding jsonpath, like $..person..name => hash[:person][:name]. But a workaround involving deep_merge of Hashes to recursively create a complexly nested JSON response seems to be working (at least with the index-first feature). Yet to be battle-tested though. Have a look at this temporary Ruby Script and output JSON file.
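
The workaround boils down to something like this rough sketch (hypothetical helper names, not the actual script):

# Build a nested hash from one jsonpath-style string and a value
def nest(path, value)
  keys = path.delete('$').split('.').reject(&:empty?).map(&:to_sym)
  keys.reverse.inject(value) { |acc, key| { key => acc } }
end

# Recursively merge nested hashes, descending into hash-valued keys
def deep_merge(a, b)
  a.merge(b) { |_key, x, y| x.is_a?(Hash) && y.is_a?(Hash) ? deep_merge(x, y) : y }
end

deep_merge(nest('$.person.name', 'Jon Snow'), nest('$.person.age', 18))
#=> {:person=>{:name=>"Jon Snow", :age=>18}}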

Meanwhile, how exactly should index values be accommodated within the json response?

zverok commented 7 years ago

OK, I've tried to wrap my head around it, and here's what I've got.

  1. As discussed above, "array of row-based objects" is the most natural representation of scientific data, typically;
  2. If we want to stick to this structure but rename/nest some fields (like in your demo script), something like JsonPath will help.
  3. But then, what if we want the structures from your initial post? Like [column name: column object] or [row name: row object] (instead of just an array)? Or even more complicated, like [third row value: row object]?.. In my head, it goes something like this:
df.to_json(
  name: '$.{index}.name',
  age: '$.{index}.demography.age'
) # => {child: {name: 'Jon Snow', demography: {age: 18}, ...

df.transpose.to_json(
  child: '$.hero.{index}',
  mom: '$.dead.{index}',
) # => {hero: {name: 'Jon Snow', age: 18, gender: 'Male'}, ...

df.transpose.to_json(
  '*' => '$.{name}.{index}'
) # => {'Jon Snow': {name: ...}, 'Lyanna Stark': {name: ...}
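
(For illustration, the {index} / {name} placeholders would presumably be substituted per row before building the nested structure; a hypothetical sketch:)

# idx is the row's index label, row is a {column => value} hash
def expand(template, idx, row)
  template.gsub('{index}', idx.to_s)
          .gsub(/\{(\w+)\}/) { row[Regexp.last_match(1).to_sym].to_s }
end

expand('$.{index}.name', :child, { name: 'Jon Snow' })    #=> "$.child.name"
expand('$.{name}.{index}', :child, { name: 'Jon Snow' })  #=> "$.Jon Snow.child"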

WDYT?..

athityakumar commented 7 years ago

@zverok - Agreed, dynamic JsonPaths like $..person..{index} or $..person..{name} have been implemented in PR #40, please review. 😄

However, I'm not sure if something like '*' => '$.{name}.{index}' can be supported automatically, rather than explicitly checking for a * argument.

athityakumar commented 7 years ago

@zverok - Dynamic JsonPaths are supported in PR #40 as of now, but the output is always given as an Array of nested Hashes. Here are some use-cases I have in mind for combining JsonPaths with a block, to allow more flexible output like an Array of Arrays or a Hash of Hashes.

IMHO, using a block to manipulate the Array of Hashes obtained from the JsonPaths will be the most flexible approach from a user's POV.

(0) Array of Nested Hashes (currently supported) -

df.to_json(name: '$.person.name', age: '$.person.age', sex: '$.person.gender', index: '$.relation')
#=> [ { relation: :child, person: { name: 'Jon Snow', age: 18, gender: 'Male' }} , ... ]

(1) Array of Arrays -

df.to_json(name: '$.person.name', age: '$.person.age', sex: '$.person.gender', index: '$.relation') do |json|
  json.map(&:values).map { |a,b| [a, b.values].flatten }
end
#=> [ [ :child, 'Jon Snow', 18, 'Male' ] , ... ]

(2) Hash of Arrays -

df.to_json(name: '$.person.name', age: '$.person.age', sex: '$.person.gender', index: '$.relation') do |json|
  json.map(&:values).map { |a,b| [a, b.values] }.to_h
end
#=> { child: [  'Jon Snow', 18, 'Male' ] , ... }

(3) Hash of Hashes -

df.to_json(sex: '$.{index}.gender', name: '$.{index}.name', age: '$.{index}.age') do |json|
  json.map { |j| [j.keys.first, j.values.first] }.to_h
end
#=> { child: { name: 'Jon Snow', age: 18, gender: 'Male' } , ... }

IMHO, block manipulation would work very well with the :orient option too from a user's POV, though it might seem a bit complex from a developer's POV. Please share your opinions on what we should continue forward with. 😄
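
Internally, the block handling could reduce to something like this (a hypothetical sketch, not the actual PR #40 code):

# paths are the jsonpath templates shown above; the optional block
# post-processes the Array of nested Hashes from case (0)
def to_json(**paths, &block)
  json = rows_as_nested_hashes(paths) # hypothetical helper producing case (0)
  block ? block.call(json) : json
end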

zverok commented 7 years ago

OK, let's proceed with your idea. I still feel like we could make it more generic... but I can't wrap my head around a better solution. Sorry for pulling you away from it.

athityakumar commented 7 years ago

@zverok - No issues. So, should the JSON Exporter have the :orient option, a block, or both? Just wanted to clarify this. 😄

zverok commented 7 years ago

Just do it the way you feel most clear and reasonable.

athityakumar commented 7 years ago

Added with PR #40. 🎉