@zverok - This is planned to be done next week, and it could be really helpful to discuss the various to_json
output options. These are some options I have in mind, but please feel free to share any use case(s) you might prefer (anything with blocks / json-paths?). 😄
df = Daru::DataFrame.new [[1,2],[3,4]], order: [:x, :y], index: [:a, :b]
df.to_json
#=> {:data=>[[1,2],[3,4]], :order=>[:x, :y], :index=>[:a, :b]}
df.to_json :order_first
#=> {:x=>{:a=>1, :b=>2}, :y=>{:a=>3, :b=>4}}
df.to_json :index_first
#=> {:a=>{:x=>1, :y=>3}, :b=>{:x=>2, :y=>4}}
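Not the exporter's implementation, just a rough sketch (assuming plain Daru plus the stdlib json) of what the two proposed symbols could map to, so the shapes above are concrete:
require 'daru'
require 'json'

df = Daru::DataFrame.new [[1,2],[3,4]], order: [:x, :y], index: [:a, :b]

# :order_first - one hash per column, keyed by the index labels
order_first = df.vectors.to_a.map { |col| [col, df.index.to_a.zip(df[col].to_a).to_h] }.to_h
#=> {:x=>{:a=>1, :b=>2}, :y=>{:a=>3, :b=>4}}

# :index_first - one hash per row, keyed by the column names
index_first = df.index.to_a.map { |idx| [idx, df.vectors.to_a.zip(df.row[idx].to_a).to_h] }.to_h
#=> {:a=>{:x=>1, :y=>3}, :b=>{:x=>2, :y=>4}}

puts index_first.to_json #=> {"a":{"x":1,"y":3},"b":{"x":2,"y":4}}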
Hello @athityakumar and @zverok, if you don't mind me chiming in, I would like to give my 2 cents on this issue. This is just a suggestion from someone who has come across exporting to JSON before but isn't very good with daru.
Usually, in a dataframe (or a spreadsheet, for that matter), every row is an entry and the columns are attributes. So semantically, every row/entry, from the leftmost column to the rightmost, is an 'object'. Each cell is then a 'property' of that object, with the column name as the 'name'/key and its corresponding cell value as the 'value'. One would expect the default behavior of df.to_json
to translate in that way.
For example, the sales-funnel DataFrame here would translate to:
[
{
"Account": 714466,
"Name": "Trantow-Barrows",
"Rep": "Craig Booker",
"Manager": "Debra Henley",
"Product": "CPU",
"Quantity": 1,
"Price": 30000,
"Status": "presented"
},
{
"Account": 714466,
"Name": "Trantow-Barrows",
"Rep": "Craig Booker",
"Manager": "Debra Henley",
"Product": "Software",
"Quantity": 1,
"Price": 10000,
"Status": "presented"
},
...
]
P.S. I used the tool http://www.convertcsv.com/csv-to-json.htm for the example output. But yeah, CSV and dataframes are totally different, and the above logic might not suffice for all cases, since dataframes also carry index information.
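For what it's worth, here is a rough sketch (not the exporter itself) of that row-wise orientation using plain Daru, assuming the small two-column df from the first comment; every row becomes one object keyed by column name:
require 'daru'
require 'json'

df = Daru::DataFrame.new [[1,2],[3,4]], order: [:x, :y], index: [:a, :b]

# One hash per row: column name => cell value
records = df.map_rows { |row| df.vectors.to_a.zip(row.to_a).to_h }
puts records.to_json
#=> [{"x":1,"y":3},{"x":2,"y":4}]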
EDIT:
I hadn't seen the other discussion 😅 apologies.
The default for df.to_json
that @athityakumar has suggested might be able to handle all cases, but the output of df.to_json :index_first
seems more intuitive. IMHO it would be better if the current default's output could be obtained through something like df.to_json :data_first
(bad name, I know 😛), and df.to_json
and df.to_json :index_first
gave the same output.
My opinion(s):
OK?..
@zverok - Sadly, the JsonPath
gem doesn't support creating nested hashes from a corresponding jsonpath, like $..person..name
=> hash[:person][:name]
. But a workaround that uses deep_merge
of Hashes to recursively build a complexly nested JSON response seems to be working (at least with the index-first
feature). It hasn't been battle-tested yet, though. Have a look at this temporary Ruby Script and output JSON file.
Meanwhile, how exactly should index values be accommodated within the json response?
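For reference, a minimal sketch of the deep_merge idea, simplified to plain dotted paths (so it doesn't cover the full $.. recursive-descent syntax, and it is not the code from the linked script):
# Turn '$.person.name' into a nested one-key hash, then merge all fields of a row.
def nest(path, value)
  keys = path.sub('$.', '').split('.')                      # ['person', 'name']
  keys.reverse.inject(value) { |acc, key| { key.to_sym => acc } }
end

def deep_merge(a, b)
  a.merge(b) { |_key, x, y| x.is_a?(Hash) && y.is_a?(Hash) ? deep_merge(x, y) : y }
end

row = { '$.person.name' => 'Jon Snow', '$.person.age' => 18 }
row.map { |path, value| nest(path, value) }.inject { |a, b| deep_merge(a, b) }
#=> {:person=>{:name=>"Jon Snow", :age=>18}}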
OK, I've tried to wrap my head around it, and here's what I've got.
[column name: column object]
or [row name: row object]
(instead of just an array)? Or even more complicated, like [third row value: row object]
?.. In my head, it goes something like this:
df.to_json(
name: '$.{index}.name',
age: '$.{index}.demography.age'
) # => {child: {name: 'Jon Snow', demography: {age: 18}, ...
df.transpose.to_json(
child: '$.hero.{index}',
mom: '$.dead.{index}',
) # => {hero: {name: 'Jon Snow', age: 18, gender: 'Male'}, ...
df.transpose.to_json(
'*' => '$.{name}.{index}'
) # => {'Jon Snow': {name: ...}, 'Lyanna Stark': {name: ...}
WDYT?..
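If it helps, here is how I imagine the {index} / {name} substitution could work; just a sketch with a hypothetical resolve helper, not a claim about how the exporter would do it:
# Replace {index} with the row's index label, and any other {placeholder}
# with that row's value for the same-named column, before the path is expanded.
def resolve(template, index:, row:)
  template.gsub(/\{(\w+)\}/) do
    key = Regexp.last_match(1)
    key == 'index' ? index.to_s : row[key.to_sym].to_s
  end
end

resolve('$.{index}.name', index: :child, row: { name: 'Jon Snow' })
#=> "$.child.name"
resolve('$.{name}.{index}', index: :child, row: { name: 'Jon Snow' })
#=> "$.Jon Snow.child"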
@zverok - Agreed, dynamic JsonPaths like $..person..{index}
or $..person..{name}
have been implemented in PR #40, please review. 😄
However, I'm not sure if something like '*' => '$.{name}.{index}'
can be supported automatically, rather than by explicitly checking for a '*'
argument.
@zverok - Dynamic JsonPaths have been supported in PR #40 as of now, but the output is always given as an Array of nested Hashes. Here are some use cases I have in mind for combining JsonPaths with a block to get more flexible output, like an Array of Arrays or a Hash of Hashes.
IMHO, using a block to manipulate the Array of Hashes obtained from the JsonPaths would be the most flexible approach from a user's POV.
(0) Array of Nested Hashes (currently supported) -
df.to_json(name: '$.person.name', age: '$.person.age', sex: '$.person.gender', index: '$.relation')
#=> [ { relation: :child, person: { name: 'Jon Snow', age: 18, gender: 'Male' }} , ... ]
(1) Array of Arrays -
df.to_json(name: '$.person.name', age: '$.person.age', sex: '$.person.gender', index: '$.relation') do |json|
json.map(&:values).map { |a,b| [a, b.values].flatten }
end
#=> [ [ :child, 'Jon Snow', 18, 'Male' ] , ... ]
(2) Hash of Arrays -
df.to_json(name: '$.person.name', age: '$.person.age', sex: '$.person.gender', index: '$.relation') do |json|
json.map(&:values).map { |a,b| [a, b.values] }.to_h
end
#=> { child: [ 'Jon Snow', 18, 'Male' ] , ... }
(3) Hash of Hashes -
df.to_json(sex: '$.{index}.gender', name: '$.{index}.name', age: '$.{index}.age') do |json|
json.map { |j| [j.keys.first, j.values.first] }.to_h
end
#=> { child: { name: 'Jon Snow', age: 18, gender: 'Male' } , ... }
IMHO, block manipulation would also work very well with the :orient
option from a user's POV, though it might seem a bit complex from a developer's POV. Please share your opinions on which approach we should continue with. 😄
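To make the block contract concrete, a tiny sketch (hypothetical helper name, not the exporter's code): the exporter builds the Array of nested Hashes from the jsonpaths, and the block, if given, simply post-processes that array before serialization, which is all that cases (1)-(3) above need:
require 'json'

# Yield the Array of nested Hashes to the block (if any), then serialize.
def jsonpath_export(array_of_hashes, &block)
  (block ? block.call(array_of_hashes) : array_of_hashes).to_json
end

rows = [{ relation: :child, person: { name: 'Jon Snow', age: 18, gender: 'Male' } }]
jsonpath_export(rows) { |json| json.map(&:values).map { |a, b| [a, b.values] }.to_h }
#=> '{"child":["Jon Snow",18,"Male"]}'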
OK, let's proceed with your idea. I still feel like we could do it in a more generic way... but I can't wrap my head around a better solution. Sorry for holding you back from it.
@zverok - No issues. So, should the JSON Exporter have the :orient
option, or a block, or both? Just wanted to clarify this. 😄
Just do it the way that feels most clear and reasonable to you.
Added with PR #40. 🎉
Followed from this issue tracker.