SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License

Define JSON importer/exporter #327

Closed · ananyo2012 closed this issue 7 years ago

ananyo2012 commented 7 years ago

I think this is a good time to work on this. There are already to_h methods for DataFrame and Vector, but no corresponding method to write JSON to a file. API calls mostly return JSON, so the importer should be able to read JSON data from API responses. We can start off with simple write_to_json and from_json methods in the Daru::IO module. Since Ruby ships with json, we don't need to add any extra dependency; we just need to require 'json'.
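A minimal sketch of the idea, assuming the "hash of arrays" shape and using only the stdlib (write_to_json and from_json are the proposed names, not existing API):

require 'json'
require 'daru'

module Daru
  module IO
    # Read a JSON string shaped like {"col": [values, ...]} into a DataFrame.
    def self.from_json(json_string)
      Daru::DataFrame.new(JSON.parse(json_string))
    end

    # Write a DataFrame back out as a JSON file in the same shape.
    def self.write_to_json(df, path)
      # Vector#to_a makes the columns plain arrays so JSON can serialize them.
      hash = df.to_h.map { |name, vector| [name, vector.to_a] }.to_h
      File.write(path, JSON.generate(hash))
    end
  end
end

df = Daru::IO.from_json('{"a": [1, 2], "b": [3, 4]}')
Daru::IO.write_to_json(df, 'df.json')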

zverok commented 7 years ago

As always, I propose to think about some real data examples. Just one, yet pretty interesting, example is OpenWeatherMap. Here is an API sample: http://samples.openweathermap.org/data/2.5/forecast?lat=35&lon=139&appid=b1b15e88fa797225412429c1c50c122a1 (docs: https://openweathermap.org/forecast5). Ideally, we need to be able to do something like...

require 'open-uri'

URL = '...open weather map sample url...'
response = open(URL).read
df = Daru::DataFrame.from_json(response, some: inventive, parameters: of_conversion)
# => DataFrame with timestamp, temperature, wind speed and so on
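For comparison, a hand-rolled version of that conversion with today's daru and the stdlib could look like the sketch below; the field names (list, dt_txt, main.temp, wind.speed) come from the forecast5 response documented above, everything else is illustrative:

require 'json'
require 'open-uri'
require 'daru'

url = 'http://samples.openweathermap.org/data/2.5/forecast?lat=35&lon=139&appid=b1b15e88fa797225412429c1c50c122a1'
forecast = JSON.parse(open(url).read)

# forecast['list'] is an array of hashes, one per forecasted timestamp.
df = Daru::DataFrame.new(
  timestamp:   forecast['list'].map { |e| e['dt_txt'] },
  temperature: forecast['list'].map { |e| e['main']['temp'] },
  wind_speed:  forecast['list'].map { |e| e['wind']['speed'] }
)

A from_json importer would essentially automate this kind of mapping.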

Ideas?

zverok commented 7 years ago

As discussed with @athityakumar by mail, it is time to think more clearly about what is necessary here. Some statements:

  1. Default representation: We can't decide whether "array of hashes" (each hash is a row) or "hash of arrays" (each column is key: array of values) is more natural; it depends on the situation. So we should probably always support both options for import and export. For export -- probably something as simple as to_json_hash and to_json_array, without introducing tons of options (though we can discuss it; this is just a suggestion). See the sketch after this list for what the two shapes look like.
  2. After that, there could be a lot of complicated cases, e.g. deeply nested structures for import and export. For import, that would be the output of some APIs, or JSON documents from Mongo or ElasticSearch... For export, it could be the need to create, again, Mongo/ElasticSearch documents for insertion back, or JSON-based formats like GeoJSON. I am not sure what set of options/settings could be useful here, but I have a feeling that JSONPath could be utilized for the greater good, like this:
Daru::IO.from_json("{some: json}",
  index: '$.some.json.path',
  col1: '$.some.other.path',
  col2: '$.even.more.path'
)

It is, though, just a rough idea for consideration, not an instruction for implementation :)
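To make the two default shapes from item 1 concrete, here is a sketch using only current daru and the stdlib (to_json_hash / to_json_array are just the suggested names, nothing here is final API):

require 'json'
require 'daru'

df = Daru::DataFrame.new(name: ['Ram', 'Shyam'], age: [20, 30])

# "Hash of arrays": each column is key => array of values.
JSON.generate(df.to_h.map { |name, vec| [name, vec.to_a] }.to_h)
# => '{"name":["Ram","Shyam"],"age":[20,30]}'

# "Array of hashes": each hash is a row.
rows = []
df.each_row { |row| rows << row.to_h }
JSON.generate(rows)
# => '[{"name":"Ram","age":20},{"name":"Shyam","age":30}]'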

athityakumar commented 7 years ago

Agreed - the JSONPath example especially looks good. :+1:

Giving users an xpath-like option via the jsonpath gem will definitely make the from_json module user-friendly for nested JSONs (which most social-media graph APIs provide).
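For illustration, column extraction with the jsonpath gem could look roughly like this (a sketch only; the from_json signature and the paths are hypothetical):

require 'json'
require 'jsonpath'
require 'daru'

# Hypothetical importer: each keyword argument maps a column name to a JSONPath.
def from_json(json_string, **columns)
  data = columns.map { |name, path| [name, JsonPath.new(path).on(json_string)] }.to_h
  Daru::DataFrame.new(data)
end

json = '{"list": [{"main": {"temp": 285.5}}, {"main": {"temp": 286.1}}]}'
df = from_json(json, temp: '$.list[*].main.temp')
# => DataFrame with a single :temp vector: [285.5, 286.1]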

athityakumar commented 7 years ago

@zverok - Regarding the exporter, should using from_json and then to_json recreate the JSON source? That is, should

df = Daru::DataFrame.from_json(source, xpath_opts)
df.to_json(inverse_xpath_opts) # or a block

recreate the JSON source?

Or maybe, if a user wishes to recreate the JSON source,

df = Daru::DataFrame.from_json(source, xpath_opts)
df.to_json.map { |ele| restructure(ele) }

would be an easier way out?

zverok commented 7 years ago

I am not sure what the scope of your question is. Do you mean char-by-char correspondence? I don't think it should be the first goal, though it is an interesting side-task (which, by the way, validates the equivalent power of the importer and exporter).

athityakumar commented 7 years ago

Yes, I was asking whether equivalent power of the importer and exporter should (or can) be provided for JSON. Because, for creating a complex nested JSON from a DataFrame (to_json), the missing data has to be provided manually (unless we store it as a class variable, which wouldn't be good either), and there's no other way apart from the user manually mapping and manipulating the hash given by to_json, right?

zverok commented 7 years ago

Yes, it seems so. If the user needs a structure like

{
  metadata: { something },
  data: [
    // the real DF output, like: { col: value, col2: value2 }
  ]
}

...then the simplest option would probably be to just construct something like

df.to_json(col1: '$.data.col', col2: '$.data.col2')

...and then merge it with some metadata. But probably also useful in this case would be a method like as_json, which would return not a string but plain Ruby structures (hashes and arrays), which are easier to merge with other data.
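For instance, with a hypothetical as_json standing in for "to_json minus the final string step", the merge could be sketched as:

require 'json'
require 'daru'

df = Daru::DataFrame.new(col: [1, 2], col2: [3, 4])

# Stand-in for as_json: a plain Ruby array of row hashes, not a string.
rows = []
df.each_row { |row| rows << row.to_h }

# Merge the frame's data with arbitrary metadata before serializing.
payload = { metadata: { generated_at: Time.now.to_s }, data: rows }
File.write('out.json', JSON.generate(payload))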

athityakumar commented 7 years ago

@zverok - I've submitted a Pull Request for the JSON Importer, with support for parsing from specific JSONPaths as per this issue. Please review https://github.com/athityakumar/daru-io/pull/21 whenever you're free 😄

zverok commented 7 years ago

OK, let's close this ticket, then, and continue any further considerations in the daru-io project.

For the record, I am not very happy about the fact that a ton of PRs were merged without me :( Of course, I am guilty myself, because I was absent for almost 10 days at an important period, but in the future feel free to at least ping me and ask about it -- I was "kinda" online, so at least I could say something like "OK, merge it" or "Sorry, wait for me, work on the next task in the meantime", OK?

zverok commented 7 years ago

...and in fact you merged it AFTER I wrote that I am back and reviewing everything. It is pretty weird. I have a lot of considerations about that one, and will review it now, and you'll need to plan to fix my notes later, when you have time.

athityakumar commented 7 years ago

Sorry @zverok, I merged them only after they were approved by at least one of the mentors. I'll definitely take this into consideration in subsequent PRs. I'm genuinely sorry, and I'll definitely address your reviews for all these merged PRs. Please feel free to review. 😄