Closed by alamb 3 years ago
Comment from Neville Dipale(nevi_me) @ 2020-11-27T22:59:54.476+0000:
[~andygrove] why is it not practical to parse the JSON files first to get the schema?
Comment from Andy Grove(andygrove) @ 2021-02-24T01:51:01.855+0000:
Well, we could add schema inference, but it could be slow for large JSON files, especially where the schema varies between objects and where there are nested structs with varying schemas. Maybe there are two different stories here: 1) support JSON using schema inference, and 2) support JSON in a schemaless way. For example, if I run "SELECT a, b, c.d.e.f ..." I would expect to get NULLs for any of these attributes that do not exist on a particular row.
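To make the schemaless behaviour described above concrete, here is a minimal sketch in plain Python. It is not DataFusion or Arrow code; `get_path` and the sample rows are hypothetical names and data invented for illustration. It only shows the expected semantics: a projection such as `a, b, c.d.e.f` yields a null (`None`) wherever an attribute is absent from a row or a path step is not an object.

```python
from typing import Any


def get_path(row: dict, path: str) -> Any:
    """Return the value at a dotted path, or None if any step is missing.

    Hypothetical helper: mimics the NULL-for-missing-attribute semantics
    discussed in the thread, not any real DataFusion API.
    """
    current: Any = row
    for key in path.split("."):
        # A missing key, or a non-object where an object is expected, gives NULL.
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Made-up rows whose shape varies from object to object.
rows = [
    {"a": 1, "b": "x", "c": {"d": {"e": {"f": 42}}}},
    {"a": 2},
    {"a": 3, "b": "y", "c": {"d": 7}},
]

# Rough equivalent of: SELECT a, b, c.d.e.f ...
result = [tuple(get_path(r, p) for p in ("a", "b", "c.d.e.f")) for r in rows]
for projected in result:
    print(projected)
```

The third row shows the nested case: `c.d` exists but is a number rather than a struct, so `c.d.e.f` resolves to null instead of raising an error.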
I would like to implement it.
For schema inference, sampling only the first N items may be enough. A schemaless JSON representation is much more difficult to implement, and its usage scenarios may be limited.
@heymind I think adding support for reading JSON by sampling the first N items (where N is some configuration parameter) would be a valuable feature itself, even without schemaless JSON support.
We could file a follow-on ticket for supporting JSON in a schemaless way.
In other words, breaking apart the two different stories sounds like a good idea to me.
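As a rough illustration of the sampling idea discussed above, the following sketch infers a flat schema from the first N lines of newline-delimited JSON. It is plain Python using only the standard library, not the Arrow JSON reader. The names `infer_schema`, `merge_types`, and `max_records`, the type labels, and the rule of falling back to a string type on conflicts are all assumptions made for this example.

```python
import io
import json
from itertools import islice


def type_name(value):
    """Map a decoded JSON value to an illustrative type label."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "float64"
    if isinstance(value, str):
        return "utf8"
    if isinstance(value, list):
        return "list"
    return "struct"


def merge_types(a, b):
    """Combine two observed types for the same field."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if {a, b} == {"int64", "float64"}:
        return "float64"
    return "utf8"  # assumption: conflicting types fall back to string


def infer_schema(lines, max_records):
    """Infer field types from at most max_records lines, one JSON object per line."""
    schema = {}
    for line in islice(lines, max_records):
        line = line.strip()
        if not line:
            continue
        for key, value in json.loads(line).items():
            observed = type_name(value)
            schema[key] = merge_types(schema[key], observed) if key in schema else observed
    return schema


data = io.StringIO(
    '{"a": 1, "b": "x"}\n'
    '{"a": 2.5, "c": true}\n'
    '{"a": "late", "d": 1}\n'
)
schema = infer_schema(data, max_records=2)
print(schema)
```

With `max_records=2` the third record is never inspected, so field `d` and the string value of `a` are missed. That is the trade-off of sampling: inference stays cheap on large files, but records beyond the sample can disagree with the inferred schema.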
Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-10118
Arrow already has a JSON reader and it would be nice to integrate this with DataFusion so that queries can be run against JSON files.
This would probably not be trivial though since we would need to add support for schemaless data sources (it isn't practical to parse the JSON files first to extract the schema).