[Rust] Add support for JSON data sources

alamb commented 3 years ago

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-10118

Arrow already has a JSON reader and it would be nice to integrate this with DataFusion so that queries can be run against JSON files.

This would probably not be trivial though since we would need to add support for schemaless data sources (it isn't practical to parse the JSON files first to extract the schema).

alamb commented 3 years ago

Comment from Neville Dipale(nevi_me) @ 2020-11-27T22:59:54.476+0000:

[~andygrove] why is it not practical to parse the JSON files first to get the schema?

Comment from Andy Grove(andygrove) @ 2021-02-24T01:51:01.855+0000:

Well, we could add schema inference, but it could be slow for large JSON
files, especially where the schema varies between objects and where there
are nested structs with varying schemas.

Maybe there are two different stories here.

1) Support JSON using schema inference

2) Support JSON in a schemaless way. For example, if I run "SELECT a, b,
c.d.e.f ..." I would expect to get NULLs for any of these attributes that
do not exist on any particular row.
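
To make the schemaless behaviour in (2) concrete, here is a minimal sketch of resolving a dotted path against a single row and yielding NULL when any segment is missing. The `Json` enum and the `lookup` function are hypothetical stand-ins invented for illustration; they are not Arrow or DataFusion APIs, and a real implementation would work on columnar batches rather than on one row at a time.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a parsed JSON value.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Number(f64),
    Str(String),
    Object(BTreeMap<String, Json>),
}

/// Resolve a dotted path such as "c.d.e.f" against one row.
/// Returns `Json::Null` if a segment is missing or if an intermediate
/// value is not an object, mirroring SQL NULL for absent attributes.
fn lookup(row: &Json, path: &str) -> Json {
    let mut current = row;
    for segment in path.split('.') {
        match current {
            Json::Object(fields) => match fields.get(segment) {
                Some(next) => current = next,
                None => return Json::Null,
            },
            // Scalars and nulls have no children to descend into.
            _ => return Json::Null,
        }
    }
    current.clone()
}
```

With a row equivalent to `{"a": 1, "c": {"d": {"e": "x"}}}`, `lookup(&row, "c.d.e")` returns the string value, while `lookup(&row, "c.d.e.f")` and `lookup(&row, "b")` both return `Json::Null`.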

heymind commented 3 years ago

I would like to implement it.

For schema inference, sampling only the first N items may be enough. A schemaless JSON representation is much more difficult to implement, and it may have only limited usage scenarios.

alamb commented 3 years ago

@heymind I think adding support for reading JSON by sampling the first N items (where N is some configuration parameter) would be a valuable feature in itself, even without schemaless JSON support.

We could file a follow-on ticket for supporting JSON in a schemaless way.

In other words, breaking apart the two different stories sounds like a good idea to me.
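
As a rough illustration of the "sample the first N items" idea discussed above, the sketch below infers a column type per field from the first `sample_size` records and widens to a string type when samples disagree. Everything here (`Value`, `ColumnType`, `Record`, `merge`, `infer_schema`) is a hypothetical simplification for this thread, not the Arrow JSON reader's actual API; records are flat, and the widening rule is an assumption.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a parsed (flat) JSON value.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Bool(bool),
    Number(f64),
    Str(String),
}

/// Inferred column type; `Utf8` is the assumed fallback for conflicts.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Null,
    Boolean,
    Float64,
    Utf8,
}

/// One JSON object, keyed by field name.
type Record = BTreeMap<String, Value>;

fn type_of(value: &Value) -> ColumnType {
    match value {
        Value::Null => ColumnType::Null,
        Value::Bool(_) => ColumnType::Boolean,
        Value::Number(_) => ColumnType::Float64,
        Value::Str(_) => ColumnType::Utf8,
    }
}

/// Combine two observed types: equal types stay as they are, `Null`
/// yields to the other type, and any other conflict widens to `Utf8`.
fn merge(a: ColumnType, b: ColumnType) -> ColumnType {
    match (a, b) {
        (x, y) if x == y => x,
        (ColumnType::Null, other) | (other, ColumnType::Null) => other,
        _ => ColumnType::Utf8,
    }
}

/// Infer a schema from at most `sample_size` leading records.
/// Fields that first appear after the sample are not seen at all.
fn infer_schema(records: &[Record], sample_size: usize) -> BTreeMap<String, ColumnType> {
    let mut schema = BTreeMap::new();
    for record in records.iter().take(sample_size) {
        for (name, value) in record {
            let seen = type_of(value);
            schema
                .entry(name.clone())
                .and_modify(|t| *t = merge(*t, seen))
                .or_insert(seen);
        }
    }
    schema
}
```

The trade-off is visible in the sketch: a small `sample_size` keeps inference cheap, but any field or type change that only shows up in later records is missed, which is where the schemaless story would come in.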

apache / datafusion

[Rust] Add support for JSON data sources #103