GlareDB / glaredb

GlareDB: An analytics DBMS for distributed data
https://glaredb.com
GNU Affero General Public License v3.0

data source from JSON array data #2218

Closed tychoish closed 6 months ago

tychoish commented 9 months ago

Presently, we have ndjson support for reading files and endpoints with newline-delimited JSON. This is great for dealing with files that store sequences or streams of document-structured data; however, many JSON APIs return sequences of documents as members of a JSON array. We should add support for accessing these kinds of data sources, which will help connect GlareDB with potential users who have JSON data from other sources.
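For concreteness, here are the same two (illustrative) documents in the layout we already support and in the array layout this issue asks for:

```
# newline-delimited JSON (ndjson), already supported:
{"station":"franklin ave","services":["1","2","3","4","5","S"]}
{"station":"Broadway–Lafayette Street/Bleecker Street","services":["B","D","F","M","6"]}

# the same documents as members of a single JSON array:
[
  {"station":"franklin ave","services":["1","2","3","4","5","S"]},
  {"station":"Broadway–Lafayette Street/Bleecker Street","services":["B","D","F","M","6"]}
]
```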

Components

I've just split aspects out into components.

Semantics

Future Work

Lilit0x commented 9 months ago

Hi there, @tychoish. This is interesting and I'd like to work on it, but I have some questions, mostly about the semantics.

universalmind303 commented 9 months ago

potential duplicate of https://github.com/GlareDB/glaredb/issues/2028

tychoish commented 9 months ago

> On the schema inference, won't it be an overhead to read all the elements in the array to form a superset of the documents, before then actually processing the data?

I thought this too initially, but because these JSON documents need to be read into memory entirely, I don't think there's much to be won by doing it in parallel or overthinking it, at least in the beginning. (We can think about optimizing this in different ways in the future, but for now I think it's fine to read these documents into maps of strings to serde_json::Value objects. Once the documents are all stored in maps we have a lot of options for iterating over them.)

> (I think we should start with the n configurable docs and inferring the schema from those).

Let's start with doing them all for now since they're all going to be in memory, and doing two passes probably isn't going to matter.
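As a rough sketch of what "read everything into maps, then iterate" could look like (illustrative only; the `docs_and_fields` name is hypothetical, and type inference and non-object elements are ignored here):

```rust
use std::collections::BTreeSet;

use serde_json::{Map, Value};

/// Parse a JSON array body into one map per document, then take a second
/// pass over the in-memory maps to union the field names into a "superset"
/// of columns. Both passes are cheap once everything is in memory.
fn docs_and_fields(body: &str) -> serde_json::Result<(Vec<Map<String, Value>>, BTreeSet<String>)> {
    let parsed: Value = serde_json::from_str(body)?;

    // A top-level array is treated as a sequence of documents; a bare
    // object is treated as a single document.
    let docs: Vec<Map<String, Value>> = match parsed {
        Value::Array(items) => items
            .into_iter()
            .filter_map(|item| match item {
                Value::Object(doc) => Some(doc),
                _ => None, // this sketch skips non-object array elements
            })
            .collect(),
        Value::Object(doc) => vec![doc],
        _ => Vec::new(),
    };

    // Union of all field names across documents.
    let fields: BTreeSet<String> = docs.iter().flat_map(|doc| doc.keys().cloned()).collect();

    Ok((docs, fields))
}
```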

I also think that we should probably not allow/support pagination.

> My question is this: why not cast the value to the already-recorded type? What I mean is that since `city` is already known to be a string, any value of `city` that deviates from that should be typecast to a string too.

The problem with this is that it's sort of order-dependent: what happens if we read the documents in a different order? Which one do we pick?
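To make the order-dependence concrete: given (say) `{"city": "nyc"}` and `{"city": 10001}`, casting to the first-seen type gives a different column type depending on which document is read first, whereas a commutative "supertype" rule does not. A toy sketch of that idea (the lattice below is hypothetical, not GlareDB's actual rules):

```rust
use serde_json::Value;

/// A deliberately tiny type lattice for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum InferredType {
    Null,
    Bool,
    Int64,
    Float64,
    Utf8,
}

fn infer(value: &Value) -> InferredType {
    match value {
        Value::Null => InferredType::Null,
        Value::Bool(_) => InferredType::Bool,
        Value::Number(n) if n.is_i64() => InferredType::Int64,
        Value::Number(_) => InferredType::Float64,
        // Strings, arrays, and objects all become strings in this sketch.
        _ => InferredType::Utf8,
    }
}

/// unify(a, b) == unify(b, a), so the inferred column type is the same no
/// matter which document happens to be read first.
/// e.g. unify(infer(&json!("nyc")), infer(&json!(10001))) == Utf8 in either order.
fn unify(a: InferredType, b: InferredType) -> InferredType {
    use InferredType::*;
    match (a, b) {
        (a, b) if a == b => a,
        (Null, other) | (other, Null) => other,
        (Int64, Float64) | (Float64, Int64) => Float64,
        _ => Utf8,
    }
}
```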

"nested data structures". I don't understand your point here.

I was thinking about the `services` field in these documents. You could also imagine sub-documents.

{"station":"franklin ave", "services":["1","2","3","4","5","S"]}
{"station":"Broadway–Lafayette Street/Bleecker Street","services":["B","D","F","M","6"]}

> Furthermore, should the inference be on all the documents that fall in the range from 0 to the maximum size specified? If yes, what if there are additional elements in the remaining documents that fall outside of the range?

When we're doing schema inference, I think we should ignore documents that fall outside of the range, though I'm perfectly happy to fall back to whatever the ndjson handling does.

In general my preference is to start without sampled schema inference (read all elements), then add a user-specified schema (fields and types), then some kind of (effectively probabilistic) inferred schema.

> How do we handle elements missing from documents? I assume the straightforward answer is that they'll be null?

Yeah, I think null for any missing value is fine. The one challenge is that when writing a JSON document off of a schema that was created from a different JSON document, you can't round-trip a "null" from the input document. My preference when writing rows from GlareDB to JSON is to just omit fields that are null. We should document this, but I think it's fine.
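A small sketch of that write-side convention (the `write_row` helper and the `borough` field are made up for illustration):

```rust
use serde_json::{json, Map, Value};

/// Serialize a row as a JSON object, omitting null fields instead of
/// writing `"field": null`. This is why a document read with a field
/// missing does not round-trip byte-for-byte.
fn write_row(row: &Map<String, Value>) -> Value {
    Value::Object(
        row.iter()
            .filter(|(_, value)| !value.is_null())
            .map(|(key, value)| (key.clone(), value.clone()))
            .collect(),
    )
}

fn main() {
    let mut row = Map::new();
    row.insert("station".to_string(), json!("franklin ave"));
    // A field that was missing in the source document shows up as null.
    row.insert("borough".to_string(), Value::Null);

    // Prints {"station":"franklin ave"}; the null field is omitted on write.
    println!("{}", write_row(&row));
}
```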

tychoish commented 8 months ago

@Lilit0x are you still interested in doing this? is there anything I can do to help? It's cool if not.

Lilit0x commented 8 months ago

> @Lilit0x are you still interested in doing this? is there anything I can do to help? It's cool if not.

Hi there. Yes, I am. I'll start it soon if it is not urgent. I just want to get some things sorted before I work on it.

tychoish commented 8 months ago

@Lilit0x totally cool. This is an exciting feature for me and I just wanted to see where things were, and also to make sure that there weren't any parts that were confusing. Do let me know if I can help (or if there's a better way for you to coordinate with me/us!).

universalmind303 commented 8 months ago

I'm a little late to the party here, and I know @Lilit0x already started working on this a bit, but I don't think we should try to reinvent the wheel too much on this one.

While arrow-rs doesn't have a JSON implementation, there are other high-performance, Arrow-based JSON readers we can draw inspiration from; Polars and arrow2 are both good examples. The arrow2 implementation is much simpler and could be a good starting point for us. The Polars implementation is quite a bit more complex due to the optimization techniques it uses.

tychoish commented 8 months ago

Definitely agree @universalmind303 about not reinventing the wheel, but what are you proposing here? It doesn't look like there's an obvious path forward to using either implementation directly, unless you have something else in mind. @Lilit0x's work seems reasonable.

universalmind303 commented 8 months ago

> Definitely agree @universalmind303 about not reinventing the wheel, but what are you proposing here? It doesn't look like there's an obvious path forward to using either implementation directly, unless you have something else in mind. @Lilit0x's work seems reasonable.

A lot of the problems mentioned (schema inference, supertyping, error handling, ...) are already solved in these libraries; instead of trying to re-solve those same problems, we should just do what they do. Our implementation may be slightly different, but the same rules apply, so we can follow their logic and implement it for our use case instead of defining our own logic and rules for handling JSON data.

So I think the work in https://github.com/GlareDB/glaredb/pull/2306 is a good starting point. What I'm suggesting is that while finishing up #2306 we should use polars and arrow2 as models for what our own implementation should look like. If there are unsolved problems such as supertyping or schema inference, we should look to those for solutions before DIYing our own.

Lilit0x commented 8 months ago

Hi @tychoish, I apologize for the slow progress on this. I have a faulty machine that I haven't been able to repair yet, so I have to use an online VM to write code, which is not the best experience.

I think I understand @universalmind303's concern. These libraries have solved most of the problems that relate to parsing JSON, so instead of writing our own we can look into theirs and adapt it to work with the current workflow, which should speed up development. That said, I have already looked into arrow2 and used the deserialization process from the library, although I am facing an "unequal length" error in `StructBuilder` and `Fields` when parsing nested objects. I should be able to figure it out and complete the first task in this issue.

tychoish commented 6 months ago

With #2729, I think this is pretty feature complete.