GlareDB / glaredb

GlareDB: An analytics DBMS for distributed data
https://glaredb.com
GNU Affero General Public License v3.0

data source from JSON array data #2218

Closed tychoish closed 6 months ago

tychoish commented 9 months ago

Presently, we have ndjson support for reading files and endpoints with newline-delimited JSON. This is great for dealing with files that store sequences or streams of document-structured data; however, many JSON APIs return sequences of documents as members of a JSON array. We should add support for accessing these kinds of data sources, which will help connect GlareDB with potential users who have JSON data from other sources.
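For concreteness, here are the same two (illustrative) documents in the layout we already support and in the array layout this issue asks for:

```
# newline-delimited JSON (ndjson), already supported:
{"station":"franklin ave","services":["1","2","3","4","5","S"]}
{"station":"Broadway–Lafayette Street/Bleecker Street","services":["B","D","F","M","6"]}

# the same documents as members of a single JSON array:
[
  {"station":"franklin ave","services":["1","2","3","4","5","S"]},
  {"station":"Broadway–Lafayette Street/Bleecker Street","services":["B","D","F","M","6"]}
]
```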

Components

I've just split aspects out into components.

Semantics

Future Work

Lilit0x commented 9 months ago

Hi there, @tychoish. This is interesting and I'd like to work on it, but I have some questions, mostly about the semantics.

universalmind303 commented 9 months ago

potential duplicate of https://github.com/GlareDB/glaredb/issues/2028

tychoish commented 9 months ago

> On the schema inference, won't it be an overhead to read all the elements in the array to form a superset of the documents, before then actually processing the data?

I thought this too initially, but because these JSON documents need to be read into memory entirely, I don't think there's much to be won by doing it in parallel or overthinking it, at least in the beginning. (We can think about optimizing this in different ways in the future, but for now I think it's fine to read these documents into maps of strings to serde_json::Value objects. Once the documents are all stored in maps we have a lot of options for iterating over them.)

> (I think we should start with the n configurable docs and inferring the schema from those).

Let's start with doing them all for now since they're all going to be in memory, and doing two passes probably isn't going to matter.
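As a rough sketch of what "read everything into maps, then iterate" could look like (illustrative only; the `docs_and_fields` name is hypothetical, and type inference and non-object elements are ignored here):

```rust
use std::collections::BTreeSet;

use serde_json::{Map, Value};

/// Parse a JSON array body into one map per document, then take a second
/// pass over the in-memory maps to union the field names into a "superset"
/// of columns. Both passes are cheap once everything is in memory.
fn docs_and_fields(body: &str) -> serde_json::Result<(Vec<Map<String, Value>>, BTreeSet<String>)> {
    let parsed: Value = serde_json::from_str(body)?;

    // A top-level array is treated as a sequence of documents; a bare
    // object is treated as a single document.
    let docs: Vec<Map<String, Value>> = match parsed {
        Value::Array(items) => items
            .into_iter()
            .filter_map(|item| match item {
                Value::Object(doc) => Some(doc),
                _ => None, // this sketch skips non-object array elements
            })
            .collect(),
        Value::Object(doc) => vec![doc],
        _ => Vec::new(),
    };

    // Union of all field names across documents.
    let fields: BTreeSet<String> = docs.iter().flat_map(|doc| doc.keys().cloned()).collect();

    Ok((docs, fields))
}
```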

I also think that we should probably not allow/support pagination.

> My question is this: why not cast the value to the already-recorded type? What I mean is that since `city` is already known to be a string, any value of `city` that deviates from that should be typecast to a string too.

The problem with this is that it's sort of order-dependent: what happens if we read the documents in a different order? Which one do we pick?
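To make the order-dependence concrete: given (say) `{"city": "nyc"}` and `{"city": 10001}`, casting to the first-seen type gives a different column type depending on which document is read first, whereas a commutative "supertype" rule does not. A toy sketch of that idea (the lattice below is hypothetical, not GlareDB's actual rules):

```rust
use serde_json::Value;

/// A deliberately tiny type lattice for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum InferredType {
    Null,
    Bool,
    Int64,
    Float64,
    Utf8,
}

fn infer(value: &Value) -> InferredType {
    match value {
        Value::Null => InferredType::Null,
        Value::Bool(_) => InferredType::Bool,
        Value::Number(n) if n.is_i64() => InferredType::Int64,
        Value::Number(_) => InferredType::Float64,
        // Strings, arrays, and objects all become strings in this sketch.
        _ => InferredType::Utf8,
    }
}

/// unify(a, b) == unify(b, a), so the inferred column type is the same no
/// matter which document happens to be read first.
/// e.g. unify(infer(&json!("nyc")), infer(&json!(10001))) == Utf8 in either order.
fn unify(a: InferredType, b: InferredType) -> InferredType {
    use InferredType::*;
    match (a, b) {
        (a, b) if a == b => a,
        (Null, other) | (other, Null) => other,
        (Int64, Float64) | (Float64, Int64) => Float64,
        _ => Utf8,
    }
}
```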

"nested data structures". I don't understand your point here.

I was thinking about the `services` field in these documents. You could also imagine sub-documents.

{"station":"franklin ave", "services":["1","2","3","4","5","S"]}
{"station":"Broadway–Lafayette Street/Bleecker Street","services":["B","D","F","M","6"]}

> Furthermore, should the inference be on all the documents that fall in the range from 0 to the maximum size specified? If yes, what if there are additional elements in the remaining documents that fall outside of the range?

When we're doing schema inference, I think we should ignore documents that fall outside of the range, though I'm perfectly happy to fall back to whatever the ndjson handling does.

In general my preference is to start without sampled schema inference (read all elements), then add a user-specified schema (fields and types), then some kind of (effectively probabilistic) inferred schema.

> How do we handle elements missing from documents? I assume the straightforward answer is that they'll be null?

Yeah, I think null for any missing value is fine. The one challenge is that when writing a JSON document off of a schema that was created from a different JSON document, you can't round-trip a "null" from the input document. My preference when writing rows from GlareDB to JSON is to just omit fields that are null. We should document this, but I think it's fine.
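A small sketch of that write-side convention (the `write_row` helper and the `borough` field are made up for illustration):

```rust
use serde_json::{json, Map, Value};

/// Serialize a row as a JSON object, omitting null fields instead of
/// writing `"field": null`. This is why a document read with a field
/// missing does not round-trip byte-for-byte.
fn write_row(row: &Map<String, Value>) -> Value {
    Value::Object(
        row.iter()
            .filter(|(_, value)| !value.is_null())
            .map(|(key, value)| (key.clone(), value.clone()))
            .collect(),
    )
}

fn main() {
    let mut row = Map::new();
    row.insert("station".to_string(), json!("franklin ave"));
    // A field that was missing in the source document shows up as null.
    row.insert("borough".to_string(), Value::Null);

    // Prints {"station":"franklin ave"}; the null field is omitted on write.
    println!("{}", write_row(&row));
}
```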

tychoish commented 8 months ago

@Lilit0x are you still interested in doing this? is there anything I can do to help? It's cool if not.

Lilit0x commented 8 months ago

> @Lilit0x are you still interested in doing this? is there anything I can do to help? It's cool if not.

Hi there. Yes, I am. I'll start it soon if it is not urgent. I just want to get some things sorted before I work on it.

tychoish commented 8 months ago

@Lilit0x totally cool. This is an exciting feature for me and I just wanted to see where things were, and also to make sure that there weren't any parts that were confusing. Do let me know if I can help (or if there's a better way for you to coordinate with me/us!).

universalmind303 commented 8 months ago

I'm a little late to the party here, and I know @Lilit0x already started working on this a bit, but I don't think we should try to reinvent the wheel too much on this one.

While arrow-rs doesn't have a JSON implementation, there are other high-performance, Arrow-based JSON readers we can draw inspiration from; Polars and arrow2 are both good examples. The arrow2 implementation is much simpler and could be a good starting point for us. The Polars implementation is quite a bit more complex due to the optimization techniques it uses.

tychoish commented 8 months ago

Definitely agree @universalmind303 about not reinventing the wheel, but what are you proposing here? It doesn't look like there's an obvious path forward to using either implementation directly, unless you have something else in mind. @Lilit0x's work seems reasonable.

universalmind303 commented 8 months ago

> Definitely agree @universalmind303 about not reinventing the wheel, but what are you proposing here? It doesn't look like there's an obvious path forward to using either implementation directly, unless you have something else in mind. @Lilit0x's work seems reasonable.

A lot of the problems mentioned (schema inference, supertyping, error handling, ...) are already solved in these libraries; instead of trying to re-solve those same problems, we should just do what they do. Our implementation may be slightly different, but the same rules apply, so we can follow their logic and implement it for our use case instead of defining our own logic and rules for handling JSON data.

So I think the work in https://github.com/GlareDB/glaredb/pull/2306 is a good starting point. What I'm suggesting is that while finishing up #2306 we should use polars and arrow2 as models for what our own implementation should look like. If there are unsolved problems such as supertyping or schema inference, we should look to those for solutions before DIYing our own.

Lilit0x commented 8 months ago

Hi @tychoish, I apologize for the slow progress on this. I have a faulty machine that I haven't been able to repair yet, so I have to use an online VM to write code, which is not the best experience.

I think I understand @universalmind303's concern. These libraries have solved most of the problems that relate to parsing JSON, so instead of writing our own we can look into theirs and adapt it to work with the current workflow, which should speed up development. That said, I have already looked into arrow2 and used the deserialization process from the library, although I am facing an "unequal length" error in `StructBuilder` and `Fields` when parsing nested objects. I should be able to figure it out and complete the first task in this issue.

tychoish commented 6 months ago

With #2729, I think this is pretty feature complete.