evinism / mistql

A query / expression language for performing computations on JSON-like structures. Tuned for clientside ML feature extraction.
https://mistql.com
MIT License

Support JSON lines files #120

Open ivbeg opened 2 years ago

ivbeg commented 2 years ago

Please add support for JSON Lines files (https://jsonlines.org/). Many such files are published and in use. Some are huge and hard to convert to JSON.

evinism commented 2 years ago

Fantastic idea! No timeline yet on implementation, but definitely a very useful feature. I've run into this myself :)

evinism commented 2 years ago

Actually @ivbeg, would you be able to describe your ideal interface for such a feature? Would the program run the query over each JSON line individually, or treat the whole file as one large array?
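
To make the two options in this question concrete, here is a minimal Python sketch. `run_query` is a hypothetical stand-in for whatever evaluates a query against one parsed value; it is an assumption for illustration and not part of MistQL's API.

```python
import json
from typing import Any, Callable, Iterable, Iterator


def query_per_line(lines: Iterable[str], run_query: Callable[[Any], Any]) -> Iterator[Any]:
    """Option 1: evaluate the query once per JSON line, yielding results lazily."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield run_query(json.loads(line))


def query_whole_file(lines: Iterable[str], run_query: Callable[[Any], Any]) -> Any:
    """Option 2: parse every line into one array, then evaluate the query once."""
    records = [json.loads(line) for line in lines if line.strip()]
    return run_query(records)


sample = ['{"n": 1}', '{"n": 2}', '{"n": 3}']

# Hypothetical queries written as plain Python callables.
per_line = list(query_per_line(sample, lambda rec: rec["n"] * 10))
whole = query_whole_file(sample, lambda recs: sum(r["n"] for r in recs))
```

The per-line version only ever holds one record in memory, while the whole-file version must hold every record at once, which is the practical difference for very large files.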

ivbeg commented 2 years ago

@evinism It would be great to support both ways of processing JSON Lines files, but streaming is more important because some JSON Lines files are huge, up to 100GB+ compressed. I can provide several examples from public datasets if needed. It's nearly impossible to process such files as one large array.

I've developed a command-line tool, undatum (https://github.com/datacoon/undatum), that supports data processing and conversion of JSON Lines and BSON files. BSON is a binary format used by the MongoDB NoSQL database, very similar to JSON Lines. I would like to integrate a query language into undatum for use with its data processing and conversion operations. I've already used dictquery (https://github.com/cyberlis/dictquery), but it's only good for filtering.

evinism commented 2 years ago

Streaming mode for processing JSONL sounds right to me too. Not sure when I'll get to this, but it's definitely something I want to tackle.

ivbeg commented 2 years ago

@evinism I've added experimental support for MistQL to undatum. It's available on main (https://github.com/datacoon/undatum), version 1.0.13, via the command "undatum query -q <yourquery> <filename>", where the file can be CSV, JSONL, or BSON.

I hope it helps.

evinism commented 2 years ago

Adding @ilan-pinto to this thread. For now, let's work on getting this up and running in Python.

ilan-pinto commented 2 years ago

Hi, please assign this to me.

evinism commented 2 years ago

For reference, a possible interface for this feature could look like this:

tail file.log | python -m mistql.cli foo.bar --lines > processed.jsonl

Note that the query is performed in a streaming manner: for each JSON line read from file.log, the CLI spits out the query result for that line to processed.jsonl.
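
As a rough illustration of that streaming behavior, here is a minimal Python sketch of the read-evaluate-write loop. It is not the actual CLI implementation: `process_lines`, `run_query`, and `foo_bar` are hypothetical names, and `foo_bar` is a plain-Python stand-in for the query `foo.bar`.

```python
import io
import json
from typing import Any, Callable, TextIO


def process_lines(src: TextIO, dst: TextIO, run_query: Callable[[Any], Any]) -> int:
    """Read JSON lines from src, write one JSON result line per input line to dst."""
    count = 0
    for raw in src:  # iterating a text stream reads one line at a time
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines
        result = run_query(json.loads(raw))
        dst.write(json.dumps(result) + "\n")
        count += 1
    return count


def foo_bar(record: Any) -> Any:
    """Hypothetical stand-in for evaluating the query `foo.bar` on one record."""
    return record["foo"]["bar"]


# In-memory streams stand in for the piped input and the redirected output.
src = io.StringIO('{"foo": {"bar": 1}}\n{"foo": {"bar": "x"}}\n')
dst = io.StringIO()
written = process_lines(src, dst, foo_bar)
```

In a real command-line tool, `src` and `dst` would presumably be `sys.stdin` and `sys.stdout`, so the loop composes with shell pipes and redirection.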