Kotlin / dataframe

Structured data processing in Kotlin
https://kotlin.github.io/dataframe/overview.html
Apache License 2.0

Enhancement: Support new-line delimited JSON format/JSON lines #127

Open JohannesOehm opened 2 years ago

JohannesOehm commented 2 years ago

Thank you for the great project. It looks very promising to me.

I currently use `val df = DataFrame.readJsonStr(File("foo.ndjson").readLines().joinToString(",", "[", "]"))` to read newline-delimited JSON files, which works quite well. However, it would be much more convenient if the API offered such a function directly. It would also be nice if it worked directly on InputStreams, because readLines() already reads the entire file into memory under the hood.
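As a stopgap until such a function exists, the workaround above can be adapted to an InputStream. This is a minimal sketch using only the Kotlin standard library; `ndjsonToJsonArray` is a hypothetical helper name, not part of the DataFrame API. It avoids the intermediate list of lines, but it still builds the complete JSON array text in memory.

```kotlin
import java.io.InputStream

// Hypothetical helper (not DataFrame API): wraps the lines of an NDJSON
// stream into a single JSON array string.
fun ndjsonToJsonArray(input: InputStream): String =
    input.bufferedReader().useLines { lines ->
        lines
            .filter { it.isNotBlank() } // tolerate blank lines and a trailing newline
            .joinToString(separator = ",", prefix = "[", postfix = "]")
    }
```

The result can then be passed to the existing reader, for example `DataFrame.readJsonStr(ndjsonToJsonArray(File("foo.ndjson").inputStream()))`.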

koperagen commented 2 years ago

Hi! Can you provide a small sample of such JSON?

JohannesOehm commented 2 years ago

Sure: foo.ndjson.txt

(I had to change the extension due to GitHub extension issues.)

slavonnet commented 1 year ago

https://codebeautify.org/json-decode-online

The file does not conform to the JSON spec. Does JSON have a "new-line delimited" spec variant? You are trying to read the whole file and convert it in memory, which is very heavy and slow. A JSON parser reads the file in many small parts (the buffer size).

JohannesOehm commented 1 year ago

Yes, I'm aware; that is true. My file is not valid JSON. However, this format is commonly used in big data environments. The specification is available here: http://ndjson.org/

koperagen commented 2 months ago

The API could be: `DataFrame.readNdJson(path: String, skip: Int? = null, limit: Int? = null)`. The file can be parsed line by line into JsonElement (or JsonObject?) values, then joined into a JsonList and converted to a dataframe by readJsonImpl.
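To make the proposed `skip` and `limit` semantics concrete, here is a rough sketch using only the Kotlin standard library. `readNdJsonLines` is a hypothetical name, and the function returns the selected raw lines rather than parsed JsonElement values, so it only illustrates the line selection step, not the actual conversion to a dataframe.

```kotlin
import java.io.File

// Hypothetical sketch of the proposed skip/limit behaviour; not DataFrame API.
// Returns the selected NDJSON records as raw strings.
fun readNdJsonLines(path: String, skip: Int? = null, limit: Int? = null): List<String> =
    File(path).useLines { lines ->
        var selected = lines.filter { it.isNotBlank() } // ignore blank lines
        if (skip != null) selected = selected.drop(skip)   // skip the first records
        if (limit != null) selected = selected.take(limit) // stop reading after `limit` records
        selected.toList() // materialize before the file is closed
    }
```

Because the sequence is lazy, `take(limit)` stops reading the file once enough records have been collected. The returned lines could then be joined with `joinToString(",", "[", "]")` and passed to `DataFrame.readJsonStr`, as in the workaround from the first comment.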

From https://www.atatus.com/glossary/jsonl/

When dealing with regular JSON, there is essentially just one course of action: load the entire dataset into memory and parse it. With JSON Lines, by contrast, you can break an 11 GB file into smaller files without parsing the whole thing, search for a certain location inside the file, use CLI \n-based tools, etc.

A parameter `filter: (JsonElement) -> Boolean` could be useful too. It would make it easier to load only the relevant parts of a big file.
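A filter of this kind could be applied lazily, one record at a time, so that rejected records are never accumulated. The sketch below is deliberately generic to stay within the standard library: `filterNdJson`, `parse`, and `predicate` are hypothetical names, with `predicate` standing in for the proposed `filter` parameter and `T` standing in for JsonElement.

```kotlin
// Hypothetical sketch; not DataFrame API. Parses each non-blank line and keeps
// only the records accepted by `predicate`.
fun <T> filterNdJson(
    lines: Sequence<String>,
    parse: (String) -> T,
    predicate: (T) -> Boolean,
): List<T> =
    lines
        .filter { it.isNotBlank() } // ignore blank lines
        .map(parse)                 // parse one record at a time
        .filter(predicate)          // drop records the caller does not want
        .toList()
```

Assuming kotlinx.serialization is on the classpath, `parse` could be `Json.parseToJsonElement`, which yields a JsonElement for each line.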