Open JohannesOehm opened 2 years ago
Hi! Can you provide a small sample of such JSON?
Sure: foo.ndjson.txt
(I had to change the extension due to GitHub's file-extension restrictions on attachments).
https://codebeautify.org/json-decode-online
The file does not conform to the JSON spec. Does JSON have a "newline-delimited" spec variant? You would have to read the whole file and convert it in memory, which is very slow for huge files. A JSON parser normally reads the file in many small parts (buffer-sized chunks).
Yes, that is true; my file is not valid JSON. However, this format is commonly used in Big Data environments. The specification is available here: http://ndjson.org/
API could be:
DataFrame.readNdJson(path: String, skip: Int? = null, limit: Int? = null)
It can be parsed line by line to JsonElement (or JsonObject?), then joined into a JsonArray and converted to a DataFrame by readJsonImpl.
From https://www.atatus.com/glossary/jsonl/
When dealing with regular JSON, there is essentially just one course of action: load the entire dataset into memory and parse it. With JSON Lines, by contrast, you can break an 11 GB file into smaller files without parsing the whole thing, search for a certain location inside the file, use line-based CLI tools, etc.
A parameter filter: (JsonElement) -> Boolean could be useful too. It would make it easier to load only the relevant parts of a big file.
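Taken together, the proposed skip/limit/filter semantics could be sketched roughly like this. This is only a stdlib-only sketch under my own assumptions: the function name readNdjsonLines is made up, it works on raw strings instead of parsed JsonElements to stay dependency-free (a real implementation would parse each line, e.g. with kotlinx.serialization, before filtering), and applying filter before skip/limit is my design choice, not something the proposal specifies.

```kotlin
import java.io.Reader
import java.io.StringReader

// Hypothetical sketch of the proposed readNdJson parameters, operating on
// raw lines. Assumption: filter runs before skip/limit, so skip and limit
// count only the matching records.
fun readNdjsonLines(
    reader: Reader,
    skip: Int? = null,
    limit: Int? = null,
    filter: (String) -> Boolean = { true },
): List<String> =
    reader.buffered().lineSequence()
        .filter { it.isNotBlank() }   // tolerate a trailing newline
        .filter(filter)               // load only the relevant records
        .drop(skip ?: 0)
        .let { seq -> if (limit != null) seq.take(limit) else seq }
        .toList()

fun main() {
    val ndjson = "{\"a\":1}\n{\"a\":2}\n{\"a\":3}\n"
    println(readNdjsonLines(StringReader(ndjson), skip = 1, limit = 1))
    // → [{"a":2}]
}
```

Because lineSequence() is lazy, a limit would stop reading the underlying file early, which matters for the huge files discussed above.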
Thank you for the great project. It looks very promising to me.
I currently use
val df = DataFrame.readJsonStr(File("foo.ndjson").readLines().joinToString(",", "[", "]"))
to read newline-delimited JSON files, which works quite well. However, it would be much more convenient if the API offered such a function directly. It would also be nice if it worked directly on InputStreams, because readLines() already reads the entire file into memory under the hood.