cityjson / cjio

CityJSON/io: Python CLI to process and manipulate CityJSON files
MIT License

CJIO fails on bigger Dutch OpenData files #54

Open rduivenvoorde opened 3 years ago

rduivenvoorde commented 3 years ago

In the Netherlands, PDOK makes CityJSON files of the whole country available:

https://brt.kadaster.nl/basisvoorziening-3d/

For example, the tile covering my home town Haarlem is this one:

https://download.pdok.nl/kadaster/basisvoorziening-3d/v1_0/2018/volledig/25az1.volledig.zip

The ZIP is 553 MB; unzipped it is 2.7 GB(!)

Trying to cut out a small piece (to load in QGIS with the CityJSON plugin):

cjio 25az1.json subset --bbox 104607 490148 104703 490257 save myarea.json

My (rather beefy) Linux laptop (16 GB RAM, 8 threads) kills the process after a lot of swapping...

So my question:

Any hint on how to handle this open data would be appreciated :-)
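For reference, the bbox test that a subset operation performs boils down to decoding the quantized vertices via the file's `transform` (per the CityJSON spec, real coordinate = integer vertex × scale + translate). A minimal sketch, with made-up transform values:

```python
# How CityJSON quantized vertices map to real-world coordinates:
# real = vertex * scale + translate (per the CityJSON spec).
# The transform and vertex values below are invented for illustration.

def decode_vertex(v, transform):
    """Convert an integer vertex triple to real-world coordinates."""
    s = transform["scale"]
    t = transform["translate"]
    return [v[i] * s[i] + t[i] for i in range(3)]

def in_bbox(xy, bbox):
    """bbox = (minx, miny, maxx, maxy); only x/y are tested."""
    minx, miny, maxx, maxy = bbox
    return minx <= xy[0] <= maxx and miny <= xy[1] <= maxy

transform = {"scale": [0.001, 0.001, 0.001],
             "translate": [104000.0, 490000.0, 0.0]}
vertex = [650000, 200000, 5000]   # stored as integers in "vertices"
x, y, z = decode_vertex(vertex, transform)
print(in_bbox((x, y), (104607, 490148, 104703, 490257)))  # True
```

The catch is that the `"vertices"` array is shared by all city objects, so even this simple test requires the whole file to be parsed first.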

hugoledoux commented 3 years ago

cjio works for me with one such file, but I have 32GB of RAM...

But the main issue is that the files distributed are indeed gigantic. I wish Kadaster could split the tiles into 4 or even 16 sub-tiles. We have told them, but if more do the same it could help.

rduivenvoorde commented 3 years ago

While doing some work for QGIS some months ago, I found that for GeoJSON there are some 'streaming' variants: see for example https://gdal.org/drivers/vector/geojsonseq.html The file is just a list of objects which can be read sequentially (without being read fully into memory first...)

This makes it possible (at least for GeoJSON) to do some sort of streaming reading and writing. Looking at the nature of CityJSON, I see it's not a plain array of objects (the way GeoJSON more or less is), but IF we could come up with some kind of variant like that, parsing would become a lot easier, since you would not need to first read in EVERYTHING to get the info?
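For comparison, the GeoJSONSeq idea is just one JSON object per line, so a reader never holds more than one feature in memory at a time. A minimal sketch (the sample features are invented):

```python
import io
import json

# GeoJSONSeq-style reading: one JSON object per line, parsed one at a
# time, so memory use stays bounded by the largest single feature.
# The sample data is invented for illustration.
sample = io.StringIO(
    '{"type": "Feature", "id": 1, "properties": {"h": 3.2}}\n'
    '{"type": "Feature", "id": 2, "properties": {"h": 7.9}}\n'
)

def stream_features(fp):
    """Yield features one by one from a line-delimited JSON stream."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)

for feat in stream_features(sample):
    print(feat["id"], feat["properties"]["h"])
```

The obstacle for CityJSON is the shared `"vertices"` array: the objects reference it by index, so they cannot simply be emitted line by line without some reorganisation.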

hugoledoux commented 3 years ago

You are right, and I'm fully aware. I worked on this a bit a few months ago, in the context of WFS3, but it would help here too. We designed CityJSON to be simple to process, analyse, and manipulate; such large datasets were not our first priority.

The proposal for how it would work is there:

https://github.com/hugoledoux/cityjson_ogcapi/blob/master/best-practice.md

and there's code: https://github.com/hugoledoux/cityjson_ogcapi

We have an MSc student starting on this topic in September, so expect some results by Christmas!

justb4 commented 3 years ago

Interesting, @hugoledoux, you may want to get in touch with the pygeoapi team; at least I was not aware of this development. There is e.g. support for JSON-LD to link features. Streaming would be very interesting. Backed by a PostgreSQL DB, this was something deegree WFS v2 already supported for complex application-schema GML...

balazsdukai commented 2 years ago

Note to self: Currently (v0.7.3) cjio has a memory footprint of ~10x the file size when deserializing. This is due to the JSON decoder implementation in the standard-library json module that we use. The subset operation pushes memory use even higher, since objects are duplicated.

Briefly tried simdjson as a drop-in replacement for json, but it is worse in terms of memory use, at ~15x the file size when deserializing.

Minor optimizations, such as using tuples instead of lists for storing the vertices, have a negligible impact.

I think the json decoder is about as good as it gets. If we want a lower memory footprint, we need to look elsewhere.
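The expansion factor can be reproduced on a small scale with `tracemalloc` from the standard library. A rough sketch (the synthetic payload below stands in for a CityJSON file; the exact ratio will vary by data shape and Python version):

```python
import json
import tracemalloc

# Compare the byte length of a serialized JSON document with the heap
# memory allocated while decoding it. The payload is synthetic, standing
# in for a CityJSON "vertices" array.
payload = json.dumps({"vertices": [[i, i + 1, i + 2] for i in range(50_000)]})

tracemalloc.start()
data = json.loads(payload)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

ratio = current / len(payload)
print(f"serialized: {len(payload)} bytes, "
      f"decoded: {current} bytes, ratio: {ratio:.1f}x")
```

Because every small list and int becomes a separate Python object with its own header, the decoded structure is several times larger than the text it came from, which is consistent with the ~10x observation above.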