
Entwine - point cloud organization for massive datasets
https://entwine.io

Data ingestion - best practices #182

Closed. thompsonab closed this issue 5 years ago.

thompsonab commented 5 years ago

Background

I've been tasked with researching and evaluating different tools for the ingestion, storage, and retrieval of point cloud data, to be integrated into an existing Python code base. So far, I've looked into TileDB, pgpointcloud, and Entwine, and I may continue to evaluate other options as time permits (and as long as no solution has been found).

Details

We have a variety of point cloud data sets (all from the same source) available for testing. Some of my co-workers also develop a tool that can generate point cloud data, which could serve as another data source (whether for testing or for use in a deployment environment).

We need to use these point cloud data sets to generate elevation data, possibly on the fly, for a geospatial application. I'm trying to determine which tool gives us the best balance of capabilities, performance, and ease of integration with the Python-based back-end we already use to query other geospatial data sets.

Issues

I'm struggling to understand exactly how to configure the ingestion process (i.e., entwine build ...) so that it can successfully chew through ~7.5 billion points, and that's just a small sample of what we need to support. Ideally, we'd store the coordinates in EPSG:4326, since we already use that SRS to query our existing geospatial data sets, but another SRS would suffice, since we could convert the query coordinates to the data set's SRS beforehand if necessary.

I've tried entwine build -i <LA[SZ] files' parent directory> -o <output directory> -r "EPSG:4326" -t <threads>, but it eventually consumes all the RAM on the system and is killed by the OS. I've also tried changing the SRS to 3857 and 4978 just to see how it looks in Potree or the like. That worked for me a couple of times, but I have yet to successfully reproduce the results for other, larger data sets.
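
For reference, the same build can be expressed through Entwine's JSON configuration file, which is passed as the positional argument to entwine build. A minimal sketch, with placeholder paths rather than my real ones:

```
# Write a build configuration using Entwine's documented keys
# (input, output, reprojection, threads), then run the build.
cat > build.json <<'EOF'
{
    "input": "/data/las/",
    "output": "/data/ept/",
    "reprojection": { "out": "EPSG:4326" },
    "threads": 8
}
EOF

entwine build build.json
```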

Can anyone provide pointers on how to properly use Entwine to consistently and successfully ingest, store, and retrieve point cloud data, regardless of data set size?

hobu commented 5 years ago

This ticket is not a bug report; it is a request for consulting. Luckily, we are available for consulting at https://hobu.co

The short answer to your query is that you need to process your data in splits and then merge them back together. That is how the large data services at https://usgs.entwine.io, https://hobbslidar.com, and http://potree.entwine.io are made.
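
A minimal sketch of that split-and-merge workflow, assuming a local directory of LAS/LAZ inputs and Entwine's subset builds (the docs require the subset count to be a power of 4: 4, 16, 64, ...); the paths and thread count here are placeholders:

```
INPUT=/data/las    # placeholder: parent directory of the LAS/LAZ inputs
OUTPUT=/data/ept   # placeholder: EPT output directory

# Build each spatial subset independently. Each subset handles a fraction
# of the data, which bounds peak memory; subsets can also run in parallel
# on separate machines, as long as they all target the same output.
for i in 1 2 3 4; do
    entwine build -i "$INPUT" -o "$OUTPUT" -r EPSG:4326 -t 8 --subset "$i" 4
done

# Stitch the completed subsets into a single queryable EPT dataset.
entwine merge "$OUTPUT"
```

For ~7.5 billion points, 16 or 64 subsets are probably more appropriate than 4. Once merged, the EPT output can be viewed in Potree-style viewers or queried from Python via PDAL's readers.ept, which supports windowed bounds queries.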