
Entwine - point cloud organization for massive datasets
https://entwine.io

Data ingestion - best practices #182

Closed. thompsonab closed this issue 5 years ago.

thompsonab commented 5 years ago

Background

I've been tasked with researching and evaluating different tools for the ingestion, storage, and retrieval of point cloud data, to be integrated into an existing Python code base. So far, I've looked into TileDB, pgpointcloud, and Entwine, and I may continue to evaluate other options as time permits (and as long as no solution has been found).

Details

We have a variety of point cloud data sets (all from the same source) available for testing. Some of my co-workers also develop a tool that can generate point cloud data, which could serve as another data source (whether for testing or for use in a deployment environment).

We need to use these point cloud data sets to generate elevation data, possibly on the fly, for a geospatial application. I'm trying to determine which tool gives us the best balance of capabilities, performance, and ease of integration with the Python-based back-end we already use to query other geospatial data sets.

Issues

I'm struggling to understand exactly how to configure the ingestion process (i.e., entwine build ...) so that it can successfully chew through ~7.5 billion points, and that's just a small sample of what we need to support. Ideally, we'd store the coordinates in EPSG:4326, since we already use that SRS to query our existing geospatial data sets, but another SRS would suffice, since we could convert the query coordinates to the data set's SRS beforehand if necessary.

I've tried entwine build -i <LA[SZ] files' parent directory> -o <output directory> -r "EPSG:4326" -t <threads>, but it eventually consumes all the RAM on the system and is killed by the OS. I've also tried changing the SRS to 3857 and 4978 just to see how it looks in Potree or the like. That worked for me a couple of times, but I have yet to successfully reproduce the results for other, larger data sets.
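
For reference, the same build can be expressed through Entwine's JSON configuration file, which is passed as the positional argument to entwine build. A minimal sketch, with placeholder paths rather than my real ones:

```
# Write a build configuration using Entwine's documented keys
# (input, output, reprojection, threads), then run the build.
cat > build.json <<'EOF'
{
    "input": "/data/las/",
    "output": "/data/ept/",
    "reprojection": { "out": "EPSG:4326" },
    "threads": 8
}
EOF

entwine build build.json
```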

Can anyone provide pointers on how to properly use Entwine to consistently and successfully ingest, store, and retrieve point cloud data, regardless of data set size?

hobu commented 5 years ago

This ticket is not a bug report; it is a request for consulting. Luckily, we are available for consulting at https://hobu.co

The short answer to your query is that you need to process your data in splits and then merge them back together. That is how the large data services at https://usgs.entwine.io, https://hobbslidar.com, and http://potree.entwine.io are made.
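
A minimal sketch of that split-and-merge workflow, assuming a local directory of LAS/LAZ inputs and Entwine's subset builds (the docs require the subset count to be a power of 4: 4, 16, 64, ...); the paths and thread count here are placeholders:

```
INPUT=/data/las    # placeholder: parent directory of the LAS/LAZ inputs
OUTPUT=/data/ept   # placeholder: EPT output directory

# Build each spatial subset independently. Each subset handles a fraction
# of the data, which bounds peak memory; subsets can also run in parallel
# on separate machines, as long as they all target the same output.
for i in 1 2 3 4; do
    entwine build -i "$INPUT" -o "$OUTPUT" -r EPSG:4326 -t 8 --subset "$i" 4
done

# Stitch the completed subsets into a single queryable EPT dataset.
entwine merge "$OUTPUT"
```

For ~7.5 billion points, 16 or 64 subsets are probably more appropriate than 4. Once merged, the EPT output can be viewed in Potree-style viewers or queried from Python via PDAL's readers.ept, which supports windowed bounds queries.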