SlideRuleEarth / sliderule

Server and client framework for on-demand science data processing in the cloud
https://slideruleearth.io

Sort GeoParquet files by time #244

Open jpswinski opened 1 year ago

jpswinski commented 1 year ago

Currently the data in a GeoParquet file is written in the order it is produced/received by the ParquetBuilder object. This is efficient for writing, but not as efficient when reading and manipulating the GeoParquet file later on.

The ParquetBuilder code needs to support sorting the data by the time index and writing it to the file in sorted order.

One idea is to first let the data be written to the file in its entirety. The code could then read back just the times with their corresponding row indexes (each pair should be two 64-bit integers), sort the time/index pairs, and use the newly ordered indexes to read the rows back out of the file and write them to a new file.
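A minimal sketch of the sort step in that idea, in plain Python rather than the ParquetBuilder code itself. The function names `sorted_row_order` and `reorder_rows` are hypothetical, and the rows are assumed to fit in an ordinary list here; in the real case the second step would read rows from the first file and write them to the new one.

```python
def sorted_row_order(times):
    """Return row indexes ordered by ascending time.

    Sorting (time, index) tuples breaks ties on the index, so rows with
    equal times keep their original relative order.
    """
    pairs = sorted(zip(times, range(len(times))))
    return [index for _, index in pairs]


def reorder_rows(rows, order):
    """Stand-in for 'read back out of the file and write a new file'."""
    return [rows[i] for i in order]


if __name__ == "__main__":
    times = [30, 10, 20, 10]
    rows = ["a", "b", "c", "d"]
    order = sorted_row_order(times)
    print(order)
    print(reorder_rows(rows, order))
```

Only the time/index pairs need to be held in memory for the sort; the row payload is touched once, in the final reordering pass.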

For extremely large files, a different algorithm could sort at most ~10M time/index pairs (160MB at 16 bytes per pair) at a time and write each sorted batch to an intermediate file on disk. The code could then open all of the intermediate files, keep a read pointer at the start of each one, and repeatedly write the oldest remaining entry among the open files to a new final file. This would be slow and use a lot of disk space, but it would be a way to support extremely large files (files whose time/index pairs would exhaust available memory).
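The approach described above is essentially an external merge sort. The following is a rough sketch of it in Python, not the ParquetBuilder implementation: the names `write_sorted_runs`, `read_run`, and `merge_runs`, the little-endian on-disk layout, and the `.run` suffix are all assumptions made for illustration. `heapq.merge` plays the role of the per-file read pointers, always yielding the smallest remaining pair across the open files.

```python
import heapq
import os
import struct
import tempfile

# One (time, row index) pair: two signed 64-bit integers, 16 bytes (assumed layout).
PAIR = struct.Struct("<qq")


def write_sorted_runs(pairs, max_pairs_in_memory, directory):
    """Sort bounded chunks of pairs in memory and spill each chunk to its own file."""
    paths = []
    chunk = []

    def flush():
        chunk.sort()
        fd, path = tempfile.mkstemp(dir=directory, suffix=".run")
        with os.fdopen(fd, "wb") as f:
            for pair in chunk:
                f.write(PAIR.pack(*pair))
        paths.append(path)
        chunk.clear()

    for pair in pairs:
        chunk.append(pair)
        if len(chunk) >= max_pairs_in_memory:
            flush()
    if chunk:
        flush()
    return paths


def read_run(path):
    """Yield the pairs of one intermediate file in stored (sorted) order."""
    with open(path, "rb") as f:
        while True:
            record = f.read(PAIR.size)
            if len(record) < PAIR.size:
                break
            yield PAIR.unpack(record)


def merge_runs(paths):
    """Lazily merge all intermediate files into one ascending stream of pairs."""
    return heapq.merge(*(read_run(path) for path in paths))
```

With `max_pairs_in_memory` set to about 10 million, each in-memory chunk holds roughly 160MB of packed pair data, matching the estimate above (Python's own object overhead would make the real footprint larger). The merged stream of row indexes can then drive the final rewrite of the data file.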

jpswinski commented 10 months ago

Another idea: for each column, write all of its data to a separate file while keeping the time column and a row index in memory. Sort the time/index array in memory, then read each column back into memory and reorder it using the sorted indices.
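A small sketch of this per-column variant, again in Python rather than the ParquetBuilder code. It assumes each column was written as a flat binary file of fixed-width values (here 64-bit floats, typecode `"d"`), and the function name `sort_columns_by_time` is hypothetical. Only one column is held in memory at a time, alongside the time array.

```python
import array
import os


def sort_columns_by_time(column_paths, times, typecode="d"):
    """Rewrite each single-column file so its values are in ascending time order.

    `times` is the in-memory time column; the returned list is the sorted
    row order (stable for equal times).
    """
    order = sorted(range(len(times)), key=times.__getitem__)
    for path in column_paths:
        values = array.array(typecode)
        with open(path, "rb") as f:
            values.fromfile(f, len(times))  # raises EOFError if the file is short
        reordered = array.array(typecode, (values[i] for i in order))
        with open(path, "wb") as f:
            reordered.tofile(f)
    return order
```

Overwriting each column file in place keeps disk usage flat; writing the reordered column straight into the final GeoParquet file instead would avoid the second write.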