TileDB-Inc / TileDB

The Universal Storage Engine
https://tiledb.com
MIT License

Using laz-perf as a LAZ compressor? #3074

Open ryan-salo opened 2 years ago

ryan-salo commented 2 years ago

I've been looking into storing point cloud data in TileDB arrays, but one of my hesitations is the larger data volume relative to LAZ files. I played around with different compressors and levels for the coordinates and attributes, but I couldn't get anything close to the original LAZ file size.

Would it be possible to link in laz-perf and provide laz as an additional compressor?

stavrospapadopoulos commented 2 years ago

Hi @ryan-salo, we will soon publish several tutorials on tweaking the TileDB compression for LAZ data. The current defaults are not appropriate; we are fixing them in the next release, which is imminent.

To achieve even better compression, we are designing a new compressor that will be especially beneficial for the GpsTime field, which is of type double. This field is what currently hurts TileDB relative to LAZ; the rest of the fields compress pretty well with off-the-shelf compressors (like zstd and bzip2). To address this issue, the new compressor:

  1. Sorts on GpsTime within the GpsTime and X, Y and Z tiles (without impacting the rest of the attributes)
  2. Computes and sorts the pairwise XORs of the sorted GpsTime values
  3. Compresses the result with bzip2

In my local experiments, the above achieves massive compression for GpsTime (~10x, versus the 2x we currently achieve with zstd). I believe that will bring TileDB on par with LAZ in terms of data sizes.
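For readers who want to experiment, the three steps can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not TileDB's implementation: the function names are hypothetical, it handles only the GpsTime column (it does not reorder X, Y and Z alongside it), and it leaves the XORs in sorted-time order rather than sorting them, since the comment does not say how the original order would be recovered after that second sort.

```python
import bz2
import random
import struct


def xor_delta_compress(gps_times):
    """Sort doubles, XOR consecutive bit patterns, then bzip2 the result."""
    ordered = sorted(gps_times)
    # Reinterpret each double as its 64-bit unsigned integer bit pattern.
    bits = [struct.unpack("<Q", struct.pack("<d", t))[0] for t in ordered]
    # Keep the first value as-is, then store the XOR of each neighbouring pair.
    # Close values share their high-order bits, so the XORs are mostly zeros.
    xors = bits[:1] + [a ^ b for a, b in zip(bits, bits[1:])]
    payload = struct.pack(f"<{len(xors)}Q", *xors)
    return bz2.compress(payload)


def xor_delta_decompress(blob):
    """Invert xor_delta_compress; returns the values in sorted order."""
    raw = bz2.decompress(blob)
    xors = struct.unpack(f"<{len(raw) // 8}Q", raw)
    bits, prev = [], 0
    for x in xors:
        prev ^= x  # running XOR restores each original bit pattern
        bits.append(prev)
    return [struct.unpack("<d", struct.pack("<Q", b))[0] for b in bits]


if __name__ == "__main__":
    # Synthetic, made-up timestamps: small increments, then shuffled.
    random.seed(0)
    t, times = 300000.0, []
    for _ in range(10000):
        t += random.choice([0.000004, 0.000005])
        times.append(t)
    random.shuffle(times)
    blob = xor_delta_compress(times)
    print(len(times) * 8, "raw bytes ->", len(blob), "compressed bytes")
```

The ratio you see depends entirely on the data; the synthetic timestamps above are only a stand-in for real GpsTime values.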

The reason why we don't use laz-perf off-the-shelf is that TileDB is a columnar format (like Parquet) and stores the values of each field/attribute in separate files. If we coalesced the fields, then we would hinder the ability to rapidly subselect on a subset of the fields, so performance would be impacted significantly. I believe that the new compressor we are working on will achieve the desired compression ratio.
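To illustrate the columnar argument, here is a toy sketch in which each field is compressed into its own buffer, so a query on one field never touches the others. It is not TileDB's on-disk format: the `write_columnar` and `read_fields` helpers are hypothetical, the sample points are made up, and `zlib` stands in for zstd only because it ships with Python.

```python
import struct
import zlib


def write_columnar(points, fields):
    """Compress each field into its own buffer (one buffer per attribute)."""
    store = {}
    for name in fields:
        values = [p[name] for p in points]
        store[name] = zlib.compress(struct.pack(f"<{len(values)}d", *values))
    return store


def read_fields(store, wanted):
    """Decompress only the requested fields; other buffers are never read."""
    out = {}
    for name in wanted:
        raw = zlib.decompress(store[name])
        out[name] = list(struct.unpack(f"<{len(raw) // 8}d", raw))
    return out


# Made-up sample points for illustration.
points = [
    {"X": 1.0, "Y": 2.0, "Z": 3.0, "GpsTime": 100.0},
    {"X": 1.5, "Y": 2.5, "Z": 3.5, "GpsTime": 100.5},
]
store = write_columnar(points, ["X", "Y", "Z", "GpsTime"])
only_z = read_fields(store, ["Z"])  # X, Y and GpsTime stay compressed
```

A record-oriented compressor that interleaves all fields of a point would have to decode every field to answer the same `Z`-only query, which is the trade-off described above.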

I'll keep you posted on progress on this issue. Thanks for reaching out!

ryan-salo commented 2 years ago

Thanks for the response @stavrospapadopoulos! I'll keep my eyes on this repo for the next release. Sounds like some good improvements are coming!

ryan-salo commented 1 year ago

Just noticed the floating scaling compressor in the latest release, 2.11. Any thoughts on whether this would improve point cloud storage/compression?

stavrospapadopoulos commented 1 year ago

Hi @ryan-salo, it probably will for X, Y and Z. Please stay tuned though: we are working on another compressor that will further improve point cloud storage (specifically for the GpsTime field). We'll experiment with all the new compressors and select the best defaults in our PDAL ingestor.
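As a rough illustration of why float scaling can help with X, Y and Z, the sketch below stores each coordinate as a 32-bit integer computed from a scale and an offset, then compresses the integers. It shows the general idea only and does not use the TileDB API: the function names, the `scale` and `offset` values, and the sample coordinates are all assumptions, and `zlib` stands in for whatever compressor follows the scaling step.

```python
import struct
import zlib


def scale_encode(values, scale, offset):
    """Quantize doubles to int32 as round((v - offset) / scale), then compress."""
    ints = [round((v - offset) / scale) for v in values]
    # Each quantized value must fit in a signed 32-bit integer.
    return zlib.compress(struct.pack(f"<{len(ints)}i", *ints))


def scale_decode(blob, scale, offset):
    """Invert scale_encode; the result is accurate to within scale / 2."""
    raw = zlib.decompress(blob)
    ints = struct.unpack(f"<{len(raw) // 4}i", raw)
    return [i * scale + offset for i in ints]


# Made-up coordinates with centimetre resolution.
xs = [500000.12, 500000.13, 500001.57, 500003.02]
blob = scale_encode(xs, scale=0.01, offset=500000.0)
restored = scale_decode(blob, scale=0.01, offset=500000.0)
```

Note that this is lossy: anything finer than `scale` is discarded, so the scale should match the real precision of the sensor data (LAS/LAZ files themselves store coordinates as scaled integers with a per-file scale and offset).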