ADAPT / Standard

ADAPT Standard data model issue management
https://adaptstandard.org
MIT License

Compression of ADAPT serialized datasets #127

Open knelson-farmbeltnorth opened 8 months ago

knelson-farmbeltnorth commented 8 months ago

Initial discussion in the 1 Nov 2023 meeting about whether and how ADAPT datasets should be compressed.

Agreement in the meeting that at no time should an ADAPT dataset contain compressed archives within compressed archives, or have an uncompressed adapt.json file with compressed sub files.
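One way to check the first of these rules mechanically is to list an archive's members and flag any that themselves look like compressed archives. The sketch below assumes the dataset is packaged as a bzip2-compressed tarball. The function name list_nested_archives and the extension list are illustrative assumptions, not part of ADAPT, and matching on file extensions is only a heuristic.

```shell
# list_nested_archives ARCHIVE
# Prints every member whose name ends in a common archive/compression
# extension. Returns non-zero when no such member is found.
# The extension list is an assumption for illustration only.
list_nested_archives() {
    tar -tjf "$1" | grep -Ei '\.(zip|gz|tgz|bz2|tbz2|xz|7z|zst)$'
}

# Demo with placeholder files (not real ADAPT data).
work=$(mktemp -d)
mkdir -p "$work/clean/geospatial" "$work/nested/geospatial"
printf '{}\n' > "$work/clean/adapt.json"
printf 'placeholder\n' > "$work/clean/geospatial/field1.parquet"
printf '{}\n' > "$work/nested/adapt.json"
printf 'placeholder\n' > "$work/nested/geospatial/field1.parquet.gz"

# -C enters each dataset directory so adapt.json sits at the archive root.
tar -cjf "$work/clean.tar.bz2" -C "$work/clean" adapt.json geospatial
tar -cjf "$work/nested.tar.bz2" -C "$work/nested" adapt.json geospatial

for archive in "$work/clean.tar.bz2" "$work/nested.tar.bz2"; do
    if list_nested_archives "$archive" > /dev/null; then
        echo "$archive: contains a nested compressed file"
    else
        echo "$archive: no nested compressed files found"
    fi
done
```

A check like this would not catch a compressed file with a misleading name; inspecting file signatures would be more thorough.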

The question of how to compress the adapt.json and its constituent geospatial files was not resolved, however.

Some participants were in favor of ADAPT making no requirement of how entire datasets should be compressed (or not compressed). Other participants suggested we find a compression standard that has wide support and require that data be compressed with that standard and only that standard.

crutt commented 7 months ago

How about .tar.bz2, as it is an open format with wide usage and support, and doesn't support encryption? GZip is also a good candidate instead of BZip, if it is viewed as more available/accessible.

Here's my crack at outlining archive/compression support.

Single File Archive / Compression

Any system that supports the creation of an archive of ADAPT standard data SHALL conform to the Archive Structure, and MUST support the Standard Archive Format (tarball bzip2) as an option.

Systems that generate archives SHALL NOT require encryption or password protection.

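Archive Structure

• ./adapt.json
  • The adapt.json file is REQUIRED and MUST be at the root of the archive.
• ./*/
  • Additional files are OPTIONAL, and SHALL only be included if referenced in the adapt.json file (i.e. geospatial rasters/parquets).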
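The requirement that adapt.json sit at the root of the archive can be checked from the member listing alone. This is a minimal sketch: the function name has_root_adapt_json and the file names in the demo are hypothetical. It does not check whether the additional files are referenced from adapt.json, since that depends on the adapt.json schema.

```shell
# has_root_adapt_json ARCHIVE
# Succeeds if the .tar.bz2 archive stores adapt.json at its root.
has_root_adapt_json() {
    # Members may be recorded as "adapt.json" or "./adapt.json";
    # strip a leading "./" and require an exact, literal match.
    tar -tjf "$1" | sed 's|^\./||' | grep -qxF 'adapt.json'
}

# Demo with placeholder files (not real ADAPT data).
work=$(mktemp -d)
mkdir -p "$work/dataset/geospatial"
printf '{}\n' > "$work/dataset/adapt.json"
printf 'placeholder\n' > "$work/dataset/geospatial/field1.parquet"

# Good: -C enters the dataset directory, so adapt.json is stored at the root.
tar -cjf "$work/good.tar.bz2" -C "$work/dataset" adapt.json geospatial
# Bad: archiving the parent directory stores it as dataset/adapt.json.
tar -cjf "$work/bad.tar.bz2" -C "$work" dataset

for archive in "$work/good.tar.bz2" "$work/bad.tar.bz2"; do
    if has_root_adapt_json "$archive"; then
        echo "$archive: adapt.json found at root"
    else
        echo "$archive: adapt.json missing from root"
    fi
done
```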

Standard Archive Format

The Standard Archive Format is a tarball bzip2 file with a .tar.bz2 extension.

Tarballs are an open standard for archiving multiple files into a single file, with broad support across operating systems and programming languages.

BZip2 compression is also an open standard with similar support and generally better compression than GZip.

Tar/bz2 support is widely available, and installed by default on many operating systems including macOS and many Linux distributions. On Windows, additional software may be required, such as 7-Zip or WSL.
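To compare gzip and bzip2 on a given dataset, the same files can be archived both ways and the sizes compared. The sketch below uses synthetic placeholder data, so the numbers it prints say nothing about real ADAPT datasets. All file names are assumptions for illustration.

```shell
# Build a small placeholder dataset (not real ADAPT data).
work=$(mktemp -d)
mkdir -p "$work/dataset/geospatial"
printf '{}\n' > "$work/dataset/adapt.json"
# Synthetic stand-in for a geospatial file; real data will behave differently.
seq 1 50000 > "$work/dataset/geospatial/field1.txt"

# -z selects gzip compression, -j selects bzip2 compression.
tar -czf "$work/archive.tar.gz"  -C "$work/dataset" adapt.json geospatial
tar -cjf "$work/archive.tar.bz2" -C "$work/dataset" adapt.json geospatial

# wc -c counts bytes; tr strips the padding some platforms add.
gz_size=$(wc -c < "$work/archive.tar.gz" | tr -d ' ')
bz2_size=$(wc -c < "$work/archive.tar.bz2" | tr -d ' ')
echo "gzip:  $gz_size bytes"
echo "bzip2: $bz2_size bytes"
```

Parquet files are often compressed internally already, so the difference on real datasets may be smaller than on this synthetic text.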

Creating an archive

tar -cjf archive.tar.bz2 adapt.json ./geospatial/

-c: create a new archive
-j: use bzip2 compression
-f: specify the output file name

Extracting an archive

tar -xjf archive.tar.bz2

-x: extract files from an archive
-j: decompress with bzip2
-f: specify the archive file to read
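The create and extract commands can be combined into a round trip that also lists the archive and confirms nothing was lost. This is a sketch using placeholder files in a temporary directory; the file names and contents are assumptions, not real ADAPT data.

```shell
# Build a tiny placeholder dataset.
work=$(mktemp -d)
mkdir -p "$work/dataset/geospatial" "$work/restored"
printf '{"placeholder": true}\n' > "$work/dataset/adapt.json"
printf 'placeholder bytes\n' > "$work/dataset/geospatial/field1.parquet"

# -C switches into the dataset directory first, so adapt.json is
# stored at the root of the archive rather than under dataset/.
tar -cjf "$work/archive.tar.bz2" -C "$work/dataset" adapt.json geospatial

# -t lists the archive members without extracting anything.
tar -tjf "$work/archive.tar.bz2"

# Extract into a separate directory, then compare it with the source.
tar -xjf "$work/archive.tar.bz2" -C "$work/restored"
diff -r "$work/dataset" "$work/restored" && echo "round trip OK"
```

Listing with -t before extracting is a cheap way to inspect an archive received from another system.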

knelson-farmbeltnorth commented 7 months ago

Agreement in the 29 November 2023 meeting to adopt the approach above as a recommendation rather than a requirement.

Andreasox commented 3 months ago

Hi

As I am only an interested reader of ADAPT, I am hijacking an earlier thread instead of creating a new one.

I note that GDAL has implemented GeoParquet spatial sorting functionality in https://github.com/OSGeo/gdal/pull/9185 which should substantially enhance the read speed of large files. Is this being considered in ADAPT?

Best regards

Andreas Oxenstierna

On 13 Nov 2023, 16:43 +0100, Chris wrote:

Archive Structure

• ./adapt.json
  • The adapt.json file is REQUIRED and MUST be at the root of the archive.
• ./*/
  • Additional files are OPTIONAL, and SHALL only be included if referenced in the adapt.json file (i.e. geospatial rasters/parquets).

knelson-farmbeltnorth commented 3 months ago

@Andreasox I suspect your mention of it is the first many of us have heard of it. To date, our decisions have just been that vector data should be stored as GeoParquet and that, for common use cases mapping field coverage, all geometries should be polygons. The definition of all other columns is handled in the JSON header data, which maps to the GeoParquet columns by column index.

Are you suggesting that we require the bbox column?

Andreasox commented 3 months ago

Hi

If it is to be used purely as a transfer format, then a spatial index is not necessary. But if the data is to be displayed visually, for example in QGIS, a bbox should substantially enhance the read speed of large files. I would recommend the bbox, but not make it mandatory, as I assume it is only GDAL/OGR that will make use of it. Note that GDAL/OGR is used "everywhere" in the GIS sector (including by QGIS), so its GeoParquet support may be widely used over time.

If relevant, I can test different performance scenarios if given relevant files.

Regards

Andreas Oxenstierna

knelson-farmbeltnorth commented 3 months ago

We discussed in the 27 March 2024 meeting and are not going to require the bounding box data.