Redundant fields in COPC header

connormanning commented 3 years ago

The COPC offsets below are redundant info which is already contained in the LAS VLRs directly:

  uint64_t laz_vlr_offset;      // File offset of the *data* of the LAZ VLR
  uint64_t laz_vlr_size;        // Size of the *data* of the LAZ VLR.
  uint64_t wkt_vlr_offset;      // File offset of the *data* of the WKT VLR if it exists, 0 otherwise
  uint64_t wkt_vlr_size;        // Size of the *data* of the WKT VLR if it exists, 0 otherwise
  uint64_t eb_vlr_offset;       // File offset of the *data* of the extra bytes VLR if it exists, 0 otherwise
  uint64_t eb_vlr_size;         // Size of the *data* of the extra bytes VLR if it exists, 0 otherwise

What is the use-case for this data being duplicated? It seems error-prone to have the same data in multiple spots, and also seems to go against the push behind COPC - to live as a spatially organized structure inside a format that we don't have to invent. Here we are reinventing (a subset of) VLR header information in our own header format.

As for implementing the spec, if you are using a COPC or LAS library like the ones I linked above, you have the VLR information packaged for you for free. And if not, you are going to be manually parsing out the LAS header, the bodies of the VLRs (even if not their headers due to this duplication), walking and parsing the hierarchy, and others. The few lines of VLR header parsing are comparatively trivial.

Both current implementations copc-lib and copc.js awaken the VLR headers on initialization and then ignore these COPC header values in favor of the VLR headers.

And as for reading the data, the magnitudes here are on the order of a few hundred bytes and would generally only be parsed once per reader instance, so I'm skeptical of a big performance impact.

So is there a concrete use-case for this data to exist? I would consider removing it to keep the format simple and less error-prone in the absence of a concrete need.

abellgithub commented 3 years ago

My code uses this data. It doesn't know a thing about VLRs. I'm not sure why you wrote your code as you did.

abellgithub commented 3 years ago

The primary reason for this data is to eliminate round-trips. If you don't care about round-trips at startup, that's up to you, but it was my understanding that eliminating round-trips that would lead to startup delay was important.

hobu commented 3 years ago

It doesn't know a thing about VLRs

It knows a lot about LAS though, which has VLRs defined in them and is what we're doing here. It seems like an indicator to me that the first two client implementations didn't bother to take advantage of this special redundant information. This is because both libraries already have the LAS manacles anyway.

The bullets describing how a client can implement a COPC reader at https://github.com/copcio/copcio.github.io/blob/main/DRAFT-SPEC.md#reader-implementation-notes disregard the fact that clients must potentially do more things that are LAS-related:

Are there extra byte VLRs? I need to unpack that VLR and accommodate if so
Is this a LAZ 1.4 file? The only way to know if it is an LAZ file is to walk the VLRs and look for the laszip VLR.
What am I supposed to do if the offset values in entwine VLR are out of sync with what's actually written?

The primary reason for this data is to eliminate round-trips

I can read the first 64k of the file on the first read and almost always get everything I need VLR-wise. I would rather do that than read 549 bytes to get the offsets and then start skipping around and RTT'ing to places that are most likely in that first 64k of data I fetched anyway. This is especially true if I have to interpret VLRs for other reasons along the way anyway.

My points against this restriction:

It breaks DRY potentially leading to data out of sync
Caring about VLR ordering implies that COPC is a very special LAZ file. We want it to be treated by COPC-unaware clients just like a normal LAZ file insofar as reading data is concerned.
If/when there is a LAS 1.5 and LAZ 1.5 that changes the header and whatever else, extending it with COPC will mean clients need to seek to different special locations and do different things.

CCInc commented 3 years ago

I agree with Howard and Connor, I think COPC libraries should build upon existing LAS implementations rather than need to reinvent the wheel. Making COPC writers have to sync these fields introduces an unnecessary level of complexity IMO.

hobu commented 3 years ago

Removed in #32

copcio / copcio.github.io

Redundant fields in COPC header #21