Add an optional MIME types VLR that describes other VLRs in the file

hobu commented 1 year ago

What is the issue about?

Inquiry about the specification

Issue description

Problem

I wish the LAS ecosystem had better VLR interoperability. Unless they are baked into the specification, VLRs are not really consumable without "just knowing" a particular user_id / record_id combination. Usually that's only your own VLRs, but a particular application might know about one or two other applications' VLRs and treat them accordingly.

One increasingly common need is storing metadata WITH the point cloud data. Sometimes that metadata is a full FGDC metadata document, sometimes it's just a Word .docx or a .pdf, or maybe it is a simple Markdown text file that describes how the file was made. Regardless, the specification has no way to communicate the type of content inside a VLR. It would be really nice to be able to do this for metadata.

The NGA BPF Specification has a concept of a "Bundle File" that is a little like a VLR. The idea is to stuff whatever you want into a blob and give it a filename. The content type is only implicitly defined by that filename's extension, however. There's no MIME type to explicitly tell you what that file is supposed to be. I think we could do better with LAS by providing an optional VLR that gives a simple map of user_id / record_id / mimetype / (optional) filename.

Proposal

[
    {
        "user_id":"PDAL",
        "record_id":12,
        "mimetype":"application/json",
        "description":"PDAL metadata output as a JSON document"
    },
    {
        "user_id":"USGS",
        "record_id":86,
        "mimetype":"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "filename":"metadata.docx",
        "description":"Random stuff pasted into a word document that MIGHT describe how the data came into being"
    }
]
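
For illustration, here is a minimal sketch of how a reader might consume such a VLR once its payload has been extracted as text. The function names (`build_mime_index`, `mimetype_for`) and the `application/octet-stream` fallback are assumptions made for this example, not part of the proposal, and getting the payload bytes out of the LAS file is left out entirely.

```python
import json

# Example payload mirroring the proposal above (descriptions trimmed).
MIME_VLR_PAYLOAD = """
[
    {"user_id": "PDAL", "record_id": 12, "mimetype": "application/json"},
    {"user_id": "USGS", "record_id": 86,
     "mimetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
     "filename": "metadata.docx"}
]
"""


def build_mime_index(payload: str) -> dict:
    """Map (user_id, record_id) -> entry so a reader can look up each VLR it meets."""
    entries = json.loads(payload)
    return {(e["user_id"], e["record_id"]): e for e in entries}


def mimetype_for(index: dict, user_id: str, record_id: int,
                 default: str = "application/octet-stream") -> str:
    """Return the declared MIME type, or a generic fallback (an assumption) if undeclared."""
    entry = index.get((user_id, record_id))
    return entry["mimetype"] if entry else default


index = build_mime_index(MIME_VLR_PAYLOAD)
pdal_type = mimetype_for(index, "PDAL", 12)
unknown_type = mimetype_for(index, "SomeVendor", 1)
```

A reader that doesn't care about content types never has to touch this VLR, which is the point of keeping it optional.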

Notes:

We should use JSON Schema to describe a schema document for these things (and any other JSON VLRs we might make).
This isn't a replacement for the header.
You are probably writing this at the end of the file as an EVLR, since you don't know your content types until after you write them all.
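
As a rough idea of what a JSON Schema for this VLR could look like, here is a sketch written as a Python dict and dumped to JSON. Nothing in it is agreed: the required fields follow the ones suggested later in this thread (user_id, record_id, mimetype), and the `maxLength` / `maximum` limits are my assumption, taken from the 16-byte User ID and unsigned-short Record ID fields of the LAS VLR header.

```python
import json

# Hypothetical schema for the proposed MIME types VLR; not an adopted document.
MIME_VLR_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "LAS MIME types VLR (sketch)",
    "type": "array",
    "items": {
        "type": "object",
        "required": ["user_id", "record_id", "mimetype"],
        "properties": {
            # Assumed limit: the VLR header stores User ID in 16 bytes.
            "user_id": {"type": "string", "maxLength": 16},
            # Assumed range: the VLR header stores Record ID as an unsigned short.
            "record_id": {"type": "integer", "minimum": 0, "maximum": 65535},
            "mimetype": {"type": "string"},
            "filename": {"type": "string"},
            "description": {"type": "string"},
        },
        # Extra fields stay allowed so applications can extend entries.
        "additionalProperties": True,
    },
}

schema_text = json.dumps(MIME_VLR_SCHEMA, indent=4)
```

Any JSON Schema validator could then check a payload against `schema_text`; which validator to use is left open here.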

FAQ

Why make a new VLR instead of augmenting the current VLR headers?

Because it should be optional and we don't want to cause people to change any existing software.

Why use JSON?

It's what people use for this kind of thing nowadays. Depending on the schema, it can also be extendable so people can add their own stuff to it if they want. That said, I'm biased toward JSON as a contributing author to the GeoJSON specification, so take my suggestion accordingly 😛

hobu commented 7 months ago

Additional comments:

MIME types are how the internet communicates the content of files and protocols. Aligning LAS with these conventions will make it easier for people using LAS in the context of other systems to communicate the content of LAS files.
There are already MIME types registered for LAS, LAZ, and BPF. The current list can be found at https://www.iana.org/assignments/media-types/media-types.xhtml
It is likely that this VLR would be written as an EVLR at the end of a file, but it wouldn't have to be.
I would propose that the only REQUIRED fields in each entry be user_id, record_id, and mimetype. Any other fields, including complex JSON objects if desired, would be explicitly allowed.
If a record_id/user_id pair is not matched in the file, it should be ignored. This would allow applications to write a stock VLR for all of the VLR mimetypes they might add to a file.
If there is an existing mimetype VLR/EVLR in the file, writers must APPEND their entries to the JSON block, not overwrite.
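
The rules above (three required fields, ignore unmatched pairs, append rather than overwrite) could be sketched roughly as follows. The function names are hypothetical, and skipping a new entry whose user_id / record_id pair is already listed is my own assumption; the comments above don't say what should happen to duplicates.

```python
import json

REQUIRED_FIELDS = ("user_id", "record_id", "mimetype")


def check_entry(entry: dict) -> None:
    """Raise if one of the three required fields is missing; extra fields are fine."""
    missing = [name for name in REQUIRED_FIELDS if name not in entry]
    if missing:
        raise ValueError(f"entry is missing required fields: {missing}")


def append_entries(existing_payload: str, new_entries: list) -> str:
    """Append new entries to an existing payload without touching what is there."""
    existing = json.loads(existing_payload) if existing_payload else []
    seen = {(e["user_id"], e["record_id"]) for e in existing}
    merged = list(existing)
    for entry in new_entries:
        check_entry(entry)
        key = (entry["user_id"], entry["record_id"])
        if key not in seen:  # assumption: an already-listed pair is left as-is
            merged.append(entry)
            seen.add(key)
    return json.dumps(merged, indent=4)


def entries_in_file(payload: str, vlr_keys: set) -> list:
    """Keep only entries whose (user_id, record_id) pair matches a VLR in the file."""
    return [e for e in json.loads(payload)
            if (e["user_id"], e["record_id"]) in vlr_keys]


existing = json.dumps([{"user_id": "PDAL", "record_id": 12,
                        "mimetype": "application/json"}])
merged = append_entries(existing, [{"user_id": "USGS", "record_id": 86,
                                    "mimetype": "application/pdf"}])
matched = entries_in_file(merged, {("USGS", 86)})
```

Here `matched` holds only the USGS entry, because the hypothetical file contains no PDAL / 12 VLR; the stale PDAL entry stays in the JSON and is simply ignored when reading.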

esilvia commented 2 months ago

Discussed in the LWG meeting today. Primary motivator is for those with large data holdings such as @kjwaters (NOAA) and @jdnimetz. I personally haven't seen many folks try to embed files like docx or xml or pdf etc into the (E)VLRs and so I don't see a lot of value. Maybe others have this problem? I'd love to get a few more opinions on the record.

There's a concern that every LAS file in an archive would carry the same multi-MB PDF in its header, causing storage bloat with limited advantage.
