bluesky / tiled

API to structured data
https://blueskyproject.io/tiled
BSD 3-Clause "New" or "Revised" License

Support a fallback structure family for "opaque bytes" #434

Open danielballan opened 1 year ago

danielballan commented 1 year ago

This is another idea that arose during the NIST visit.

Tiled's data model constrains everything to be one of its recognized structure families (array, dataframe, sparse, node) or JSON-encodable metadata sitting alongside one of those types. There will be cases where there is binary (not JSON-encodable) information that is relevant and that some client programs will know what to do with.

Our line on this so far has been, "If you have files, use a static file server or Globus or another file-based solution, and link to that from the metadata in Tiled." And for cases where you have a lot of unstructured data (a directory of PowerPoint documents or PDFs) I think that's the right call. But John Henry at NIST articulated a compelling argument that it is worthwhile to enable Tiled to carry binary data in-line when that is useful.

I think this would take the shape of a new structure family, perhaps opaque_bytes. Its structure would simply be a length, and it would be sliceable by byte range. Tiled would not be able to transcode it, only send it in its original representation as bytes. Any context necessary to interpret the bytes would have to be known a priori to the client. The (JSON-encodable) metadata attached to the opaque_bytes node may provide helpful information in this regard for a client, but it would be "opaque" to Tiled itself.
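As a rough illustration of that idea, here is a minimal Python sketch of a node whose structure is only a length and which can be sliced by byte range. The class name `OpaqueBytesNode` and its `read` method are hypothetical and are not part of Tiled; they only show the shape of the proposal.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class OpaqueBytesNode:
    """Hypothetical node: the server knows only the length of the payload."""

    payload: bytes
    # JSON-encodable hints for clients; the server does not interpret them.
    metadata: dict = field(default_factory=dict)

    @property
    def structure(self) -> dict:
        # The only structural fact available is the byte length.
        return {"length": len(self.payload)}

    def read(self, start: int = 0, stop: Optional[int] = None) -> bytes:
        """Return the bytes in [start, stop) exactly as stored, with no transcoding."""
        length = len(self.payload)
        stop = length if stop is None else stop
        if not 0 <= start <= stop <= length:
            raise ValueError(f"byte range [{start}, {stop}) is outside [0, {length})")
        return self.payload[start:stop]


node = OpaqueBytesNode(b"%PDF-1.7 example", metadata={"note": "instrument report"})
header = node.read(0, 8)  # a client that knows the format can inspect the header
```

The range check rejects out-of-bounds requests rather than silently truncating them, which is one possible design choice; a server could equally clamp the range.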

We are all in agreement that if you have mostly unstructured / opaque data, then Tiled is not adding value and you should just use a static file server. But if you have a little unstructured / opaque data and you want to place it logically alongside structured data, there is an argument that Tiled should enable this.

danielballan commented 1 year ago

One possible name for this proposed new structure family is unstructured. I think I like that better than my original suggestion, opaque_bytes.

danielballan commented 1 year ago

Like any other node in Tiled, an unstructured node can include metadata which a client or user may rely on to open/interpret it.

It's worth considering the trade-offs of promoting certain special fields in structure:

The only one we will always know is bytesize. The rest will need to be optional.

danielballan commented 1 year ago

We have discussed various bars that data can clear:

  1. We technically have the bytes, but everything is totally unlabeled.
  2. We have the bytes and some metadata.
  3. We know how to open the file and interpret the bytes as numbers.
  4. We have a schema that tells us in a "machine-actionable" way the significance of the numbers.

Tiled currently insists that you start at (2). When @jmaruland and I first visited NIST, John Henry made the case that we should actually start at (1) --- that we should accept files that we cannot open. That leaves a lot of Tiled's capability on the table. Interfaces like these rely on Tiled being able to open the file and provide it in a known format:

https://tiled-demo.blueskyproject.io/ui/browse/generated/short_table
https://tiled-demo.blueskyproject.io/ui/browse/fxi/raw/1b0b4d73-6d87-43ab-8d62-ed035c51b9b4/primary/data/Andor_image

Given "unstructured" data, Tiled would have to fall back to showing only a "Download" button and leave it to the client/user to figure out how to open the data.

padraic-shafer commented 1 year ago

One possible name for this proposed new structure family is unstructured. I think I like that better than my original suggestion, opaque_bytes.

I'll add a few more into the mix for consideration...because, well, naming things is hard. :)

I hadn't intended to complicate a simple keyword choice, but for some reason felt compelled to do so anyway. :)

danielballan commented 1 year ago

I think you've convinced me that unstructured is not quite right. From the point of view of the user, the data has a structure; it just has not been described to Tiled in a way that Tiled understands.

Of those options I think I like unknown best. Comments:

danielballan commented 1 year ago

I think the current proposal to beat is:

structure_family: unknown
structure:
  mimetype: "..."  # e.g. "application/octet-stream", "text/plain;charset=utf-8"
  length: ...  # number of bytes

Maybe unspecified should also be considered?

padraic-shafer commented 1 year ago

I think the current proposal to beat is …

Agreed. I don’t have strong feelings re: unknown vs. unspecified. Maybe @prjemian and @dylanmcreynolds have a preference?

dylanmcreynolds commented 1 year ago

Slight preference for unspecified...unknown has a slight negative connotation. What about bytearray? I know there's a collision with python types, but isn't it precisely what we're describing? I think the term is pretty common across many languages.

danielballan commented 1 year ago

I had no idea that bytearray was a common term beyond Python. I'm open to it. I agree we want to avoid attaching a negative connotation to this.

dylanmcreynolds commented 1 year ago

octet stream ?

is used to indicate that a body contains arbitrary binary data

padraic-shafer commented 1 year ago

octet stream ?

is used to indicate that a body contains arbitrary binary data

That makes sense. Stream has its own baggage of expectations, but "application/octet-stream" is so commonly used that it's hard to argue against.

danielballan commented 1 year ago

We use application/octet-stream as a MIME type when we send numpy arrays (or chunks of numpy arrays) as C-ordered buffers:

$ git grep "application/octet-stream"
docs/source/explanations/compression.md:content-type: application/octet-stream
docs/source/tutorials/export.md:* C-ordered memory buffer `application/octet-stream`
share/tiled/static/default_ui_settings.yml:      - mimetype: application/octet-stream
tiled/_tests/test_writing.py:        assert value.startswith("data:application/octet-stream;base64,")
tiled/client/array.py:        media_type = "application/octet-stream"
tiled/client/array.py:                headers={"Content-Type": "application/octet-stream"},
tiled/client/array.py:                headers={"Content-Type": "application/octet-stream"},
tiled/media_type_registration.py:            if media_type in {"application/octet-stream", "text/plain"}:
tiled/media_type_registration.py:    "application/octet-stream",
tiled/media_type_registration.py:        "application/octet-stream",
tiled/media_type_registration.py:        "application/octet-stream",
tiled/media_type_registration.py:        "application/octet-stream",
tiled/media_type_registration.py:    for media_type in ["application/octet-stream", APACHE_ARROW_FILE_MIME_TYPE]:
tiled/serialization/array.py:    "application/octet-stream",
tiled/serialization/array.py:    "application/octet-stream",
tiled/server/core.py:    StructureFamily.array: {"*/*": "application/octet-stream", "image/*": "image/png"},
tiled/utils.py:            content = f"data:application/octet-stream;base64,{base64.b64encode(content).decode('utf-8')}"

Unlike TIFF or PNG or Arrow, the context necessary to interpret the C-ordered buffers (their data type and shape) is not inlined into the payload itself---it's in the structure JSON from a different endpoint. That's why we went with application/octet-stream, meaning, "If you don't already know what this binary data is, I can't help you here." A web browser, for example, would not be able to make sense of that as anything but "arbitrary binary data". It takes a Tiled-aware application to join this with the structure info and interpret it.
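To make that joining step concrete, here is a minimal sketch of a client combining the structure JSON with a raw C-ordered buffer. It uses only the standard library and handles just two dimensions and a few data types; the function name `decode_c_ordered` is hypothetical, and the structure dict is shortened to a 2 x 3 example. In practice a client would more likely use `numpy.frombuffer` followed by a reshape.

```python
import struct

# Structure as it might come from the metadata endpoint (shape shortened for the example).
structure = {
    "data_type": {"endianness": "little", "kind": "f", "itemsize": 8},
    "shape": [2, 3],
}


def decode_c_ordered(buffer: bytes, structure: dict) -> list:
    """Interpret a raw C-ordered buffer as nested lists (2D only, for brevity)."""
    data_type = structure["data_type"]
    byte_order = {"little": "<", "big": ">"}[data_type["endianness"]]
    # Only a few (kind, itemsize) pairs are mapped in this sketch.
    code = {("f", 8): "d", ("f", 4): "f", ("i", 8): "q", ("i", 4): "i"}[
        (data_type["kind"], data_type["itemsize"])
    ]
    rows, cols = structure["shape"]
    flat = struct.unpack(f"{byte_order}{rows * cols}{code}", buffer)
    # C order means the last axis varies fastest, so each row is contiguous.
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


# Simulate a payload sent with Content-Type: application/octet-stream.
payload = struct.pack("<6d", 0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
image = decode_c_ordered(payload, structure)
```

Without the `structure` dict, the 48 bytes in `payload` carry no indication of their data type or shape, which is the sense in which the buffer is "arbitrary binary data".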

For the category of use cases addressed by this GH issue, we may actually know a specific MIME type. Use cases include things like Word documents, MATLAB scripts, and PDFs, probably associated with some more structured scientific data. Tiled will not be able to transcode or slice into these nodes, but it can give the client a good hint by saying, "The person who gave me this said it was application/pdf. I hope that means something to you! Good luck!" And for a browser, that will be a great hint.

So my initial reaction is that adding a MIME type like application/octet-stream to the StructureFamily enum would be mixing things that shouldn't be mixed. We should pick a name that is not a MIME type because the node will also have a MIME type.

padraic-shafer commented 1 year ago

it's hard to argue against.

OK, I stand corrected. 😆

danielballan commented 1 year ago

it's hard to argue against.

dylanmcreynolds commented 1 year ago

This is getting silly, but the more I think about it, the more I think I like plain old bytes, even with the python type naming collision. What is it? It's bytes. What do we know about it? Nothing, other than that it's bytes.

danielballan commented 1 year ago

That's pretty compelling.

prjemian commented 1 year ago

Simplicity

padraic-shafer commented 1 year ago

It seems like we have a winner. Should we proceed with using bytes?

danielballan commented 1 year ago

Let's do it. #450 is a good reference for which parts of the codebase need to be touched to add a new StructureFamily.

Some design things to nail down before we write code.

dylanmcreynolds commented 1 year ago

Is there any reason that an adapter can't define structure-family = bytes but their own mime-type? It's hard for my brain to escape the notion that specific mime types could be very useful to clients.

danielballan commented 1 year ago

I think we're on the same page. Compare to this array example, which has a structure_family ("array") and a structure (see JSON below).

$ http https://tiled-demo.blueskyproject.io/api/v1/metadata/generated/small_image/ | jq .data.attributes.structure_family
"array"
$ http https://tiled-demo.blueskyproject.io/api/v1/metadata/generated/small_image/ | jq .data.attributes.structure
{
  "data_type": {
    "endianness": "little",
    "kind": "f",
    "itemsize": 8
  },
  "chunks": [
    [
      300
    ],
    [
      300
    ]
  ],
  "shape": [
    300,
    300
  ],
  "dims": null,
  "resizable": false
}

This proposal is that the structure_family would be "bytes" and the structure would be {"mimetype": "...", "length": N}.
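A minimal sketch of what that could look like on the Python side, assuming a small dataclass for the structure. The `StructureFamily` enum below is a stand-in written for this example, not Tiled's actual definition, and `BytesStructure` and `describe` are hypothetical names.

```python
import enum
import json
from dataclasses import asdict, dataclass


class StructureFamily(str, enum.Enum):
    # Stand-in for illustration only; not Tiled's real enum.
    array = "array"
    bytes = "bytes"  # the proposed new member


@dataclass(frozen=True)
class BytesStructure:
    mimetype: str  # a hint passed through to the client, e.g. "application/pdf"
    length: int  # number of bytes

    def __post_init__(self):
        if self.length < 0:
            raise ValueError("length must be non-negative")


def describe(family: StructureFamily, structure: BytesStructure) -> dict:
    """Build the two attributes a client would see for this node."""
    return {"structure_family": family.value, "structure": asdict(structure)}


attributes = describe(StructureFamily.bytes, BytesStructure("application/pdf", 48213))
print(json.dumps(attributes, indent=2))
```

Keeping the MIME type inside the structure rather than in the family name lets an adapter report any specific type while the structure family stays a single fixed value.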

padraic-shafer commented 1 year ago

I'd be interested in drafting a PR for this, along with some follow up discussions.

It would be great to have a companion for this. @jmaruland are you interested in working on this together?

jmaruland commented 1 year ago

@padraic-shafer Yes, I would love to. I worked on a very similar issue a while ago when we were trying to move away from JSONSchema models to Pydantic models. It will be fun to revisit this topic.

padraic-shafer commented 1 year ago

Fantastic! I'll find a time later this week for us to discuss where to start, and how to proceed.

danielballan commented 1 year ago

Follow-up thoughts here:

  1. We already plan to add a route for accessing underlying files, discussed in #473, something like /asset/{id}. The route we want for this issue has the same meaning: "Get me the underlying file." It should probably be the same route.

  2. Keeping in mind the rule, "Illegal or nonsensical states should be unrepresentable," I think we may not want to put the mimetype in the structure because it's already in the data_source:

    $ http :8000/api/v1/metadata/example?show_sources=true 'Authorization:Apikey secret' | jq .data.attributes.data_sources
    [
      {
        "id": 2,
        "structure": {
          "data_type": {
            "endianness": "little",
            "kind": "i",
            "itemsize": 8
          },
          "chunks": [
            [
              3
            ]
          ],
          "shape": [
            3
          ],
          "dims": null,
          "resizable": false
        },
        "mimetype": "application/x-zarr",
        "parameters": {},
        "assets": [
          {
            "data_uri": "file://localhost/tmp/tmpp0dp686u/data/example",
            "is_directory": true,
            "id": 2
          }
        ],
        "management": "writable"
      }
    ]

And there is space for a size under "assets". It's in the SQL database, just not exposed in the API yet. Maybe better to just refer to those as the truth and let the structure be null, same as it is for "container" structure family.

  3. This work will probably overlap a bit with #521 and should be loosely coordinated with it.
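To illustrate the second point, here is a minimal sketch of treating the data source as the single source of truth while leaving the structure null. The function name `describe_bytes_node`, the example data source, and especially the `"size"` key on each asset are assumptions: the comment above says the size lives in the SQL database but is not yet exposed in the API.

```python
from typing import Optional


def describe_bytes_node(data_source: dict) -> dict:
    """Read mimetype and total size from the data source; structure stays null."""
    # "size" is a hypothetical key; the API does not expose it at the time of writing.
    sizes = [asset.get("size") for asset in data_source["assets"]]
    total: Optional[int] = sum(sizes) if all(s is not None for s in sizes) else None
    return {
        "structure": None,  # same as for the "container" structure family
        "mimetype": data_source["mimetype"],
        "size": total,
    }


# Hypothetical data source for a PDF stored as a single file.
example_source = {
    "id": 7,
    "structure": None,
    "mimetype": "application/pdf",
    "parameters": {},
    "assets": [
        {"data_uri": "file://localhost/data/report.pdf", "is_directory": False, "id": 7, "size": 48213}
    ],
    "management": "external",
}

summary = describe_bytes_node(example_source)
```

Because the mimetype and size are stored in one place only, a structure that disagrees with its data source cannot be represented.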