Open danielballan opened 1 year ago
One possible name for this proposed new structure family is `unstructured`. I think I like that better than my original suggestion, `opaque_bytes`.
Like any other node in Tiled, an `unstructured` node can include `metadata`, which a client or user may rely on to open/interpret it.
It's worth considering the trade-offs of promoting certain special fields in `structure`:
The only one we will always know is `bytesize`. The rest will need to be optional.
We have discussed various bars that data can clear:
Tiled currently insists that you start at (2). When @jmaruland and I first visited NIST, John Henry made the case that we should actually start at (1) --- that we should accept files that we cannot open. That leaves a lot of Tiled's capability on the table. Interfaces like these rely on Tiled being able to open the file and provide it in a known format:
https://tiled-demo.blueskyproject.io/ui/browse/generated/short_table https://tiled-demo.blueskyproject.io/ui/browse/fxi/raw/1b0b4d73-6d87-43ab-8d62-ed035c51b9b4/primary/data/Andor_image
Given "unstructured" data, Tiled would have to fall back to showing only a "Download" button and leave it to the client/user to figure out how to open the data.
> One possible name for this proposed new structure family is `unstructured`. I think I like that better than my original suggestion, `opaque_bytes`.
I'll add a few more into the mix for consideration...because, well, naming things is hard. :)
- `bytes` are what you have and the term is readily understood. To avoid a naming clash with the python built-in, `raw_bytes` or `raw` might work instead.
- `unstructured` is a nice foil to `structure_family`/`StructureFamily` but it's a bit of a misnomer; one of these might be closer to the meaning in this context: `unknown`, `unrecognized`, or `undefined`.
  - `UnknownStructureFamily`; hopefully these would not get confused down the road.
- `custom` or `user_defined`
I hadn't intended to complicate a simple keyword choice, but for some reason felt compelled to do so anyway. :)
I think you've convinced me that `unstructured` is not quite right. From the point of view of the user, the data has a structure; it just has not been described to Tiled in a way that Tiled understands.
Of those options I think I like `unknown` best. Comments:
- `bytes`, including data with a known structure family, I think there's potential for confusion there. And we may want to reserve the term "raw" for https://github.com/bluesky/tiled/issues/277.
- `custom` or `user_defined` because I consider structure family to be intentionally not an extension point in Tiled. I think a custom, user-defined structure family would involve configuring the server and client(s)---possibly in multiple programming languages---to understand it, at least as far as getting from bytes to numbers, if not all the way to the meaning of the numbers. What is being proposed in this issue is not that. It is an escape hatch that says, "Tiled will send this data as is, and it's up to the client(s) to have some a priori knowledge of how to decode it. Tiled's existing mechanisms---the `structure` field and the transcoding mechanisms---cannot help."

I think the current proposal to beat is:
structure_family: unknown
structure:
  mimetype: "..."  # e.g. "application/octet-stream", "text/plain;charset=utf-8"
  length: ...  # number of bytes
Maybe `unspecified` should also be considered?
I think the current proposal to beat is …
Agreed. I don't have strong feelings re: `unknown` vs. `unspecified`. Maybe @prjemian and @dylanmcreynolds have a preference?
Slight preference for `unspecified`... `unknown` has a slight negative connotation. What about `bytearray`? I know there's a collision with python types, but isn't it precisely what we're describing? I think the term is pretty common across many languages.
I had no idea that `bytearray` was a common term beyond Python. I'm open to it. I agree we want to avoid attaching a negative connotation to this.
is used to indicate that a body contains arbitrary binary data
That makes sense. Stream has its own baggage of expectations, but "application/octet-stream" is so commonly used that it's hard to argue against.
We use `application/octet-stream` as a MIME type when we send numpy arrays (or chunks of numpy arrays) as C-ordered buffers:
$ git grep "application/octet-stream"
docs/source/explanations/compression.md:content-type: application/octet-stream
docs/source/tutorials/export.md:* C-ordered memory buffer `application/octet-stream`
share/tiled/static/default_ui_settings.yml: - mimetype: application/octet-stream
tiled/_tests/test_writing.py: assert value.startswith("data:application/octet-stream;base64,")
tiled/client/array.py: media_type = "application/octet-stream"
tiled/client/array.py: headers={"Content-Type": "application/octet-stream"},
tiled/client/array.py: headers={"Content-Type": "application/octet-stream"},
tiled/media_type_registration.py: if media_type in {"application/octet-stream", "text/plain"}:
tiled/media_type_registration.py: "application/octet-stream",
tiled/media_type_registration.py: "application/octet-stream",
tiled/media_type_registration.py: "application/octet-stream",
tiled/media_type_registration.py: "application/octet-stream",
tiled/media_type_registration.py: for media_type in ["application/octet-stream", APACHE_ARROW_FILE_MIME_TYPE]:
tiled/serialization/array.py: "application/octet-stream",
tiled/serialization/array.py: "application/octet-stream",
tiled/server/core.py: StructureFamily.array: {"*/*": "application/octet-stream", "image/*": "image/png"},
tiled/utils.py: content = f"data:application/octet-stream;base64,{base64.b64encode(content).decode('utf-8')}"
Unlike TIFF or PNG or Arrow, the context necessary to interpret the C-ordered buffers (their data type and shape) is not inlined into the payload itself---it's in the `structure` JSON from a different endpoint. That's why we went with `application/octet-stream`, meaning, "If you don't already know what this binary data is, I can't help you here." A web browser, for example, would not be able to make sense of that as anything but "arbitrary binary data". It takes a Tiled-aware application to join this with the `structure` info and interpret it.
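To make the point above concrete, here is a minimal sketch of how a Tiled-aware client might join an `application/octet-stream` payload with the `structure` JSON. It uses only the Python standard library; the function name `decode_c_ordered` and the small format table are illustrative assumptions, not Tiled's actual client code, and only a few data types are covered.

```python
import struct

# Map (kind, itemsize) pairs from the structure JSON to struct format codes.
# Assumption: only these few combinations are handled in this sketch.
_FORMATS = {("f", 8): "d", ("f", 4): "f", ("i", 8): "q", ("i", 4): "i"}
_BYTE_ORDER = {"little": "<", "big": ">"}


def decode_c_ordered(payload, structure):
    """Decode a C-ordered buffer into nested lists, guided by its structure."""
    data_type = structure["data_type"]
    code = _FORMATS[(data_type["kind"], data_type["itemsize"])]
    order = _BYTE_ORDER[data_type["endianness"]]
    shape = structure["shape"]
    count = 1
    for n in shape:
        count *= n
    if len(payload) != count * data_type["itemsize"]:
        raise ValueError("payload size does not match shape and itemsize")
    flat = list(struct.unpack(f"{order}{count}{code}", payload))
    # Rebuild the nesting from the innermost dimension outward (C order).
    for n in reversed(shape[1:]):
        flat = [flat[i:i + n] for i in range(0, len(flat), n)]
    return flat


structure = {
    "data_type": {"endianness": "little", "kind": "f", "itemsize": 8},
    "shape": [2, 3],
}
payload = struct.pack("<6d", 0, 1, 2, 3, 4, 5)
rows = decode_c_ordered(payload, structure)
```

Without `structure`, the same 48 bytes could equally be twelve 32-bit integers or a 3x2 array, which is why the payload alone is "arbitrary binary data" to any other client.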
For the category of use cases addressed by this GH issue, we may actually know a specific MIME type. Use cases include things like Word documents, MATLAB scripts, and PDFs, probably associated with some more structured scientific data. Tiled will not be able to transcode or slice into these nodes, but it can give the client a good hint by saying, "The person who gave me this said it was `application/pdf`. I hope that means something to you! Good luck!" And for a browser, that will be a great hint.
So my initial reaction is that adding a MIME type like `application/octet-stream` to the `StructureFamily` enum would be mixing things that shouldn't be mixed. We should pick a name that is not a MIME type because the node will also have a MIME type.
it's hard to argue against.
OK, I stand corrected. 😆
it's hard to argue against.
This is getting silly, but the more I think about it, the more I think I like plain old `bytes`, even with the python type naming collision. What is it? It's `bytes`. What do we know about it? Nothing, other than it's `bytes`.
That's pretty compelling.
Simplicity
It seems like we have a winner. Should we proceed with using `bytes`?
Let's do it. #450 is a good reference for which parts of the codebase need to be touched to add a new StructureFamily.
Some design things to nail down before we write code.
- `mimetype` (required) and `length` (required). If the MIME type is unknown, we can use MIME type's own catch-all (`application/octet-stream`). MIME type also has a way to provide text-vs-binary and encoding.
- `/bytes/full/{path}`?

Is there any reason that an adapter can't define structure-family = bytes but their own mime-type? It's hard for my brain to escape the notion that specific mime types could be very useful to clients.
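As a rough illustration of the MIME type points above (the catch-all, text-vs-binary, and encoding), the sketch below parses a declared MIME type with Python's standard `email.message.Message`. The helper name `parse_mimetype` and the returned dictionary layout are assumptions made for this example; nothing here is part of Tiled.

```python
from email.message import Message

# MIME's own catch-all for "arbitrary binary data".
CATCH_ALL = "application/octet-stream"


def parse_mimetype(declared=None):
    """Split a declared MIME type into type, text-vs-binary, and charset."""
    header = Message()
    # Fall back to the catch-all when no MIME type was declared.
    header["Content-Type"] = declared or CATCH_ALL
    return {
        "mimetype": header.get_content_type(),
        "is_text": header.get_content_maintype() == "text",
        "charset": header.get_content_charset(),  # None when not given
    }


unknown = parse_mimetype()
text = parse_mimetype("text/plain;charset=utf-8")
pdf = parse_mimetype("application/pdf")
```

A client could use `is_text` and `charset` to decide whether to decode the payload for display, and fall back to a plain download for the catch-all.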
I think we're on the same page. Compare to this array example, which has a `structure_family` (`"array"`) and a `structure` (see JSON below).
$ http https://tiled-demo.blueskyproject.io/api/v1/metadata/generated/small_image/ | jq .data.attributes.structure_family
"array"
$ http https://tiled-demo.blueskyproject.io/api/v1/metadata/generated/small_image/ | jq .data.attributes.structure
{
  "data_type": {
    "endianness": "little",
    "kind": "f",
    "itemsize": 8
  },
  "chunks": [
    [
      300
    ],
    [
      300
    ]
  ],
  "shape": [
    300,
    300
  ],
  "dims": null,
  "resizable": false
}
This proposal is that the `structure_family` would be `"bytes"` and the `structure` would be `{"mimetype": "...", "length": N}`.
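A minimal sketch of what that proposed structure could look like as a Python object, assuming a plain dataclass. The class name `BytesStructure` and the helper `describe_bytes_node` are hypothetical and only mirror the `{"mimetype": "...", "length": N}` shape discussed above; they are not existing Tiled code.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BytesStructure:
    """Hypothetical structure for a node in the proposed "bytes" family."""

    mimetype: str
    length: int  # number of bytes

    def __post_init__(self):
        if self.length < 0:
            raise ValueError("length must be non-negative")
        if "/" not in self.mimetype:
            raise ValueError("mimetype must look like 'type/subtype'")


def describe_bytes_node(payload, mimetype="application/octet-stream"):
    """Build the structure_family/structure pair for an opaque payload."""
    structure = BytesStructure(mimetype=mimetype, length=len(payload))
    return {"structure_family": "bytes", "structure": asdict(structure)}


description = describe_bytes_node(b"%PDF-1.7", "application/pdf")
```

Making both fields required keeps the description small while still giving a client enough to choose a viewer and allocate a buffer.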
I'd be interested in drafting a PR for this, along with some follow up discussions.
It would be great to have a companion for this. @jmaruland are you interested in working on this together?
@padraic-shafer Yes, I would love to. I worked on a very similar issue a while ago when we were trying to move away from JSONSchema models to Pydantic models. It will be fun to revisit this topic.
Fantastic! I'll find a time later this week for us to discuss where to start, and how to proceed.
Follow-up thoughts here:
We already plan to add a route for accessing underlying files, discussed in #473, something like `/asset/{id}`. The route we want for this issue has the same meaning: "Get me the underlying file." It should probably be the same route.
Keeping in mind the rule, "Illegal or nonsensical states should be unrepresentable," I think we may not want to put the mimetype in the `structure` because it's already in the `data_source`:
$ http :8000/api/v1/metadata/example?show_sources=true 'Authorization:Apikey secret' | jq .data.attributes.data_sources
[
  {
    "id": 2,
    "structure": {
      "data_type": {
        "endianness": "little",
        "kind": "i",
        "itemsize": 8
      },
      "chunks": [
        [
          3
        ]
      ],
      "shape": [
        3
      ],
      "dims": null,
      "resizable": false
    },
    "mimetype": "application/x-zarr",
    "parameters": {},
    "assets": [
      {
        "data_uri": "file://localhost/tmp/tmpp0dp686u/data/example",
        "is_directory": true,
        "id": 2
      }
    ],
    "management": "writable"
  }
]
And there is space for a `size` under `"assets"`. It's in the SQL database, just not exposed in the API yet. Maybe better to just refer to those as the truth and let the `structure` be `null`, same as it is for `"container"` structure family.
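To illustrate treating the data source as the single source of truth, here is a small sketch that derives the MIME type and asset locations from a `data_sources` list shaped like the JSON above. The function name `summarize_sources` is made up for this example, and the `"size"` key is an assumption: as noted, the size is not exposed in the API yet, so the sketch reports `None` when it is absent.

```python
def summarize_sources(data_sources):
    """Collect mimetype and asset details from a data_sources list."""
    summaries = []
    for source in data_sources:
        summaries.append(
            {
                "mimetype": source["mimetype"],
                "data_uris": [asset["data_uri"] for asset in source["assets"]],
                # Assumption: a future "size" field on each asset; None if missing.
                "sizes": [asset.get("size") for asset in source["assets"]],
            }
        )
    return summaries


example_sources = [
    {
        "id": 2,
        "structure": None,  # as proposed for the bytes family
        "mimetype": "application/pdf",
        "parameters": {},
        "assets": [
            {"data_uri": "file://localhost/tmp/report.pdf", "is_directory": False, "id": 2}
        ],
        "management": "writable",
    }
]
summary = summarize_sources(example_sources)
```

With this arrangement the MIME type lives in exactly one place, so a `structure` and a `data_source` can never disagree about it.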
This is another idea that arose during the NIST visit.
Tiled's data model constrains everything to be one of its recognized structure families (array, dataframe, sparse, node) or JSON-encodable metadata sitting alongside one of those types. There will be cases where there is binary (not JSON-encodable) information that is relevant and that some client programs will know what to do with.
Our line on this so far has been, "If you have files, use a static file server or Globus or another file-based solution, and link to that from the metadata in Tiled." And for cases where you have a lot of un-structured data (directory of PowerPoint documents or PDFs) I think that's the right call. But John Henry at NIST articulated a compelling argument that it is useful to enable Tiled to carry binary data in-line when it's useful.
I think this would take the shape of a new structure family, perhaps `opaque_bytes`. Its structure would simply be a length, and it would be sliceable by byte range. Tiled would not be able to transcode it, only send it in its original representation as bytes. Any context necessary to interpret the bytes would have to be known a priori to a client. The (JSON-encodable) `metadata` attached to the `opaque_bytes` node may provide helpful information in this regard for a client, but it would be "opaque" to Tiled itself.

We are all in agreement that if you have mostly unstructured / opaque data, then Tiled is not adding value and you should just use a static file server. But if you have a little unstructured / opaque data and you want to place it logically alongside structured data, there is an argument that Tiled should enable this.
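The `opaque_bytes` idea described above (a structure that is just a length, plus slicing by byte range) can be sketched in a few lines. The class `OpaqueBytesNode` and its `read` method are hypothetical names for this in-memory illustration and do not correspond to any existing Tiled adapter.

```python
class OpaqueBytesNode:
    """Hypothetical in-memory stand-in for an opaque-bytes node."""

    def __init__(self, payload, metadata=None):
        self._payload = bytes(payload)
        # JSON-encodable hints for clients; not interpreted by the server.
        self.metadata = dict(metadata or {})

    @property
    def structure(self):
        # The only thing known about the data is its length in bytes.
        return {"length": len(self._payload)}

    def read(self, start=0, stop=None):
        """Return the bytes in the half-open range [start, stop)."""
        length = len(self._payload)
        stop = length if stop is None else stop
        if not 0 <= start <= stop <= length:
            raise ValueError(f"byte range {start}:{stop} outside 0:{length}")
        return self._payload[start:stop]


node = OpaqueBytesNode(b"%PDF-1.7 example", {"hint": "PDF report"})
header = node.read(0, 4)
```

The server never inspects the payload here; any interpretation relies on what the client already knows or on the hints in `metadata`.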