bluesky / event-model

data model for event-based data collection and analysis
https://blueskyproject.io/event-model
BSD 3-Clause "New" or "Revised" License
13 stars 30 forks source link

Reviewing StreamResource and StreamDatum from the storage, access perspective #296

Open danielballan opened 7 months ago

danielballan commented 7 months ago

During the discussions that developed StreamResource and StreamDatum documents, the "producer" side was being built, but much of the "consumer" side remained to be developed. I think we agreed to stay open to adjustments when the time came to build out the consumer side. I hope that window is still open.

Here's a StreamResource from DLS I22, courtesy of @DiamondJoseph:

  {
    "name": "stream_resource",
    "doc": {
      "uid": "158167d9-796b-4e12-afff-5b3dd903815c",
      "data_key": "waxs-sum",
      "spec": "AD_HDF5_SWMR_SLICE",
      "root": "/dls/i22/data/2023/cm33873-5",
      "resource_path": "i22-723667-waxs-hdf.h5",
      "resource_kwargs": {
        "path": "/entry/instrument/NDAttributes/StatsTotal",
        "multiplier": 1,
        "timestamps": "/entry/instrument/NDAttributes/NDArrayTimeStamp"
      },
      "path_semantics": "posix",
      "run_start": "617184e5-5e24-40b1-80f7-0056d51e44f8"
    }
  }

I can see nothing to change about these:

I might suggest re-evaluating that names of:

In Tiled, we use MIME types, including standard ones such as image/tiff and custom ones named like application/x-hdf5-smwr-slice, in the role of spec. Maybe it's worth considering whether we want to use mimtype here, as it is a recognized standard way to spell, "This is a format, which tells you which I/O code to read it."

Taking a broader-than Python look at the world, resource_kwargs (emphasis on "kwargs") is a bit of a Python-ism. These are JSON-serializable parameters that are needed by the code that reads the data. This code does not necessary have to be in Python. In Tiled, the analogous entity is simply called parameters.

The existing names aren't doing any harm, so we have to weigh the pain of change against the marginal benefit of maybe-better names. If we leave them as is, I think that would be fine. Just worth considering.

The way we spell the location of the data seems more problematic:

I think we should consider replacing all of this with a URI, like file://localhost/dls/i22/data/2023/cm33873-5/i22-723667-waxs-hdf.h5".

  1. We foresee data being available in protocols other than file:, such as s3:.
  2. The partitioning of the path between root and resource_path is a guess about what parts of the path will be stable over time and what parts may change as the data moves. Different people make different guesses, and at NSLS-II this has been a mess.
  3. IIRC, path_semantics is completely vestigial, based on a misunderstanding that Windows required backslashes in paths. In fact, anything newer than Windows 95 is fine with forward slashes.

In Tiled, the same logical resource may be available in multiple places. This is expressed as a one-to-many relation between logical datasets and URIs (effectively). If the StreamResource just had a URI, this would be simpler and more future-proof.

Here is a StreamDatum:

  {
    "name": "stream_datum",
    "doc": {
      "stream_resource": "158167d9-796b-4e12-afff-5b3dd903815c",
      "uid": "158167d9-796b-4e12-afff-5b3dd903815c/2",
      "seq_nums": {
        "start": 6,
        "stop": 7
      },
      "indices": {
        "start": 5,
        "stop": 6
      },
      "descriptor": "8ff32942-9b21-4542-aee1-16c04291950d"
    }
  }

The descriptor is on StreamDatum instead of on StreamResource with data_key. I think it's worth revisiting whether we really really need that, because life would be simplify if the descriptor and data_key were together on the same document.

tacaswell commented 7 months ago

I'm agnostic to the key name changes.

Adding the file: vs s3: vs ... should be safe to do in a back-compatible way if we assume that anything with out a protocol is file:. I think this can be done independent of any other changes.

I'm hesitant to give up all information about which parts of the uri are mounting / hosting details and which are "real". Although this came from a time when we had the same filesystems mounted under very different paths on different machines with much higher regularity, having the ability to "make the path relative" (for lack of a better term) seems generically useful (maybe we should have gone with a single integer rather than two parts of the path). However, given the growing usage of containers and the trivial ability to be re-mount any level of path on the host to any level of path in the container it may make sense to lift all of this up to be a client configuration problem (which opens the door to much more complex re-writing rules).

"it has not been well used" is not a compelling argument to remove it (it is a compelling argument for better docs!), but "it does not actually solve the problem" is.

callumforrester commented 7 months ago

I'm also agnostic to the key name changes, maybe leaning slightly in the direction of pro, especially for mimetype and general reduction of the specialness of the documents.

I'll hold my hands up and say I hadn't noticed that stream_datum referenced descriptor. I agree that's not necessarily needed. If we move it to stream_resource do we need to keep the reference to run_start as well?

I agree with @tacaswell that the advent of containers is likely to make different path mounting more well-used. In fact I strongly encourage it because baked in assumptions of multiple machines using the same mount point has bitten me many times!

Also happy with the file:// and s3:// idea. If I were designing from scratch I might make it more structured

{
    "protocol": "file",
    "location": "/data/foo/bar"
}

But if :// is better for backwards compatibility then so be it :)

danielballan commented 7 months ago

Exactly. The existing names aren't bad, but if we can use widely-known names and standards, developers don't even have to notice them.

Let's talk about the location of descriptor on a future call. I think we should keep run_start either way, to effectively save on a JOIN (a literal JOIN in SQL, a metaphorical JOIN in MongoDB...).

I like structure things, but if we care about covering s3: and others, we'd need to cover the full URL spec, as in:

{'scheme': 'file',
 'netloc': 'localhost',
 'path': '/a/b/c',
 'params': '',
 'query': '',
 'fragment': ''}

and I think the string form is probably the easiest thing to pass around in documents. Not a strongly-held opinion, though.

callumforrester commented 7 months ago

I'm not too attached to structure either but I don't think the full URL spec is a terrible thing to pass around. I think it might be best to see if anyone does hold strong opinions either way.

danielballan commented 7 months ago

That is fair…and making me wonder whether Tiled should store the URI in a structured form at rest. That could enable indexing and more efficient filtering by scheme or netloc, which could come up…

danielballan commented 7 months ago

From discussion:

tacaswell commented 7 months ago

If we move the descriptor to StreamResource that implies that we can have multiple StreamResource with overlapping keys for the same stream to support changing the configuration mid-stream. I suspect that is going to bring a bunch of complications down the line a well.

danielballan commented 7 months ago

Even so, think it is the lesser evil. We can discuss. There is a tension here between what is easy for the producer and what is simpler (and performant) for the consumer.

callumforrester commented 7 months ago

I must confess to getting a bit lost among text descriptions of relational hierarchies, so I drew it out as a sanity check, does this look right?

Top is current state, bottom is proposed change: event-model-changes

Also tagging @coretl

coretl commented 7 months ago

If we move the descriptor to StreamResource that implies that we can have multiple StreamResource with overlapping keys for the same stream to support changing the configuration mid-stream. I suspect that is going to bring a bunch of complications down the line a well.

@tacaswell we already reduced the scope of StreamResource so that it only points to a single data_key, so I don't think keys could overlap? We also already need multiple StreamResource documents per Descriptor because we emit a new one per HDF file, and pausing/resuming is planned to start writing into a new file.

I don't think I have many opinions either way here, both options are the same complexity to produce in ophyd-async, and not much different in the run bundler, so happy with either.

coretl commented 7 months ago

Also tagging @DiamondJoseph for comment

danielballan commented 7 months ago

Filling in the motivation for my terse notes above:

The Tiled adapter that loads Bluesky documents has to chase a lot of references, illustrated in Callum’s diagram, to answer the simple question, “What data do you have?” The overhead is especially unfortunate when it is asked to return pages of search results describing ~300 datasets at a time.

Placing descriptor and data_key in separate documents sets us up for more of this. It is maximally-normalized, but adds complexity and a JOIN. I don’t feel the trade-offs have been fully explored, but I think we should examine whether a little de-normalization (reissuing StreamResource for each Descriptor) could be worthwhile here.

callumforrester commented 7 months ago

Also also tagging @garryod

DiamondJoseph commented 7 months ago

Discussed with TC just before lunch, we have assumptions about Descriptors which motivate keeping Descriptor reference on StreamDatum.

If a new Descriptor being emitted also emitted a new Resource: requires that the Device know when the Descriptor changes (for the DatumFactory), or that the Datums are modified to point to the correct Resource (although I suppose currently they have their Descriptor injected?)

danielballan commented 7 months ago

keeping Descriptor reference on StreamResource

Do you mean keeping the Descriptor reference on the StreamDatum, or moving it to the StreamResource? I think you mean the former, just checking.

tacaswell commented 7 months ago

We also already need multiple StreamResource documents per Descriptor

Fair, so in a given run we are going to have multiple StreamResource per key anyway. Once we have that, I don't the the additional complication of "and they may point to different descriptors" adds much extra pain.

To be clear:

I think the following set of documents is valid (assuming only Stream* documents:

- Run (start document):
  - primary stream (synthetic concept)
    - descritptor 0 (document)
      - keyA (synthetic)
        - StreamResource0 keyA (document)
          - StreamDatum0 (document))
          - StreamDatum1 (document)
          - StreamDatum2 (document)
          - StreamDatum3 (document)
        - StreamResource2 keyA (document)
          - StreamDatum7 (document))
          - StreamDatum8 (document)
      - KeyB (synthetic)
        - StreamResource1 KeyB (document)
          - StreamDatum4 (document)
          - StreamDatum5 (document)
          - StreamDatum6 (document)
    - descritptor 1 (document)
      - keyA (synthetic)
        - StreamResource3 KeyA (document
          - StreamDatum9 (document))
          - StreamDatum10 (document)
      - KeyB (synthetic)            
        - StreamResource4 KeyB (document
          - StreamDatum11 (document))
          - StreamDatum12 (document)          
  - second stream (synthetic concept)
    - descritptor 2 (document)
      - KeyC (synthetic)            
        - StreamResource5 keyC (document
          - StreamDatum13 (document))
          - StreamDatum14 (document)

That is we have two streams, one with A and B, one with C. In the primary stream we have two descriptors (because we changed config or something) and with in the first descriptor we had to generate a second StreamResource.

I added levels of nesting for "stream" and "key" which are emergent / synthetic concepts....


Discussed with TC

TC is not unambiguous 😜

danielballan commented 7 months ago

Maybe each TC should pick a UUID4 we can use.

DiamondJoseph commented 7 months ago

keeping Descriptor reference on StreamResource

Do you mean keeping the Descriptor reference on the StreamDatum, or moving it to the StreamResource? I think you mean the former, just checking.

Yep, just completely said the opposite of what I meant. Edited the comment for posterity.

Discussed with TC TC is not unambiguous 😜

Would you believe we had 3 Joes in the group before I joined Diamond? I've been explicitly Joseph for years, but it's become almost a necessity. Too used to people signing comments with just initials -JW.

padraic-shafer commented 6 months ago

[After reading through the schemas and the above discussion, most of my confusion has lifted. Here is what remains...]

@tacaswell The document snippet in your example is really helpful for visualizing a possible outcome.

In descriptor0 the keys have an unequal number of StreamDatums. I gather this was intentional to reflect a realistic scenario. Is there a constraint that the total number of events for KeyA must equal the total number of events for KeyB?...and that they have matching sequence numbers? So in this example, StreamDatum{4,5,6} would presumably have more events than most StreamDatums under KeyA.

coretl commented 6 months ago

In descriptor0 the keys have an unequal number of StreamDatums. I gather this was intentional to reflect a realistic scenario. Is there a constraint that the total number of events for KeyA must equal the total number of events for KeyB?...and that they have matching sequence numbers? So in this example, StreamDatum{4,5,6} would presumably have more events than most StreamDatums under KeyA.

It is enforced by the run bundler that each key must have the same number of events and they have matching sequence numbers. It is also currently enforced by the run bundler that each StreamDatum for each key must be the same length as those for the other keys in the same descriptor. This is an implementation detail, but could be made a guarantee if helpful...

padraic-shafer commented 6 months ago

Thanks, @coretl ! I think having it as an expectation (rather than a guarantee) is sufficient, and handled by the RunBundler makes sense.

So, I would then say that the details under descriptor 0 are technically valid, but should never be present in practice. There should be the same number of StreamDatums under Key A and Key B.

danielballan commented 6 months ago

A proposal developed with @callumforrester today:

Instead of specifying a file x/y/z mounted at /a/b/ as:

{
  "root": "/a/b/"
  "resource_path": "x/y/z"
}

how about

{
  "uri": {
    "scheme": "file",
    "path": ["a", "b", "x", "y", "z"],
    "netloc": "localhost"
    "query_params": [],
  }
  "facets": [
    {"segments": {"start": 0, "end": 2}, "feature": "root"}
  ]
}

That is, the URI is given as an object, with a side band ("facets") explaining the semantics of certain segments of the path. This enables:

I learned about a "facets" comment from a UI dev working for the other Bluesky, explaining why Bluesky posts don't just use Markdown. The requirements resonate with ours. A developer can freely ignore the facets if they don't (and don't want to) understand them, but they are can be used to give extra meaning to the path.

danielballan commented 6 months ago

For completeness I should mention that @tacaswell originally considered something like:

{
  "resource_path": "/a/b/x/y/z",
  "separator": 2
}

which, similarly uses a side-band. The "facets" proposal above is more flexible in that it can describe any segment and more than one.

garryod commented 6 months ago

Coming late to the party here, but I've got a few thoughts from a consumer's perspective...

Using URIs to specify resource locations is ideal but they should be encoded as described in IETF RFC 3986 (i.e. a string) - encoding them as any kind of struct will require users to write custom deserialization code when Url::from_string exists in all popular languages.

Absolute file paths are impossible where shared file systems are concerned as the file system can - and will be - be mounted at an arbitrary path. As such it only makes sense to use relative paths when describing the location of files and have the base directory agreed out of band - or perhaps with an additional field dedicated to this purpose though this has us defining a higher level protocol for resolving file systems.

danielballan commented 6 months ago

I like that argument for URI as string.

danielballan commented 6 months ago

I'm really glad you chimed in @garryod. Musing on:

Absolute file paths are impossible where shared file systems are concerned as the file system can - and will be - be mounted at an arbitrary path.

...that is true at NSLS-II as well (and has been a source of many headaches).

From RFC 8089, the "file" URI scheme:

The "host" is the fully qualified domain name of the system on which the file is accessible. This allows a client on another system to know that it cannot access the file system, or perhaps that it needs to use some other local mechanism to access the file.

with examples including file://host.example.com/path/to/file

I wonder if we can go with URIs like:

file://data1.nsls2.bnl.gov/chx/proposals/2024-1/pass-333333/assets/image00001.tiff
file://nfs1.diamond.ac.uk/i22/data/2023/cm33873-5/i22-723667-waxs-hdf.h5
file://localhost/a/b/c.h5  # equivalent to the shorthand, file:///a/b/c.h5

For testing, dev, and simple deployments with local storage, we can just use localhost with the local absolute path. For facility-scale deployments, we can, as @garryod said,

use relative paths when describing the location of files and have the base directory agreed out of band

The out of band mechanism could, for example, be local configuration mapping the domain in the URI to the local mount point, e.g.

mounts:
- domain: data1.nsls2.bnl.gov
  root: /nsls2/data1

I like that this would use existing standards, a single string URI, and no additional fields in the document that would require explanation/documentation.

garryod commented 6 months ago

From RFC 8089, the "file" URI scheme:

The "host" is the fully qualified domain name of the system on which the file is accessible. This allows a client on another system to know that it cannot access the file system, or perhaps that it needs to use some other local mechanism to access the file.

Hadn't come across this before, but in the absence of any better suggestion it seems like the way to go even if it's only currently a Proposed Standard.

The out of band mechanism could, for example, be local configuration mapping the domain in the URI to the local mount point, e.g.

mounts:
- domain: data1.nsls2.bnl.gov
  root: /nsls2/data1

Feels like there should be an of the shelve way of doing this - like the /etc/hosts file for IPs and domain names - but a quick google turned up nothing. Has anyone come across something that would fulfil this role?

danielballan commented 6 months ago

I can't find anything either. Since this aspect is read-time configuration, not part of the document stream or storage at rest, it would be comparatively easy to change it later. The file URI aspect is the part we really need to get right from the start.

tacaswell commented 6 months ago

...that is true at NSLS-II as well (and has been a source of many headaches).

This is exactly the reason why we have root in the resource/datum and root_map on databroker to begin with.

I wonder if we can go with URIs like:

I hope we can do that re-writing at the "I'm consuming documents and pushing them into tiled" stage and not at the "I generated these in ophyd" stage as I do not think we want to inject the details of mounting into the ophyd devices (as reflecting on how root/root_map has worked at NSLS-II the main problem is we lost control of the root being set correctly (because it was done at the ophyd layer on each device in each profile)).

danielballan commented 6 months ago

The root + resource_path formulation effectively uses the local mount point (e.g. /GPFS) to identify a given networked storage volume. This is potentially ambiguous. In the bad old days, there were examples on the floor of the same mount point on two different hosts pointing to two different storage clusters. There was no valid root_map solution to that, and we had to write some custom spaghetti to deal with certain situation.

Using the hostname of the storage array is unambiguous.

When possible, I think it would be best to set the URI from the start in ophyd, so that streaming consumers that happen to be on different hosts (and may have a different view of the storage) can work with it. If the engineer configuring ophyd just hands us an absolute path, we can emit file://localhost/{path}. A consumer could potentially flag that, or even rewrite based on examining the {path].

danielballan commented 6 months ago

Offline, @tacaswell argue that if ophyd puts the storage host in the URI, and then storage changes, we have to update all the "clients" (ophyd config). I think he's convinced me that ophyd should just emit localhost URIs and that some downstream consumer, centrally managed, should rewrite them.

But there is room for variability here: this is just an argument at the level of what to recommend.

garryod commented 6 months ago

I think he's convinced me that ophyd should just emit localhost URIs and that some downstream consumer, centrally managed, should rewrite them.

I may be missing something here, but how will the downstream consumer (tiled?) know what to rewrite them to without knowledge of how filesystems are mounted on each and every machine a ophyd device runs on? Or is the thought that this is less ownerous than having each ophyd device configured with this?

danielballan commented 6 months ago

Or is the thought that this is less ownerous than having each ophyd device configured with this?

I think that's the idea, yeah. I'm not 100% sold either way on this.

danielballan commented 6 months ago

I'd like to try to converge on a proposal:

Example:

  {
    "name": "stream_resource",
    "doc": {
      "uid": "158167d9-796b-4e12-afff-5b3dd903815c",
      "data_key": "waxs-sum",
      "mimetype": "application/x-hdf5",
      "uri": "file://gpfs.diamond.ac.uk/dls/i22/data/2023/cm33873-5/i22-723667-waxs-hdf.h5",
      "parameters": {
        "path": "/entry/instrument/NDAttributes/StatsTotal",
        "swmr": true,
      },
      "run_start": "617184e5-5e24-40b1-80f7-0056d51e44f8"
    }
  }

In the end, I believe no one is advocating for changed to stream datum. That would remain as:

  {
    "name": "stream_datum",
    "doc": {
      "stream_resource": "158167d9-796b-4e12-afff-5b3dd903815c",
      "uid": "158167d9-796b-4e12-afff-5b3dd903815c/2",
      "seq_nums": {
        "start": 6,
        "stop": 7
      },
      "indices": {
        "start": 5,
        "stop": 6
      },
      "descriptor": "8ff32942-9b21-4542-aee1-16c04291950d"
    }
  }

Out of scope for this stream is whether to rename this document (as @coretl proposed). That would affect the stream_resource key. Let's leave that for a separate GH issue, if we take it up.

DiamondJoseph commented 6 months ago

I'm enthusiastic about removing path_semantics, renaming spec and resource_kwargs. I'm on board with renaming root (our devices should always be mounting either /device/data = /localhost/dls/ixx/data or /device/dls/ixx/data = /localhost/dls/ixx/data, so we avoid your bad old days by having our own different bad old days, which we're still in).

I'm opposed to renaming stream_resource and stream_datum: I think it's clear (to me) what they mean and how they relate to existing documents, without being easily confused with them.

prjemian commented 6 months ago

Help me to avoid losing focus here. localhost refers to the workstation running the event-model process? Not the IOC that writes the image?

I'm thinking of area detectors with their own filesystem that are not mounted on the host where the event-model process is running.

tacaswell commented 6 months ago

Or is the thought that this is less ownerous than having each ophyd device configured with this?

I think the actual problem with the current scheme is that we left it up to the ohpyd classes where to do the root / resource path split (and to do the windows <-> posix remapping) and in general it was in the wrong place so if we move to using a uri but don't change where it is configured I do not think we will do anything other than change the spelling (if you use root_map to convert root -> hostname and tack on file:// you can get a uri).

If we continue let ophyd/RE hold the mapping (but spell it with in-band encoding) then the following story will happen:

Not exactly the same problems we have now, but I think they will be terrible in about the same ways.

Hence, I think the mapping from file://localhost/absolute/path to file://data.host.dns.entry/relative/path either needs to be done "at the edge" by the code that writes these documents down to persistent storage (as if everything is streaming I think it is reasonable to expect that either you have consistently mounted your network storage or the client who has storage mounted differently know how to do the transformation) so that we only have to maintain a shadow copy of the IOC fstab(s) in one place or it the AD IOCs need to provide a PV that provides the root -> host mapping generated by reading fstab (or the file shuffling service or what ever). If we are being really ambitious a PV that gives the uri directly (rather than the absolute filename from the point of view of the IOC) that would be even better.

We need to make sure that where ever we land on this we don't make the low-infrastructure use case (dev, testing, small deployments) does not get needlessly complicated.

danielballan commented 6 months ago

@prjemian Good point of clarification! I think we can stipulate that if ophyd hands us a localhost URI, it means localhost from the point of view of the process that is writing the file (i.e. the IOC).

It's yet to be settled in the discussion where/how we should recommend that this gets mapped into a portable representation (file://data.host.dns.entry/relative/path).

callumforrester commented 6 months ago

@tacaswell This is exactly why we can never buy data2 in our current scheme (GDA et al.)

coretl commented 6 months ago

How about ophyd produces URIs like file://ioc-hostname/absolute/path? That is knowable from CA/PVA, and would at least tell you who wrote the file at that location...

tacaswell commented 6 months ago

file://ioc-hostname/absolute/path seems like a good middle ground to me. No extra PVs and a service that persists the documents can own the IOC -> mount path (so we only have 1 shadow copy of fstab) can reliably fix it with no heuristics / looking at values in other documents. Down side is that it assumes your IOCs have dns names... are uris like file://ip.ad.res.ss/absolute/path allowed? Or should we just not take the case of "I am running a big facility but don't use DNS" seriously?

My understanding of localhost in this context is "YOLO, hope you have consistent mounts 🙃 " like passing around a "bare" /absolute/path now.

coretl commented 6 months ago

All our IOCs are in DNS, but the vxWorks ones don't know anything about it... So we can get IP address out of cainfo, reverse lookup the name in DNS, then put that in the URI. Probably not the end of the world if the entry is not in DNS and IP addresses get into the URI though?

danielballan commented 6 months ago

Yes, an IP address in the URI is fine.