dev - Githubissues

kleinhenz commented 2 years ago

This pull request represents the current state of the staging server. Opening this up now to get some feedback. In particular see aimmdb/aimmdb/models.py for the pydantic models used for validating documents in the database and aimmdb/aimmdb/tree.py for the implementation of the tiled tree.

Below description copied from documentation in ingest/ingest.ipynb:

The data is stored in three collections: measurements, samples, and tree.

The measurements collection stores the individual XAS measurements. Documents in this collection follow the model aimmdb.models.XASMeasurement. Each measurement contains a sample_id field which indexes into the samples collection.

The samples collection stores data about the samples that the measurements are performed on (e.g. the composition). Documents in this collection follow the model aimmdb.models.Sample. Note that one sample can have many measurements associated with it. As mentioned above, this one-to-many association is modeled via the sample_id field of the documents in the measurements collection.

The tree collection stores a hierarchical layout of the data suitable for browsing via tiled. Documents in this collection follow the model aimmdb.models.Node. The tree structure is modeled using the materialized paths mongo pattern. For example, to query the nodes directly below the /core path one can use db.tree.find({"path": {"$regex": "^/core/[^/]*$"}}). The leaf nodes in the tree carry a data_id key which indexes into the measurements collection where the actual data is stored.
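To make the materialized-paths pattern concrete, here is a minimal sketch that mimics the regex query on an in-memory list rather than a live MongoDB. It assumes each node stores its own full path in `path`; the sample documents and the `find_paths` helper are hypothetical and not part of aimmdb. The pattern quoted above stops at one path segment, while a plain prefix match returns the whole subtree.

```python
import re

# Hypothetical documents. The field names "path", "data_id" and "sample_id"
# come from the description above; the values are made up for illustration.
tree = [
    {"path": "/core"},
    {"path": "/core/Al2O3-1"},
    {"path": "/core/Al2O3-1/O-K", "data_id": "m1"},
    {"path": "/NCM"},
    {"path": "/NCM/BM_NCM622"},
    {"path": "/NCM/BM_NCM622/Ni-L3-pristine-1", "data_id": "m2"},
]
measurements = {"m1": {"sample_id": "s1"}, "m2": {"sample_id": "s2"}}


def find_paths(nodes, pattern):
    """Mimic db.tree.find({"path": {"$regex": pattern}}) on a list of dicts."""
    rx = re.compile(pattern)
    return [n["path"] for n in nodes if rx.search(n["path"])]


# Direct children of /core: exactly one more path segment.
children = find_paths(tree, r"^/core/[^/]*$")

# All descendants of /core: any number of further segments.
descendants = find_paths(tree, r"^/core/")

# A leaf node dereferences into the measurements collection via data_id.
leaf = next(n for n in tree if n["path"] == "/core/Al2O3-1/O-K")
measurement = measurements[leaf["data_id"]]
```

Because both patterns are anchored at the start of the string, a regular index on `path` can typically serve them as prefix queries.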

kleinhenz commented 2 years ago

To write down my thoughts on this in a little more detail, I think:

1) tiled is built around the idea of serving a tree.
2) For a database with any complexity there will not be a canonical tree layout for that data. You can imagine wanting a completely flat structure with interaction built around search, or a deeply nested structure which corresponds to one particular choice of groupings. Both of these extremes are perfectly defensible and have different associated tradeoffs.
3) Because of 1) and 2), we should think about the tiled tree as a view of the data rather than its canonical representation.
4) This separation should exist at the database level in the form of a tree collection storing a particular tree view of the data, with references to a data collection which holds the data itself. This allows us to easily create and destroy multiple different tree views without touching the underlying data, which has no inherent tree structure.
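A rough sketch of point 4, assuming a flat data collection whose documents carry the fields used for grouping. The field names `category`, `sample` and `measurement` and the `build_tree_view` helper are hypothetical. The same data yields a nested view or a flat view, and neither touches the data itself.

```python
# Hypothetical flat "data" collection; it has no inherent tree structure.
data = [
    {"_id": "m1", "category": "core", "sample": "Al2O3-1",
     "measurement": "O-K"},
    {"_id": "m2", "category": "NCM", "sample": "BM_NCM622",
     "measurement": "Ni-L3-pristine-1"},
]


def build_tree_view(docs, keys):
    """Return tree-collection documents for one choice of grouping keys.

    Interior nodes carry only a path; leaves also carry a data_id that
    points back into the data collection.
    """
    nodes = {}
    for doc in docs:
        segments = [doc[k] for k in keys]
        for depth in range(1, len(segments) + 1):
            path = "/" + "/".join(segments[:depth])
            nodes.setdefault(path, {"path": path})
        # After the loop, `path` is the full leaf path for this document.
        nodes[path]["data_id"] = doc["_id"]
    return sorted(nodes.values(), key=lambda n: n["path"])


# Two different views over the same, unmodified data.
nested = build_tree_view(data, ["category", "sample", "measurement"])
flat = build_tree_view(data, ["_id"])
```

Dropping a view amounts to deleting its tree documents; the data collection is never rewritten.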

This is basically what is implemented in this pull request. I feel happier with this than I did with my previous efforts which mixed tiled's representation of the data with the underlying data in a way that felt very messy.

danielballan commented 2 years ago

I like this line of thinking. Some thoughts....

Why is Tiled built around trees? Is that necessary? Is it good?

The tiled serve directory ... use case requires a tree structure to mirror the tree structure of file systems.
Many of the structures we support, including xarray.Datasets, BlueskyRuns, and h5py Groups, have a tree structure.
URLs in an API typically have a structure that I would describe as tree-like, as in /a/b/x, /a/b/y, /a/b/z.

So I think we are committed to trees in the design.

Use cases

Let's talk about the use cases we know about.

Bluesky

A typical path looks like /csx/raw/{uid}/primary/data/data_vars/I0/variable. The node /csx/raw is special: that's the one that can do MongoDB-backed queries. The levels of nesting above that are static (they come from the config file) and the levels below that represent the internal structure of a Bluesky Run.

In https://github.com/bluesky/tiled/issues/158 I proposed moving from

/node/metadata/csx/raw/{uid}

to

/node/metadata/csx/raw/_/{uid}/...

to make room for "virtual" trees that resolve as searches on the backend, like

/node/search/csx/raw/proposal/{proposal_id}/

which would be equivalent to

/node/search/csx/raw?filter[proposal][condition][id]={proposal_id}

The search results returned by that endpoint would have self-links into /node/metadata/csx/raw/_/{uid} such that the _ tree, keyed on uid, is the canonical tree (the one corresponding to the flat database collection) and any other trees like /csx/raw/{query_type}/... return results that link into it. That is, the items returned by /node/search/csx/raw/{query_type} would have links like {"self": "/node/metadata/csx/raw/_/{uid}"}. You would never see /node/metadata pointing at /csx/raw/{query_type}/... because that tree is only a virtual tree of search results, not the home of any node.
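As an illustration only, the equivalence between a virtual path and a filter query could be sketched like this. Both helper names are hypothetical, and generalizing the `filter[...][condition][id]` form beyond the `proposal` example is an assumption; none of this is an existing tiled API.

```python
def virtual_path_to_query(path, base="/node/search/csx/raw"):
    """Rewrite <base>/<query_type>/<value>/ as the equivalent filter URL.

    Assumes exactly one query_type/value pair after the base path.
    """
    if not path.startswith(base + "/"):
        raise ValueError(f"{path!r} is not below {base!r}")
    rest = path[len(base):].strip("/")
    query_type, value = rest.split("/")
    return f"{base}?filter[{query_type}][condition][id]={value}"


def self_link(uid, base="/node/metadata/csx/raw"):
    """Build the self-link of a search result into the canonical `_` tree."""
    return {"self": f"{base}/_/{uid}"}


url = virtual_path_to_query("/node/search/csx/raw/proposal/12345/")
link = self_link("abc")
```

Here `12345` and `abc` are placeholder values standing in for `{proposal_id}` and `{uid}`.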

Files

If we ever want to support fast search for the tiled serve directory ... use case, we'll need to crawl the files and cache their metadata in a proper database. In this case I think the natural unique ID is the full path to a given file. One could imagine a URL structure where the canonical tree is addressed like

/node/metadata/_/literal/path/to/file

and virtualized trees (search results) look like

/node/search/element/Ni

"Ingested" data

If at ingestion time we give everything a unique ID (UUID4?) then we can give it a canonical home like

/node/metadata/_/{uuid}

and virtualized trees like

/node/search/element/Ni

that return items like {..., "links": {"self": "/node/metadata/_/{uuid}"}}.
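A minimal sketch of that idea, with a dict standing in for the database. The `store`, `ingest` and `search_element` names and the `element` metadata key are hypothetical.

```python
import uuid

store = {}  # stands in for the single flat collection, keyed on uuid


def ingest(metadata):
    """Assign a fresh UUID4 and return the document's canonical path."""
    key = str(uuid.uuid4())
    store[key] = {"metadata": metadata}
    return f"/node/metadata/_/{key}"


def search_element(element):
    """Emulate /node/search/element/<element>.

    Results only ever link into the canonical `_` tree.
    """
    return [
        {"links": {"self": f"/node/metadata/_/{key}"}}
        for key, doc in store.items()
        if doc["metadata"].get("element") == element
    ]


ni_path = ingest({"element": "Ni"})
ingest({"element": "O"})
results = search_element("Ni")
```

The virtual tree never becomes the home of a node: every result points back at the path that `ingest` returned.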

Off the cuff thoughts

We could do this with a single MongoDB collection. I like that that is (A) simple and (B) fast because it involves only a single database lookup. ("Joins" are not a thing in MongoDB.) Bluesky smears its data across multiple collections, and you really see that drag on performance. Keeping to one collection, at least in simple cases, should be preferred unless we have a strong reason to need more.
Producing a virtualized tree like /element/X involves a query, of course. If that query becomes slow, we have options. We could add an index to the single MongoDB collection on the relevant metadata keys. (Mongo lets you index on sub-keys, like metadata.common.Element. It also lets you do "sparse indexes", so it's OK if the key in question isn't defined on every document in the collection.) If that's not enough, we could add other collections, effectively acting as materialized views, playing the role of the "tree" collection proposed above.
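To show what a sparse index on a sub-key buys, here is a pure-Python stand-in. It is not how MongoDB implements indexes, and the helper names are hypothetical. With pymongo the real thing would be something like `collection.create_index("metadata.common.Element", sparse=True)`.

```python
from collections import defaultdict


def get_subkey(doc, dotted):
    """Follow a dotted key like "metadata.common.Element"; None if absent."""
    cur = doc
    for part in dotted.split("."):
        if not isinstance(cur, dict) or part not in cur:
            return None
        cur = cur[part]
    return cur


def build_sparse_index(docs, dotted):
    """Map each value of the sub-key to the matching _ids.

    Documents where the key is undefined are skipped, which is the point of
    a sparse index. For simplicity a stored None is treated as absent here.
    """
    index = defaultdict(list)
    for doc in docs:
        value = get_subkey(doc, dotted)
        if value is not None:
            index[value].append(doc["_id"])
    return dict(index)


docs = [
    {"_id": 1, "metadata": {"common": {"Element": "Ni"}}},
    {"_id": 2, "metadata": {"common": {"Element": "O"}}},
    {"_id": 3, "metadata": {}},  # no Element key: left out of the index
    {"_id": 4, "metadata": {"common": {"Element": "Ni"}}},
]
index = build_sparse_index(docs, "metadata.common.Element")
```

A lookup such as `index["Ni"]` then answers the /element/Ni query without scanning every document.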

kleinhenz commented 2 years ago

So one distinction that has come to my mind thinking about this recently is that trees are more like nested groupby operations than straightforward queries, because at each level the tree tells you the possible sub keys. For example, consider two paths in the current deployment: c["core"]["Al2O3-1"]["O-K"] and c["NCM"]["BM_NCM622"]["Ni-L3-pristine-1"]. These paths correspond to a grouping by (category, sample, measurement), and at each level you can see all possible options.

It's not clear to me how this kind of nesting and discovery would work in https://github.com/bluesky/tiled/issues/158. If I go to /node/search/element do I get a list of unique elements? I guess this is probably workable. But then what if I want to groupby a second field? I guess you could have a virtual tree that works like /node/groupby/(category, sample, measurement)/x/y/z but this is starting to look like a mess and I'm not really sure if you can do it performantly in a nice stateless way since the groupbys are not necessarily cheap.

So to me it seems valuable to be able to serve static non-trivial trees which provide a view corresponding to a particular set of groupings, which is what is implemented in this proposal. If you can think of a clean way of doing this all dynamically then that is probably nicer and more flexible, but I think it is pushing quite a bit into the tiled layer.

danielballan commented 2 years ago

Are these GROUPBY or are they a combination of SELECT … WHERE … and SELECT DISTINCT … WHERE …? With indexes on the relevant keys, I think those can be fast.
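One way to picture that combination, as a sketch over an in-memory list rather than a real database. The `LEVELS` field names and all helpers are hypothetical, and the third document is invented to give one level two options. Each level of the tree is a SELECT DISTINCT on the next key, filtered by the values already chosen.

```python
def select_where(docs, where):
    """SELECT * WHERE every key in `where` matches."""
    return [d for d in docs if all(d.get(k) == v for k, v in where.items())]


def select_distinct(docs, key, where):
    """SELECT DISTINCT key WHERE ...: the child names at one tree level."""
    return sorted({d[key] for d in select_where(docs, where) if key in d})


LEVELS = ["category", "sample", "measurement"]  # hypothetical field names


def list_children(docs, chosen):
    """List the options at the next level given the values chosen so far."""
    where = dict(zip(LEVELS, chosen))
    if len(chosen) == len(LEVELS):
        return select_where(docs, where)  # reached the leaves
    return select_distinct(docs, LEVELS[len(chosen)], where)


docs = [
    {"category": "core", "sample": "Al2O3-1", "measurement": "O-K"},
    {"category": "NCM", "sample": "BM_NCM622",
     "measurement": "Ni-L3-pristine-1"},
    {"category": "NCM", "sample": "BM_NCM622",
     "measurement": "Ni-L3-pristine-2"},
]

top_level = list_children(docs, [])
ncm_samples = list_children(docs, ["NCM"])
```

Only the branch being drilled into is ever queried, so no full group-by over the collection is needed.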

danielballan commented 2 years ago

I think the dict lookup like

c["NCM"]["BM_NCM622"]["Ni-L3-pristine-1"]

implies that there is a single privileged tree, same as a file-based tiled server would have, which could be encoded using ancestors as you proposed a couple weeks ago. Other orders of filtering would be available from Python using the search method.

kleinhenz commented 2 years ago

The user sees one privileged tree, but that tree is a separate construct from the underlying data, which makes it easier to experiment with different tree layouts compared to the situation where the tree structure is placed together with the data itself. I don't think having a separate tree collection should make this too slow. I'm not performing any aggregation, just doing single lookups into the data collection which happen when the data is read, but I don't know mongo that well so maybe it will be problematic.

You're right, I think it is a combination of SELECT ... WHERE ... and SELECT DISTINCT ... WHERE ..., since you don't need access to the groups you don't drill down into, so maybe this can be made dynamic and fast. It's still not clear to me what the API for something like this should look like, particularly in the nested case.

danielballan commented 2 years ago

That's pretty convincing. I think our ideas are at least roughly compatible, and I agree that putting the tree structure off to the side (in a separate collection) feels tidy.

I think there should be more discussion and consideration of implications before we really commit to a structure, but I think what you have here is self-consistent and at least in the proximity of where we'll ultimately land. For the sake of getting production aligned with the latest thinking, I move to merge this early next week.

AI-multimodal / aimmdb

dev #9
