SciCatProject / scicat-backend-next

SciCat Data Catalogue Backend
https://scicatproject.github.io/documentation/
BSD 3-Clause "New" or "Revised" License

Add ability to group datasets into collections of related datasets #805

Open mkywall opened 8 months ago

mkywall commented 8 months ago

Issue Name

Add ability to group datasets into collections of related datasets

Summary

It could be useful to have a data structure or class that sits between proposal ID and dataset that would allow datasets to be grouped into collections of related datasets. In our use case, we are using the proposal object for unique user proposals that would ultimately translate to a project or publication and the dataset object for sets of data resulting from a single process on a given machine. In many cases we expect to have an experiment in which many related datasets would be generated, for example Optical Electron Spectrometry, XPS, and TEM, where it would be useful to group the dataset objects together to form a distinct collection of data within the proposal / project.

sbliven commented 6 months ago

There was a nice diagram of this in the 2023-12-12 meeting notes: dataset_groups

sbliven commented 6 months ago

From the discussion it seemed that existing work-arounds won't work for you (keywords, derived Datasets, publishedData).

Adding a DatasetCollection seems possible. Some questions that come to mind:

  1. Who owns the collection? Can you make a collection of datasets you don't own? What other permissions are needed?
  2. Do collections have a PID like datasets? Can you retrieve a collection as a unit (presumably yes, this seems like a key feature)? Can you publish a collection as a unit?
  3. Can collections contain other collections? This would enable arbitrary hierarchies.
  4. If groups have metadata, how does this get merged with their children? (I personally think this is going to be too complex, and metadata should just be duplicated between datasets)

nitrosx commented 6 months ago

Here are my personal opinions on the questions posted above by @sbliven. We are still discussing/brainstorming within our team, so I might adjust them soon.

  1. Who owns the collection? Can you make a collection of datasets you don't own? What other permissions are needed? The collection can be owned by any user. Yes, you can create a collection of datasets that you do not own, as long as you have access to them. When viewing a collection that lists datasets the user cannot access, those datasets are excluded from the list and the user should be informed that n datasets are not accessible. I would use the same owner/access-group schema as we have for datasets.

  2. Do collections have a PID like datasets? Can you retrieve a collection as a unit (presumably yes, this seems like a key feature)? Can you publish a collection as a unit? Collections should have a unique id. You should be able to retrieve a collection as a unit. I would implement the functionality to create a published data entry based on all the datasets referenced by a collection. Access to a collection can also be made public.

  3. Can collections contain other collections? This would enable arbitrary hierarchies. Absolutely yes. This will add flexibility that might be needed in the future. Closed loops will need to be detected and avoided.

  4. If groups have metadata, how does this get merged with their children? (I personally think this is going to be too complex, and metadata should just be duplicated between datasets) A collection's metadata is different from its datasets' metadata. Merging them is a user decision, and it can only be done on the user side based on the purpose of the task. (A rough sketch of one possible schema follows this list.)
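As a minimal sketch only, here is how these points could translate into a standalone collection entity. This is one option discussed at this stage of the thread (not the eventual proposal further down), and every field name here is illustrative rather than part of the current schema:

// Hypothetical, illustrative shape of a standalone collection entity.
interface DatasetCollection {
  pid: string;                 // unique id, analogous to a dataset PID
  name: string;
  ownerGroup: string;          // same ownership schema as datasets
  accessGroups: string[];
  isPublished: boolean;        // a collection can be made public
  datasetPids: string[];       // datasets grouped by this collection
  collectionPids: string[];    // nested collections, enabling hierarchies
  scientificMetadata: Record<string, unknown>; // collection-level metadata, kept separate from dataset metadata
}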

paulmillar commented 6 months ago

Personally, I like the idea of having a DatasetCollection concept. It is (potentially) a very powerful abstraction, and one that (I think) would solve a few different challenges that DESY is facing in adopting SciCat.

I'm not sure how obvious this is, but perhaps Proposal could be considered a specialisation of DatasetCollection. Perhaps, if the DatasetCollection concept is powerful enough, Proposal could be just a DatasetCollection with predefined meaning given to certain metadata fields.

A similar argument could be made for Instrument. I think it is fairly straightforward to see how this could be reformulated as a DatasetCollection.

Sample might be considered a DatasetCollection, if the metadata support is made sufficiently powerful. This would be useful, but I think this would require more effort.

Some thoughts on @nitrosx 's comments:

  1. Do collections have a PID? Potentially yes. I believe we (DESY) may have use-cases for this.
  2. I believe supporting a collection hierarchy would be very useful for DESY use-cases.
  3. Supporting collection-level metadata would also be very helpful. (Or, rather, not supporting metadata would greatly reduce the usefulness of the concept.) However, I don't think there's any need to merge collection metadata into dataset metadata. Instead, one could consider a dataset as a member of one or more collections; when viewing that dataset, these memberships would be shown (along with each collection's metadata), but this information would be distinct from the dataset's metadata.

One possible hierarchy is:

So, viewing dataset X would bring in all this information.

Also, following the child relationships would naturally support queries like what data was taken with this instrument? ...with this sample? ...under this proposal?

nitrosx commented 6 months ago

Thank you all for your contributions. Here is my proposed solution, which I am happy to open up for discussion and feedback:

If we apply the following definition for what we see as a collection:

A collection is a derived dataset where the transformation of the source datasets into the current dataset is nothing more than the "grouping" action.

We could implement the collection concept as a specialized type of dataset, a collection dataset. This approach can be implemented with minimal changes to the current code base and will allow us to address all the use cases that have been discussed above and during the meetings.

We would be able to implement the solution by following these steps, and it would allow us to maintain backward compatibility:

Refactoring the hierarchy proposed by @paulmillar, here is his example implemented as follows:

I am not sure about the entity that @paulmillar calls BAG allocation 52. We will need to discuss this entity further. Is it a collection of datasets or a collection of proposals? If it is the latter, it is outside the scope of the dataset collection concept.
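To make the proposal more concrete, a collection dataset under this definition could look roughly like the sketch below, reusing existing derived-dataset fields such as inputDatasets and usedSoftware. All values (pids, names, metadata) are purely illustrative:

// Illustrative only: a "collection" implemented as a specialized derived dataset.
const collectionDataset = {
  pid: "20.500.12345/collection-0001",
  type: "collection",              // new value alongside "raw" and "derived"
  datasetName: "XPS + TEM study of sample S-42",
  ownerGroup: "group-a",
  accessGroups: ["group-a", "beamline-staff"],
  inputDatasets: [                 // the grouped datasets (raw, derived, or other collections)
    "20.500.12345/raw-xps-0007",
    "20.500.12345/raw-tem-0003",
    "20.500.12345/derived-xps-0012",
  ],
  usedSoftware: [],                // the "transformation" is nothing more than the grouping action
  scientificMetadata: { campaign: "2024-spring" },
};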

nitrosx commented 6 months ago

current_dataset_schema

nitrosx commented 6 months ago

proposed_dataset_schema

nitrosx commented 6 months ago

The previous two images are just a rough draft of the changes that I foresee need to happen.

paulmillar commented 5 months ago

Thanks @nitrosx, your proposal looks good to me.

I believe it will allow SciCat to support DESY's beamtime allocation concept.

Here are a few comments.

1. Hierarchy

Just to check I've understood correctly :-)

From your proposed definition, a collection is a derived dataset, which itself is a dataset. Therefore, the definition allows a collection to be built from other collections; i.e., this would (in principle) allow the formation of an arbitrary hierarchy of datasets.

Nice.
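One practical consequence of allowing arbitrary hierarchies is the earlier point about closed loops: adding a dataset to a collection would need a reachability check first. A minimal sketch, assuming a hypothetical helper that returns the member pids grouped by a given dataset:

// Hypothetical lookup: returns the pids that a (collection) dataset groups.
type MembersLookup = (pid: string) => Promise<string[]>;

// Reject adding `memberPid` to `collectionPid` if the collection is already
// reachable from the member, i.e. if the new edge would close a loop.
async function wouldCreateCycle(
  collectionPid: string,
  memberPid: string,
  getMembers: MembersLookup,
): Promise<boolean> {
  const stack = [memberPid];
  const seen = new Set<string>();
  while (stack.length > 0) {
    const pid = stack.pop()!;
    if (pid === collectionPid) return true;
    if (seen.has(pid)) continue;
    seen.add(pid);
    stack.push(...(await getMembers(pid)));
  }
  return false;
}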

2. Kinds of collections

I think there may be SciCat instances with different kinds of collections (I'm deliberately avoiding the word "type" :-). In my example hierarchy, there were "Beamtime Allocations" and "Desired Conditions" as two (distinct) kinds of collection. However, the exact dataset hierarchy may end up being facility-specific.

It might be helpful if collections could be identified by their type, I suspect mostly for UI purposes.

The kind of collection would help the researcher understand under which context the dataset(s) were combined, when viewing an individual collection. A researcher might also like to filter their view of multiple collections by showing only a specific kind of collection.

A collection's Scientific Metadata (what data is available and its meaning) might also depend on which kind of collection it is.

Under the proposal, the type field supports only three valid values, so this field would not be able to describe the kind of a collection. I see there's a classification field. Might this be used to store the kind-of-collection information for collection datasets?

Otherwise, the kind-of-collection information could be stored as a field within the scientific metadata, but I think that might be sub-optimal.
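Purely for illustration, the two options would look something like this (the collectionKind key and its value are made up):

// Option A (illustrative): reuse the existing classification field.
const optionA = { type: "collection", classification: "collectionKind=beamtimeAllocation" };

// Option B (illustrative): keep the kind inside the scientific metadata.
const optionB = { type: "collection", scientificMetadata: { collectionKind: "beamtimeAllocation" } };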

3. What is a collection?

This is just a very minor comment.

You wrote:

We could implement the collection concept as a specialized type of dataset, a collection dataset.

This confused me slightly. I read your proposed definition for a collection as indicating it was a specialisation of derived dataset, rather than a specialisation of dataset.

I think either approach would be reasonable, but (to me) there seemed to be an inconsistency in what you wrote -- or did I miss something?

nitrosx commented 5 months ago

@paulmillar thank you for your feedback. Here are my answers to your points:

  1. Hierarchy You are correct!!! The hierarchy can be multilevel, so a collection can group any type of dataset, including collection ones.

  2. Kinds of collections If we implement the collection as a type of dataset, the field type is used to describe the dataset type. At the moment, the collection kind could only be stored in the Scientific Metadata. I'm not opposed to adding a new field, although I would like to keep the deviations (in terms of dedicated fields) between the three types of dataset to a minimum. We should brainstorm about the best solution.

  3. What is a collection You are right in your observation: the collection dataset concept is a specialization of the derived dataset one. That said, I am envisioning that they will be implemented as three different types of dataset, i.e. the dataset field type can assume the following three values: raw, derived and collection (a sketch of this follows).
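A sketch of what this could look like, assuming the backend keeps a dataset-type enum along the current raw/derived lines (the exact name and location in the code base may differ):

// Illustrative sketch of the extended dataset-type enum.
export enum DatasetType {
  Raw = "raw",
  Derived = "derived",
  Collection = "collection", // new value for collection datasets
}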

Feedback is encouraged

paulmillar commented 5 months ago

Thanks for your feedback, @nitrosx.

On point 2.

I think simply using Scientific Metadata could work, but it might still be worth brainstorming some ideas to see:

  1. whether other people see a need to identify different kinds of collections;
  2. how they would like to identify them.

One idea was to add support for tagging datasets; that is, providing the ability for a dataset to have an arbitrary set of keywords associated with it. These would be facility-specific keywords that group together datasets, probably based on some facility-specific procedure; for example, collections that are a "Beamtime Allocation" could be tagged beamtime.

A dataset could have multiple tags.

For example, a dataset may have an associated instrument. I know some people consider (and use) "instrument" and "beamline" as synonyms, but I'm not sure this is universal. Personally, I would see the beamline as part of the infrastructure and the detector as the instrument. If so, then tagging would allow us to identify with which beamline a particular dataset was taken (datasets from P08 are tagged P08).

In this way, tagging might also be useful for derived datasets and raw datasets, which fits your comment about minimising the deviations between the different types of dataset.

Of course, this could all be done within the Scientific Metadata, so it would be interesting to hear from other facilities on whether they see this as useful and something that could be promoted from Scientific Metadata to core metadata.

Cheers, Paul

nitrosx commented 5 months ago

@paulmillar Datasets can already be tagged. The keywords field is a list of arbitrary user-defined tags.
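For example, assuming keywords is stored as a string array on the dataset document, a facility-specific tag can already be used to select datasets with an ordinary Mongo-style filter (collection name and tag are illustrative):

// Illustrative only: MongoDB matches array fields by containment, so a plain
// equality filter on keywords selects every dataset tagged "P08".
const beamlineFilter = { keywords: "P08" };

// e.g. with the Node MongoDB driver (connection handling omitted):
//   const hits = await db.collection("Dataset").find(beamlineFilter).toArray();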

paulmillar commented 5 months ago

Thanks for the info @nitrosx. Sorry, I had forgotten that there was a keywords field.

Out of interest, what is the AuthZ model for keywords? Can a non-facility person (e.g., the PI on the investigation) modify them?

I was thinking of a dataset having a set of tags that the facility can assign, but which the PI (or other non-facility people associated with the dataset) can't modify.

For example, if we were to add P08 as a keyword on a dataset, we wouldn't want someone to accidentally remove the information or change it to P06.

nitrosx commented 5 months ago

@paulmillar At the moment, if a user has editing permissions on the dataset, she can add and remove keywords as she pleases. We do not have fine-grained authorization control over individual keywords.

That said, with the CASL library, the authorization system could be extended to cover your use case. I do not know the effort that it would require. If you feel particularly strongly about your use case, please open a new issue to start a discussion. Of course, we will be happy to discuss it. We will be even happier to accept a contributed PR implementing such functionality.
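For reference, CASL's field-level rules can express roughly this kind of restriction. The sketch below is illustrative only and is not the actual SciCat ability factory; the user shape and subject name are assumptions:

import { AbilityBuilder, createMongoAbility } from "@casl/ability";

// Illustrative user shape; not the actual SciCat user model.
interface User {
  groups: string[];
  isFacilityStaff: boolean;
}

function defineDatasetAbility(user: User) {
  const { can, cannot, build } = new AbilityBuilder(createMongoAbility);

  // Users with edit rights can update datasets owned by their groups...
  can("update", "Dataset", { ownerGroup: { $in: user.groups } });

  // ...but the keywords field stays read-only for non-facility users.
  if (!user.isFacilityStaff) {
    cannot("update", "Dataset", ["keywords"]);
  }

  return build();
}

A check would then look like ability.can("update", subject("Dataset", dataset), "keywords"), with subject also imported from @casl/ability.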

dylanmcreynolds commented 5 months ago

Sorry to muddy these waters. I've been thinking along the lines of triple stores (not my specialty), where facilities could establish the relationships between items separately. I've been thinking recently about how the raw/derived distinction is quite useful but possibly too simplistic. Quite often there are different categories of processed datasets, often (but not always) created serially. Brian Pauw highlighted this in his talk at the SciCat meeting last year.

So, you could say things like:

I like the flexibility here, and the ability to keep track of processing steps.

Of course, mongo is probably a terrible triple store database. Creating one "store" collection with separate documents for each relationship might make mongo unhappy at scale. So, what if this Collections document looked something like this:

{
    "metadata": {
       "name": "an_experiment",
       "creationDateTime": "...",
       "lastUpdateDateTime": "..."
    },
    "relationships": [
          {"s": "[id of RawDataset1]", "p": "NormalizationStep", "o": "[id of DerivedDataset1]"},
          {"s": "[id of DerivedDataset1]", "p": "MajorDataReductionStep", "o": "[id of DerivedDataset2]"},
          {"s": "[id of RawDataset1]", "p": "Experiment1", "o": "[id of RawDataset2]"}
    ]
}

EDITED: previous version of the example swapped p and o

FWIW, for the example, I adopted this which I actually know nothing about. :)

Going even further, these links could describe things like proposals, instruments, or URLs to objects outside of SciCat like DOIs, or whatever.
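Concretely, such links could just be more triples in the same relationships array; the predicate names and ids below are made up for illustration:

// Illustrative only: relationships pointing at a proposal and at an external DOI.
const extraRelationships = [
  { s: "[id of RawDataset1]", p: "collectedUnder", o: "[id of Proposal1]" },
  { s: "[id of DerivedDataset2]", p: "publishedAs", o: "https://doi.org/..." },
];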

paulmillar commented 5 months ago

Hi @nitrosx, Thanks for the clarification. I guess that the keywords field is intended to hold ancillary information, and not information that drives core business. For example, if a facility would like to provide an overview of all datasets produced by a specific beamline then this shouldn't be done by tagging datasets with a beamline-specific value in the keywords field.

If so, then this is (of course) perfectly fine.

As you say, it might be possible to tweak the AuthZ model using CASL, but (to me) that feels like the wrong approach: if keywords is intended to capture user-oriented (perhaps user supplied) information then it shouldn't be repurposed for a facility-internal task.

Stepping back slightly, the goal here would be to allow a facility to identify different kinds of collections. Tagging (allowing a set of facility-specific keywords) is just one way this could be supported. I'm sure there are other ways.

paulmillar commented 5 months ago

Hi @dylanmcreynolds,

I have rather mixed feelings about triple stores. On the one hand, I am quite fascinated by the sheer expressive power of them, particularly when adopting the well-established knowledge-capturing languages (RDF / RDFS / OWL / SKOS, etc.). On the other hand, I see that (except in specific niche domains, such as JSON-LD in webpages) almost nobody uses them when storing metadata. I don't think I've found a definitive reason for this, but I've read (what feel like rumours) that triple stores just don't scale --- worse than RDBMSs (supporting ACID). I don't know if that is really true though: there seems to be some effort going into exploring or dispelling this (e.g., see this page).

In terms of using Mongo as a triple store: I also have no experience. It might do better than a dedicated triple store (for particular use-cases), but maybe not. I suspect it makes most sense when somebody is already providing a document store service and you want to take advantage of that service to store your graphs.

Separate from the implementation details, what you're describing (to me) sounds like building an ontology for holding the SciCat metadata ... or, at least, starting to go in that direction.

As you might guess, personally, I like the idea of using a triple-store for storing metadata (as I said, it's really powerful); however, I've noticed that all the catalogues I've seen don't do this. I don't really understand why this is, but adopting an RDF-like structure would risk going "off the well trodden path".

This is what I like about @nitrosx 's proposal. It is a relatively minor, incremental approach: a (relatively) low-risk strategy, which is important for software that is already in production.

Your last comment ("Going even further these links could describe things like proposals, instruments [...]") is somewhat similar to my observation that these concepts are really nothing more than a collection (all datasets of some proposal, all datasets collected with some instrument, ...) with some specific metadata.

This might have merit (in terms of simplifying the code, with corresponding benefits), but I still appreciate @nitrosx's approach where these are left as-is. They can always be updated in the future, if it makes sense.

dylanmcreynolds commented 5 months ago

Hi @paulmillar,

I understand your hesitancy about triple stores. It's an area in which I have no hands-on experience. I suspect we don't see examples of triple stores in Mongo because it's a bad idea, at least for the intended purpose of "establishing relationships across the data in the world". I proposed this for the more limited case of a DatasetCollection collection in Mongo that maintains flexible links between datasets. I then went towards triple stores and the JSON representation I linked to mostly because I'd rather not invent a new protocol if one already exists, and it actually seems to describe links pretty well. Maybe it's not the right fit here, maybe it is.

Perhaps my idea of other datatypes here went too far, but honestly, I can see experiments that span multiple instruments, or that were conducted under multiple proposals, for which we would want to capture the relationships.

So, to step back: this original issue asked for a link store that sits between proposals and datasets. Since I think there are use cases that cross proposals, I'd like to nudge the thinking toward a higher-level collection.