SciCatProject / scicat-backend-next

SciCat Data Catalogue Backend
https://scicatproject.github.io/documentation/
BSD 3-Clause "New" or "Revised" License

Proposed types for Scientific Metadata #984

Open nitrosx opened 7 months ago

nitrosx commented 7 months ago

This issue is an attempt to summarize the issues and discussions following PR #925 and issues #924 and #940.

In the biweekly meeting, the community expressed support for having multi-dimensional quantities in the metadata, although there was concern that, with no limit on the dimensionality of the array, a user could save an entire time series as a metadata value.

To address the use case presented in #924 while avoiding lengthy time series in metadata, I proposed to allow the following metadata types (italics indicate new types):

Search on the types number, string, and quantity will continue to behave as it does today. I proposed the following search syntax and behavior for the new types:

All the relevant tests must be added to the BE and FE.

This is just a proposal. Please comment below and elaborate further on the ideas.

bpedersen2 commented 7 months ago

Sounds good to me and should cover a good portion of our use cases.

jkotan commented 7 months ago

Hi @nitrosx, I am not sure I understand the idea. Would you like to add unit support for only the above-mentioned types, or forbid all other types in scientific metadata (I hope not)?

At DESY we have two types of datasets: scan datasets and measurement datasets. The latter groups metadata from the scans. In the measurement metadata we often use lists to aggregate metadata from different scans, e.g. a list of ScanCommand or inputDatasets (in the case where measurement datasets are raw-like).

Also, we had a discussion with our beamline scientists about how to store a 3x3 hkl matrix. Currently we store it encoded in a string, but often it is stored as a list of 9 numbers. I could also imagine that someone would like to store 4-vectors, which are much more natural in theory or particle physics.

nitrosx commented 7 months ago

@jkotan unit support will be available for the following types:

Regarding your examples of scan and measurement datasets, could you please provide an example of each? I'm not sure I understand what their differences are and how relevant they are to allowing multi-dimensional quantities in the metadata.

Regarding your last two points, my main concern (and the concern of most of the collaborators) is that by allowing n-dimensional quantities in the metadata, we will exhaust the available space in the document, and time series will sprawl through the metadata as data slowly seeps in. I will be happy to discuss!

jkotan commented 7 months ago

Hello @nitrosx, at DESY we started by creating a dataset for each scan (scan dataset). However, we perform a lot of scans, so our IT has reported that our scan datasets use a lot of DB resources. Therefore we have started thinking about creating a dataset not for a single scan but for a group of related scans, i.e. some kind of measurement, e.g. a 'calibration'. For such a measurement we create a measurement dataset where we group the most important scientificMetadata from our scans. For string quantities which are constant it is easy:

scientificMetadata:
  DOOR_proposalId: "99991173"

For number quantities which are almost constant we store the average, min, max, and std:

scientificMetadata:
  source_current: 
    counts: 3
    max: 0.02578197419643402
    min: 0.025225341320037842
    std: 0.0002973251885217321
    unit: "mA"
    value: 0.02556405154367288
    valueSI: 0.00002556405154367288
    unitSI: "A"
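
For illustration, here is a minimal TypeScript sketch of how such a summary could be computed from the per-scan values (the function name and interface are my own assumptions, not DESY's actual code; SI conversion is left out):

interface NumericSummary {
  counts: number;
  min: number;
  max: number;
  std: number;
  value: number; // the average over all scans
  unit: string;
}

// Summarize a nearly-constant numeric quantity over a series of scans.
function summarizeQuantity(values: number[], unit: string): NumericSummary {
  const counts = values.length;
  const value = values.reduce((a, b) => a + b, 0) / counts;
  // Population standard deviation around the mean.
  const variance = values.reduce((acc, v) => acc + (v - value) ** 2, 0) / counts;
  return {
    counts,
    min: Math.min(...values),
    max: Math.max(...values),
    std: Math.sqrt(variance),
    value,
    unit,
  };
}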

However, important scan quantities which differ for each scan, e.g. the scan command, need to be stored as a list of strings:

scientificMetadata:
  ScanCommand: 
    - "ascan exp_mot02 0.0 6.0 6 0.1"
    - "ascan exp_mot01 0.1 6.0 6 0.1"
    - "ascan exp_mot02 0.0 5.0 6 0.1"

Similarly, our users/beamline scientists have requested that we also aggregate numerical physical quantities which change from scan to scan, i.e. store them in a list with one value for each scan in a measurement. For such quantities an average is not useful (the values follow neither a Poisson nor a Gaussian distribution), e.g. some motor positions.

The number of scans in a measurement can vary, e.g. from 1 to 1000. The aim is to reduce storage size, i.e. to avoid storing duplicated metadata across a series of similar scans.

Of course, all our solutions are still under discussion, so we don't know yet what the structure of our final production datasets will be.

nitrosx commented 7 months ago

@jkotan thank you so much for the explanation and the examples.

I think that a possible solution for you would be the following flow:

Producing the metadata for the measurement datasets, as you describe in your post, implies that some metadata entries will go from a single value to a list or time series. At this point, I would start to ask myself whether the resulting list or time series is still just metadata or whether it has become data. One possible solution is for the resulting measurement dataset to have an additional data file with the list / time series, while in the metadata we insert an entry with summary information, like min, max, and the number of values or the delta.

If we apply this to your third example:

scientificMetadata:
  ScanCommand: 
    - "ascan exp_mot02 0.0 6.0 6 0.1"
    - "ascan exp_mot01 0.1 6.0 6 0.1"
    - "ascan exp_mot02 0.0 5.0 6 0.1"

the measurement dataset will have an additional data file containing the full list of scan commands, like:

- scan: 1
  command: "ascan exp_mot02 0.0 6.0 6 0.1"
- scan: 2
  command: "ascan exp_mot01 0.1 6.0 6 0.1"
- scan: 3
  command: "ascan exp_mot02 0.0 5.0 6 0.1"
- ...

while in the metadata we would insert a summary of those. The metadata fields of the summary will depend on what is important for users when they are searching for such datasets. A possible set of metadata entries could be:

scientificMetadata:
  scan_command_main: "ascan"
  scan_command_motors: ["exp_mot01", "exp_mot02"]
  scan_command_parameter_1: 0.0
  scan_command_parameter_2_min: 5.0
  scan_command_parameter_2_max: 6.0
  scan_command_parameter_2_number_of_values: 2
  scan_command_parameter_3: 6
  scan_command_parameter_4: 0.1
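
For what it's worth, here is a rough TypeScript sketch of how such a summary could be derived from the scan commands (the parsing rules, and the choice to take varying parameters from the first command, are my own assumptions, not an agreed-upon schema):

// Parse commands like "ascan exp_mot02 0.0 6.0 6 0.1" and build a
// searchable summary. Assumes every command shares the same verb and
// positional-parameter layout.
function summarizeScanCommands(commands: string[]) {
  const parsed = commands.map((c) => c.trim().split(/\s+/));
  const motors = [...new Set(parsed.map((p) => p[1]))].sort();
  const endPositions = parsed.map((p) => Number(p[3])); // varies per scan
  return {
    scan_command_main: parsed[0][0],
    scan_command_motors: motors,
    scan_command_parameter_1: Number(parsed[0][2]),
    scan_command_parameter_2_min: Math.min(...endPositions),
    scan_command_parameter_2_max: Math.max(...endPositions),
    scan_command_parameter_2_number_of_values: new Set(endPositions).size,
    scan_command_parameter_3: Number(parsed[0][4]),
    scan_command_parameter_4: Number(parsed[0][5]),
  };
}

Applied to the three example commands above, this reproduces the summary entries shown in the previous block.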

sbliven commented 7 months ago

This seems like a particular schema that should be validated under #966.

dylanmcreynolds commented 7 months ago

A couple of notes. First, as @sbliven pointed out in the regular developer meeting, a lot of what you're trying to accomplish is probably search and not storage/tagging.

Second, Mongo is not a great engine for storing very large numerical data. Serializing large 3D data into JSON so that it can go into Mongo is very inefficient, and searching it would likely be just as inefficient.

Another point to note is that your definition of data type mixes the concepts of dimension, shape, and datatype. FWIW, there are widely used frameworks out there that already have conventions for this. If your users are using scientific Python, take a look at NumPy and its arrays. Its dtype doesn't give you everything you've specified (definitely not ISO 8601), but it gives you a lot.

We plan to use tiled for serving arrays and tables from source data, and it will sit next to SciCat. This does not address your search issue, however.

minottic commented 7 months ago

Could these be interesting as a way to delegate complex/custom search needs to Elasticsearch, and thus to the adopting facility?

From what I understood from a quick read, one can create custom pipelines ("analysers") in Elasticsearch which are executed before index creation. These analysers sequentially apply a set of "filters", which can be user-defined. One application could be this issue, as unit conversion on arrays and sequential search could all be covered by Elasticsearch with a custom filter and analyser.
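
As a rough sketch of the mechanism (assuming the v8 @elastic/elasticsearch Node.js client; the index name, analyser name, and synonym rules are hypothetical), a custom analyser is declared in the index settings and then referenced from a field mapping:

import { Client } from "@elastic/elasticsearch";

const client = new Client({ node: "http://localhost:9200" });

async function createMetadataIndex() {
  await client.indices.create({
    index: "scientific-metadata", // hypothetical index name
    settings: {
      analysis: {
        filter: {
          // User-defined token filter: treat unit spellings as equivalent.
          unit_synonyms: {
            type: "synonym",
            synonyms: ["ma, milliampere, milliamp"],
          },
        },
        analyzer: {
          // Custom analyser chaining built-in and user-defined filters.
          metadata_analyzer: {
            type: "custom",
            tokenizer: "standard",
            filter: ["lowercase", "unit_synonyms"],
          },
        },
      },
    },
    mappings: {
      properties: {
        ScanCommand: { type: "text", analyzer: "metadata_analyzer" },
      },
    },
  });
}

Actual unit conversion of numeric array values would go beyond token filters, though; that would likely need an ingest pipeline or a custom plugin.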

sbliven commented 7 months ago

General principle

Allow me to repeat and expand on some general comments from the verbal discussion:

I think we should commit to not enforcing any particular structure on the scientificMetadata for a generic SciCat instance. Instead, we should think of features depending on metadata structure as "progressive enhancements," where SciCat can provide additional functionality for datasets that do follow some standard structure. I would suggest organizing our issues not around the metadata structure but rather around what features we would like to implement.

The major features which depend on metadata structure are:

Any I missed?

Search

Getting back to the issue at hand, which I think relates only to the two search features: @nitrosx has a good summary of the data types/shapes we would like to search by, as well as some of the operators we want for each type. I think the next step is to look at available search technologies (LoopBack, Elasticsearch, GraphQL, etc.) and see whether they could support these. If so, they likely already have some preferred syntax for specifying the data types (e.g. JSON-LD). A strawman of what such a syntax might look like follows below.
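
Purely as a strawman (this syntax is hypothetical, not something any of these technologies provides out of the box), a unit-aware quantity filter could look like a Mongo-style clause in TypeScript:

// Hypothetical filter: find datasets whose sample temperature,
// once normalized to SI units, lies below 4 K.
const hypotheticalFilter = {
  "scientificMetadata.sample_temperature": {
    $lt: { value: 4, unit: "K" },
  },
};

Whatever syntax is chosen, the backend would need to normalize both sides of the comparison to SI (as the valueSI/unitSI fields in the DESY example above already do) before delegating to the search engine.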

nitrosx commented 7 months ago

I spent some time reading about dimensionality in numpy. Here are the two resources that I read:

This helped me clarify the quantity case reported in the original post above. I would like to clarify what I meant by _x_d-quantity in metadata. The goal is to allow users to create metadata entries of type quantity (aka measurement with units) with dimensionality 1 and size 1, 2, or 3. To translate into numpy terminology, a quantity is an array with one dimension and an allowed size of 1, 2, or 3. This will require specifying the query syntax for quantity; I agree with @sbliven that we should do some research to find out if there is any best practice or standard, and adopt it.
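
To illustrate that constraint, a small TypeScript sketch (the type names are mine, not part of the proposal):

// A quantity is a 1-dimensional array of size 1, 2, or 3, plus a unit,
// mirroring a numpy array with ndim == 1 and 1 <= size <= 3.
type QuantityValue = [number] | [number, number] | [number, number, number];

interface QuantityMetadata {
  value: QuantityValue;
  unit: string; // e.g. "mA"
}

// Allowed:  { value: [0.025], unit: "mA" }
// Allowed:  { value: [0.1, 0.2, 0.3], unit: "deg" }
// Rejected: { value: [1, 2, 3, 4], unit: "m" } -- size 4 exceeds the limit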

bpedersen2 commented 7 months ago

PSI probably wants support for entries like in https://discovery.psi.ch/datasets/20.500.11935%2Fc5bce731-55fc-4c57-b049-6c32ad6601c4

sbliven commented 7 months ago

> PSI probably wants support for entries like in https://discovery.psi.ch/datasets/20.500.11935%2Fc5bce731-55fc-4c57-b049-6c32ad6601c4

Great to see "dummy" test data in our production instance 🙄

PSI certainly has use cases where we might include vectors. However, I think it's an ongoing discussion whether this should be metadata in SciCat.

@bpedersen2 How would you suggest querying this data? Is there a preexisting query language that would support vectors, ranges, quantities, etc? I tried following your code for SI quantities but didn't grasp how it gets integrated into search or the frontend.

nitrosx commented 6 months ago

I would argue that vectors this long are not really metadata. They should live in the data, with a summary property in the metadata. Something like:

bpedersen2 commented 6 months ago

> @bpedersen2 How would you suggest querying this data? Is there a preexisting query language that would support vectors, ranges, quantities, etc? I tried following your code for SI quantities but didn't grasp how it gets integrated into search or the frontend.

Currently this is not supported.

How searching currently works:

FE: There are fixed terms defined for possible relations: (https://github.com/SciCatProject/frontend/blob/3e0aee212c953c56511d46307ccf30862d06162f/src/app/state-management/models/index.ts#L81)

type ScientificConditionRelation =
  | "EQUAL_TO_NUMERIC"
  | "EQUAL_TO_STRING"
  | "GREATER_THAN"
  | "LESS_THAN";

These are passed to the BE together with a field spec (lhs) and the user-supplied value (rhs).

BE: These strings are used in a switch to generate a suitable Mongo query.
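
A simplified sketch of that pattern (the function shape and the scientificMetadata field path are illustrative, not the actual backend code):

type ScientificConditionRelation =
  | "EQUAL_TO_NUMERIC"
  | "EQUAL_TO_STRING"
  | "GREATER_THAN"
  | "LESS_THAN";

// Map a (lhs, relation, rhs) triple onto a Mongo filter clause.
function buildMongoCondition(
  lhs: string,
  relation: ScientificConditionRelation,
  rhs: string | number,
): Record<string, unknown> {
  const field = `scientificMetadata.${lhs}.value`; // illustrative path
  switch (relation) {
    case "EQUAL_TO_NUMERIC":
      return { [field]: { $eq: Number(rhs) } };
    case "EQUAL_TO_STRING":
      return { [field]: { $eq: String(rhs) } };
    case "GREATER_THAN":
      return { [field]: { $gt: Number(rhs) } };
    case "LESS_THAN":
      return { [field]: { $lt: Number(rhs) } };
  }
}

Supporting the quantity, range, and vector types proposed above would mean extending both this relation list and the corresponding switch.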