TheELNConsortium / TheELNFileFormat

Specification for the ELN File Format
MIT License

Flexible metadata in .eln files #58

Open FlorianRhiem opened 11 months ago

FlorianRhiem commented 11 months ago

Currently, the .eln format does not have a unified way of exporting flexible metadata. Instead, most ELNs export a data structure specific to that ELN in JSON format which contains various information about a dataset, including some flexible metadata. As they are (mostly) representations of internal models, they vary quite a bit.

Motivation

While it is already useful to be able to reference samples, measurements and other objects from other ELNs with some generic metadata such as the creation and modification times and the author, it would be even better if we could exchange flexible metadata about these. In the last meeting, we briefly discussed the goal of a "gold standard" experiment that can be represented as an .eln file, imported and exported by the various ELNs. For this, we should be able to exchange information such as instrument or process parameters. We cannot expect to strictly define these; instead, they should map a (textual) identifier to data of some type.

Ideas / Suggestions

For mapping identifiers to values, the PropertyValue should be useful, as it can map its propertyID to its value, which can be a boolean, text, a number or the generic StructuredValue type, and also supports units and a human-readable version as a fallback. So, if we had to store a temperature with the identifier target_temperature, it could be represented as:

{
  "@type": "PropertyValue",
  "propertyID": "target_temperature",
  "value": 293.15,
  "unitCode": "KEL",
  "unitText": "K",
  "description": "293.15 K"
}

while a boolean instrument setting could be represented like this:

{
  "@type": "PropertyValue",
  "propertyID": "vacuum_enabled",
  "value": false
}

If we provided an array of such property values, we could support a flat mapping of identifiers to values. Such an array is already part of the Dataset type we use for datasets in the ro-crate-metadata.json, in the property variableMeasured; however, this use case goes beyond the "variables that are measured in some dataset". So we could either use that property and "stretch" its definition by a fair bit, use another existing property, or branch off there and define a custom property.
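A minimal Python sketch of the first option, attaching the two property values above to a Dataset node through variableMeasured. The helper name property_value and the dataset's @id and name are made up for illustration; only the PropertyValue fields come from the examples above.

```python
import json


def property_value(property_id, value, **extra):
    """Build a schema.org PropertyValue as a plain dict (hypothetical helper)."""
    node = {"@type": "PropertyValue", "propertyID": property_id, "value": value}
    node.update(extra)  # optional fields such as unitCode, unitText, description
    return node


# The "@id" and "name" below are placeholders, not taken from a real export.
dataset = {
    "@id": "./example-measurement/",
    "@type": "Dataset",
    "name": "Example measurement",
    "variableMeasured": [
        property_value(
            "target_temperature",
            293.15,
            unitCode="KEL",
            unitText="K",
            description="293.15 K",
        ),
        property_value("vacuum_enabled", False),
    ],
}

print(json.dumps(dataset, indent=2))
```

Whether variableMeasured is the right place for this is exactly the open question here; the sketch only shows what the flat mapping would look like if it were.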

As PropertyValue objects can contain a value of the StructuredValue type, of which PropertyValue is a subtype, it might also be possible to implement nested data structures like this. Alternatively, the structure could be represented in the identifier.


What are your thoughts on storing flexible metadata in PropertyValue objects? Which solution for attaching these to the datasets do you prefer? How should we deal with nested metadata? Would you prefer to store flexible metadata outside the ro-crate-metadata.json entirely or in a custom format instead?

NicolasCARPi commented 11 months ago

In eLab, what you suggest would probably work well with the "Extra fields" feature: https://doc.elabftw.net/metadata.html#example-with-number-and-units.

And PropertyValue seems like a good fit.

> How should we deal with nested metadata?

Flatten/normalize it.

> Would you prefer to store flexible metadata outside the ro-crate-metadata.json entirely or in a custom format instead?

No, we should put as much as we can in the ro-crate-metadata.json file, exactly as you're suggesting.

bronger commented 11 months ago

Unfortunately, I could not attend the latest video call, so I may not understand you correctly. But isn’t the actual problem for the receiving ELN to understand the semantics of those fields?

FlorianRhiem commented 11 months ago

> Unfortunately, I could not attend the latest video call, so I may not understand you correctly.

Prior to the meeting it had been suggested that we should pick an experiment that can be represented by an .eln file and imported/exported without significant loss of information. It should be similar when exported by all ELNs, which also has advantages such as making a comparison of the generated .eln files easier. We did not discuss much more than clarifying that this is a future goal. Storing flexible metadata about an experiment is a step in that direction.

> But isn’t the actual problem for the receiving ELN to understand the semantics of those fields?

This may be splitting hairs, but I do not think the receiving ELN has to understand the semantics, rather the users of the receiving ELN have to understand them. This can be achieved by utilizing the propertyID property, which can make use of whatever ontology exists in the specific field and whatever names or labels are associated with a specific value. In addition, the measurementTechnique and measurementMethod properties might help give some context, where applicable.

bronger commented 11 months ago

If you don’t care about semantics, I don’t see the problem in just displaying

> most ELNs export a data structure specific to that ELN

on the receiving side.

FlorianRhiem commented 11 months ago

I think the user experience is improved quite a bit if the ELN can display properties with identifiers the users can understand and values of a type the ELN knows how to display, rather than requiring that users read through various files in the hope of finding what they're looking for in a syntax and structure they know how to understand.

bronger commented 11 months ago

Ah, now I see the problem … thanks for the explanation!

FlorianRhiem commented 10 months ago

(Note: I tried to post this earlier, but the comment seems to have been lost. Apologies if it ends up as a duplicate.)

Here is the current ELN example with added flexible metadata: sampledb_export_with_flexible_metadata.zip (renamed to .zip so I can upload it in a GitHub comment)

I've implemented the export fairly directly, here are a few examples for data types exported as PropertyValue, along with some thoughts on those:

Text

{
  "value": "OMBE-1",
  "propertyID": "name",
  "name": "Sample Name"
}

Text is directly supported for value, so this is straightforward and can also serve as the fallback datatype as long as the data has a text representation.

Boolean

{
  "value": false,
  "propertyID": "checkbox",
  "name": "Checkbox"
}

This is also directly supported and fairly simple to implement.

Quantity

{
  "value": 5.0,
  "unitText": "\u00c5",
  "propertyID": "multilayer.0.films.0.thickness",
  "name": "Multilayers \u2192 0 \u2192 Films \u2192 0 \u2192 Film Thickness",
  "unitCode": "A11"
}

This is directly supported as well. I found UN/CEFACT codes a bit annoying to work with, but for those cases where a unit code does exist, they seem like they could genuinely avoid confusion between unit notations. This example also shows a fairly deeply nested property.
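One way to handle the "where a unit code does exist" case is to always export unitText and add unitCode only when a code is known. The Python sketch below assumes a small lookup table; UNIT_CODES and quantity are hypothetical names, and the table contains only the two codes that appear in this thread (KEL and A11), so a real exporter would need a much fuller mapping.

```python
# Only the two codes used in this thread; everything else falls back to unitText alone.
UNIT_CODES = {
    "K": "KEL",       # kelvin
    "\u00c5": "A11",  # angstrom
}


def quantity(property_id, magnitude, unit_text):
    """Build a PropertyValue for a quantity, adding unitCode only if one is known."""
    node = {
        "@type": "PropertyValue",
        "propertyID": property_id,
        "value": magnitude,
        "unitText": unit_text,
    }
    code = UNIT_CODES.get(unit_text)
    if code is not None:
        node["unitCode"] = code
    return node


thickness = quantity("multilayer.0.films.0.thickness", 5.0, "\u00c5")
rate = quantity("multilayer.0.films.0.elements.0.rate", 0.1, "\u00c5/s")
```

Here thickness carries both unitText and unitCode, while rate carries only unitText because the table has no entry for that unit.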

Datetime

{
  "value": "2017-02-24 11:56:00",
  "propertyID": "created",
  "name": "Creation Datetime"
}

This is just the date and time in UTC as it is used in SampleDB, though it might make sense to use ISO 8601 notation with time zone offset and possibly microsecond precision, just in case either of those are needed? In that case, this would be:

{
  "value": "2017-02-24T11:56:00.000000+00:00",
  "propertyID": "created",
  "name": "Creation Datetime"
}

There is no clear indication that this is a datetime rather than text beyond the format used, so a suggestion for how to clearly denote this as a datetime would be welcome. Then again, this format is unlikely to be matched by accident, so a regular expression (or just an attempt to parse it as a date) should suffice.
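The "just try to parse it" approach could look roughly like this in Python. This is a sketch under stated assumptions: the function name parse_datetime_or_none is made up, and the checks for a T separator and a time zone offset are a deliberately strict choice so that date-only strings and the older space-separated form stay plain text.

```python
from datetime import datetime


def parse_datetime_or_none(value):
    """Return a timezone-aware datetime if value looks like the ISO 8601 form above, else None."""
    if not isinstance(value, str):
        return None
    # Require the "T" separator at the usual position to avoid matching date-only text.
    if len(value) <= 10 or value[10] != "T":
        return None
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        return None
    # Treat values without an explicit offset as plain text.
    if parsed.tzinfo is None:
        return None
    return parsed


created = parse_datetime_or_none("2017-02-24T11:56:00.000000+00:00")
not_a_datetime = parse_datetime_or_none("OMBE-1")
```

An importer could fall back to displaying the value as text whenever this returns None.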

Object reference

{
  "value": "./objects/1",
  "propertyID": "sample",
  "name": "Sample"
}

This uses the .eln internal @id value for the referenced object. As value cannot be a Dataset, it isn't a direct reference. The dataset's URL might be an alternative, which would also make it easy to reference objects that are not included in the .eln export. In this case, this might be:

{
  "value": "http://localhost:5000/objects/1",
  "propertyID": "sample",
  "name": "Sample"
}

FlorianRhiem commented 8 months ago

I think those examples above all lack "@type": "PropertyValue". I can re-generate an example file with that fixed, if anyone wants to try importing the data.

SteffenBrinckmann commented 8 months ago

How do we want to flatten the nested metadata and unflatten it again afterwards? Do we want to use the right arrow that SampleDB is using (see the example below), or do we want to use '/', which I find more common?

Here is the SampleDB example, so you don't have to search:

{
  "value": "Seed Layer",
  "propertyID": "multilayer.0.films.0.name",
  "name": "Multilayers \u2192 0 \u2192 Films \u2192 0 \u2192 Film Name"
},
{
  "value": "Fe",
  "propertyID": "multilayer.0.films.0.elements.0.name",
  "name": "Multilayers \u2192 0 \u2192 Films \u2192 0 \u2192 Elements \u2192 0 \u2192 Element Name"
},
{
  "value": 0.09999999999999999,
  "unitText": "\u00c5/s",
  "propertyID": "multilayer.0.films.0.elements.0.rate",
  "name": "Multilayers \u2192 0 \u2192 Films \u2192 0 \u2192 Elements \u2192 0 \u2192 Rate"
},
{
  "value": 5.0,
  "unitText": "\u00c5",
  "propertyID": "multilayer.0.films.0.thickness",
  "name": "Multilayers \u2192 0 \u2192 Films \u2192 0 \u2192 Film Thickness",
  "unitCode": "A11"
},

FlorianRhiem commented 8 months ago

For that, we also have to differentiate between propertyID, the internal identifier, and name, the display text. In that example, name uses the right arrow as a user readable version, and propertyID uses a dot. The propertyID values in that example match the way these nested properties are also used for search in SampleDB, the name generation was simply copied from the Dataverse export for SampleDB. I think / would be fine, too. As long as it is clear for the users, I don't have a strong opinion there.

SteffenBrinckmann commented 8 months ago

Generally, I don't like double properties: storing propertyID and name, which are almost identical (except for the capitalization and the separator symbol). Inconsistencies between the two entries might then lead to strange behavior for the user. Uniqueness is important to me. I would suggest making propertyID required and the key that determines the hierarchy when unflattening. "name" is optional and just for display. Would that make sense?

FlorianRhiem commented 8 months ago

I think that is how it is intended, propertyID as the "commonly used identifier for the characteristic represented by the property" and name just being the generic name of a Thing.
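A minimal Python sketch of that convention, assuming the nested metadata arrives as plain dicts and lists: propertyID is derived from the path and is the only required key, while name is left out. The helper names flatten and to_property_values are hypothetical, and the separator is a parameter because the choice between . and / is still open at this point in the thread.

```python
def flatten(data, separator=".", prefix=""):
    """Yield (propertyID, value) pairs for every leaf of a nested dict/list structure."""
    if isinstance(data, dict):
        items = data.items()
    elif isinstance(data, list):
        items = enumerate(data)  # list entries are addressed by their index
    else:
        yield prefix, data
        return
    for key, value in items:
        path = f"{prefix}{separator}{key}" if prefix else str(key)
        yield from flatten(value, separator, path)


def to_property_values(data, separator="."):
    """Turn nested metadata into a flat list of PropertyValue dicts."""
    return [
        {"@type": "PropertyValue", "propertyID": property_id, "value": value}
        for property_id, value in flatten(data, separator)
    ]


nested = {"multilayer": [{"films": [{"name": "Seed Layer", "thickness": 5.0}]}]}
flat = to_property_values(nested)
```

For this input, the generated identifiers follow the same pattern as the SampleDB export quoted above, for example multilayer.0.films.0.name.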

NicolasCARPi commented 6 months ago

I started implementing metadata in the ro-crate file.

I added an elabftw_all property that contains everything that cannot fit in the normal properties. This will allow me to reimport it without loss of information if the source is an eLab instance.

EDIT

I'll probably remove that, and instead add elabftw_metadata to the Dataset as a string that I can import directly with no hassle. /end edit

This introduces the concept of namespaced custom properties. Other ELNs must ignore them, and they should be avoided as much as possible.

EDIT 2

ok so instead I'll use elabftw_metadata as propertyID of a PropertyValue

Currently (WIP) this is what it looks like for this input form:

[Screenshot of the input form: 2024-05-16-015557_3840x1080_scrot]

"variableMeasured": [
        {
          "propertyID": "elabftw_metadata",
          "description": "eLabFTW metadata JSON as string",
          "value": "{\"extra_fields\": {\"multi select\": {\"type\": \"select\", \"value\": \"Paris\", \"options\": [\"Paris\", \"Londres\", \"Tokyo\", \"Madrid\"], \"position\": 1, \"allow_multi_values\": true}, \"with comment\": {\"type\": \"text\", \"value\": \"yep\", \"position\": 0, \"readonly\": true, \"required\": true, \"description\": \"this is the description\", \"blank_value_on_duplicate\": true}, \"num with unit\": {\"type\": \"number\", \"unit\": \"unit 2\", \"units\": [\"unit 1\", \"unit 2\", \"unit 3\"], \"value\": \"23\", \"position\": 2, \"description\": \"yep\"}, \"a dropdown menu\": {\"type\": \"select\", \"value\": \"choice 2\", \"options\": [\"choice 1\", \"choice 2\", \"choice 3\"], \"position\": 3, \"required\": true, \"description\": \"this one does not allow multiple selection\", \"blank_value_on_duplicate\": true}, \"a straightforward text input\": {\"type\": \"text\", \"value\": \"it contains a text value\", \"position\": 4, \"description\": \"this is the default input\"}}}"
        },
        {
          "propertyID": "multi select",
          "value": "Paris",
          "description": null,
          "unitText": null,
          "valueReference": "select"
        },
        {
          "propertyID": "with comment",
          "value": "yep",
          "description": "this is the description",
          "unitText": null,
          "valueReference": "text"
        },
        {
          "propertyID": "num with unit",
          "value": "23",
          "description": "yep",
          "unitText": "unit 2",
          "valueReference": "number"
        },
        {
          "propertyID": "a dropdown menu",
          "value": "choice 2",
          "description": "this one does not allow multiple selection",
          "unitText": null,
          "valueReference": "select"
        },
        {
          "propertyID": "a straightforward text input",
          "value": "it contains a text value",
          "description": "this is the default input",
          "unitText": null,
          "valueReference": "text"
        }
      ]
    },

NicolasCARPi commented 5 months ago

Here is what it looks like currently:

      "variableMeasured": [
        {
          "propertyID": "elabftw_metadata",
          "description": "eLabFTW metadata JSON as string",
          "value": "{\"elabftw\": {\"display_m...[skipped for brevity]..."
        },
        {
          "propertyID": "Number",
          "valueReference": "number",
          "value": "",
          "description": "no units"
        },
        {
          "propertyID": "Type URL",
          "valueReference": "url",
          "value": "https://www.elabftw.net",
          "description": "a link (readonly)"
        },
        {
          "propertyID": "Just time",
          "valueReference": "time",
          "value": "17:00",
          "description": "tea time"
        },
        {
          "propertyID": "Some date",
          "valueReference": "date",
          "value": "2024-07-14",
          "description": "is a date"
        },
        {
          "propertyID": "Type user",
          "valueReference": "users",
          "value": 1,
          "description": "this is a link to a user"
        },
        {
          "propertyID": "A checkbox",
          "valueReference": "checkbox",
          "value": "on",
          "description": "is checked"
        },
        {
          "propertyID": "Email input",
          "valueReference": "email",
          "value": "louis@example.com",
          "description": "type email"
        },
        {
          "propertyID": "Date and time",
          "valueReference": "datetime-local",
          "value": "2024-07-14T13:37",
          "description": "datetime description"
        },
        {
          "propertyID": "Radio buttons",
          "valueReference": "radio",
          "value": "Oui",
          "description": "radio description"
        },
        {
          "propertyID": "Type resource",
          "valueReference": "items",
          "value": 208,
          "description": "This is a link to a resource"
        },
        {
          "propertyID": "A dropdown menu",
          "valueReference": "select",
          "value": "Choice 1",
          "description": "Single select"
        },
        {
          "propertyID": "Text input name",
          "valueReference": "text",
          "value": "some text",
          "description": "type text + all attributes"
        },
        {
          "propertyID": "Type experiment",
          "valueReference": "experiments",
          "value": 373,
          "description": "This is a link to an experiment"
        },
        {
          "propertyID": "Number with units",
          "valueReference": "number",
          "value": "",
          "description": "this one has units",
          "unitText": "mM"
        },
        {
          "propertyID": "Unchecked checkbox",
          "valueReference": "checkbox",
          "value": "",
          "description": "this one is not checked"
        },
        {
          "propertyID": "Multi dropdown menu",
          "valueReference": "select",
          "value": "Option 1",
          "description": "Allows multiple selection"
        }
      ]
    }
  ]
}

edit: realizing now that dropdown menus lose their other options...
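The options are still present in the elabftw_metadata string, so an importer that knows this namespaced property could recover them from there. A rough Python sketch, assuming the extra_fields layout shown in the earlier WIP example (the structure may differ in the current export, and load_elabftw_metadata is a hypothetical helper name):

```python
import json


def load_elabftw_metadata(variable_measured):
    """Return the decoded elabftw_metadata JSON, or None if the property is absent."""
    for pv in variable_measured:
        if pv.get("propertyID") == "elabftw_metadata":
            return json.loads(pv["value"])
    return None


# Abbreviated from the earlier example in this thread; not a complete export.
variable_measured = [
    {
        "propertyID": "elabftw_metadata",
        "description": "eLabFTW metadata JSON as string",
        "value": json.dumps(
            {
                "extra_fields": {
                    "a dropdown menu": {
                        "type": "select",
                        "value": "choice 2",
                        "options": ["choice 1", "choice 2", "choice 3"],
                    }
                }
            }
        ),
    },
    {"propertyID": "a dropdown menu", "value": "choice 2", "valueReference": "select"},
]

metadata = load_elabftw_metadata(variable_measured)
options = metadata["extra_fields"]["a dropdown menu"]["options"]
```

This only helps when both sides are eLabFTW instances; other ELNs would see just the selected value in the flat PropertyValue.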

jmanideep commented 4 months ago

Here is an example of deeply nested metadata from Kadi4Mat.

In Kadi4Mat, the metadata can be organized using nested types (along with primitive data-types). The following nested value types are available:

Dictionary: A nested value that combines multiple metadata entries under a single key. In the example below, Instrument.manufacturer is a dictionary containing manufacturerName as a nested key.

List: A nested value similar to dictionaries, but without keys for the values. In the example below, Instrument.Detector is a list where each item is an entry without a key, referenced by its index.

[
    {
       "@type": "PropertyValue",
       "additionalType": "str",
       "description": "Name of the instrument",
       "identifier": "https://schema.org/name",
       "propertyID": "Instrument.name",
       "value": "SEM"
    },
    {
       "@type": "PropertyValue",
       "additionalType": "str",
       "propertyID": "Instrument.manufacturer.manufacturerName",
       "value": null
    },
    {
       "@type": "PropertyValue",
       "additionalType": "float",
       "propertyID": "Instrument.Settings.beam spot size",
       "value": 1.2,
       "unitText": "mm"
    },
    {
       "@type": "PropertyValue",
       "additionalType": "str",
       "propertyID": "Instrument.Detector.0",
       "value": "EDT"
    },
    {
       "@type": "PropertyValue",
       "additionalType": "str",
       "propertyID": "Instrument.Detector.1",
       "value": "CDEM"
    }
]

bronger commented 4 months ago

The receiving ELN will display to the user “Instrument.Settings.beam spot size: 1.2mm”?

jmanideep commented 4 months ago

If the receiving ELN doesn't support nested entries, the propertyID or name in the property values should be split at the preferred separator, for example like in the SampleDB metadata.
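For an ELN without nested entries, that could amount to something like the following Python sketch. The function names, the arrow used to join the path segments, and the choice to prefer name over propertyID when both are present are all assumptions made for illustration.

```python
def display_label(property_value, separator=".", joiner=" \u2192 "):
    """Return a human-readable label: the name if given, else the split propertyID."""
    label = property_value.get("name")
    if label:
        return label
    return joiner.join(property_value["propertyID"].split(separator))


def display_line(property_value, separator="."):
    """Format one PropertyValue as 'label: value unit' for a flat list view."""
    text = f"{display_label(property_value, separator)}: {property_value.get('value')}"
    unit = property_value.get("unitText")
    return f"{text} {unit}" if unit else text


beam_spot = {
    "@type": "PropertyValue",
    "propertyID": "Instrument.Settings.beam spot size",
    "value": 1.2,
    "unitText": "mm",
}
line = display_line(beam_spot)
```

With the Kadi4Mat entry above, line reads "Instrument → Settings → beam spot size: 1.2 mm".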

bronger commented 4 months ago

If an ELN finds the graph triple

http://institute.com/samples/3276
http://institute.com/methods/PDS/settings#beam%20spot%20size
1.2

it can also display that nicely to users.

I see the need for PropertyValue if you are forced to use only schema.org, but in RO-Crates, arbitrary vocabularies are allowed next to schema.org.

NicolasCARPi commented 4 months ago

@jmanideep can you provide a .eln with such metadata so we can test that easily?

jmanideep commented 4 months ago

Here is the ELN file instrument-used-in-experiment.zip from Kadi4Mat.

NicolasCARPi commented 4 months ago

@jmanideep don't you have a sha256 sum for attached files? Also, there is no Author node, is this expected?

jmanideep commented 4 months ago

I accidentally filtered out the author node during export, but in general, it will be there. Here is the updated file instrument-used-in-experiment.zip with the author node.

Regarding the sha256 checksum, we don't include it currently, as is also the case in our regular example.

FlorianRhiem commented 2 months ago

For future reference: During yesterday's meeting, we've agreed on using . as the separator for nested properties' propertyID values, as shown in the Kadi4Mat example above.
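With . as the separator, an importer that does support nesting could rebuild the structure roughly as follows. This Python sketch assumes that no identifier is both a leaf and a parent (for example a and a.b in the same export) and that purely numeric segments numbered from 0 denote list indices, as in the Kadi4Mat example; the function names are hypothetical.

```python
def unflatten(property_values, separator="."):
    """Rebuild nested dicts/lists from PropertyValue dicts with dotted propertyIDs."""
    root = {}
    for pv in property_values:
        *parents, leaf = pv["propertyID"].split(separator)
        node = root
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = pv["value"]
    return _index_dicts_to_lists(root)


def _index_dicts_to_lists(node):
    """Convert dicts whose keys are exactly 0..n-1 into lists, recursively."""
    if not isinstance(node, dict):
        return node
    converted = {key: _index_dicts_to_lists(value) for key, value in node.items()}
    if converted and all(key.isdecimal() for key in converted):
        by_index = {int(key): value for key, value in converted.items()}
        if len(by_index) == len(converted) and sorted(by_index) == list(range(len(by_index))):
            return [by_index[i] for i in range(len(by_index))]
    return converted


flat = [
    {"propertyID": "Instrument.name", "value": "SEM"},
    {"propertyID": "Instrument.Settings.beam spot size", "value": 1.2},
    {"propertyID": "Instrument.Detector.0", "value": "EDT"},
    {"propertyID": "Instrument.Detector.1", "value": "CDEM"},
]
nested = unflatten(flat)
```

One consequence of this convention is that keys which themselves contain a . cannot be distinguished from nesting, so exporters would need to avoid or escape them.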

NicolasCARPi commented 1 week ago

@FlorianRhiem do you wish to take a stab at adding a section in the SPECIFICATION about how we handle arbitrary metadata in a .eln? Mainly the point about the . separator, and that we must use the variableMeasured attribute with PropertyValue objects.