Open FlorianRhiem opened 11 months ago
In eLab, what you suggest would probably work well with the "Extra fields" feature: https://doc.elabftw.net/metadata.html#example-with-number-and-units.
And PropertyValue seems like a good fit.
How should we deal with nested metadata?
Flatten/normalize it.
Would you prefer to store flexible metadata outside the ro-crate-metadata.json entirely or in a custom format instead?
No, we should put as much as we can in the metadata.yml file, exactly as you're suggesting.
Unfortunately, I could not attend the latest video call, so I may not understand you correctly. But isn’t the actual problem for the receiving ELN to understand the semantics of those fields?
Unfortunately, I could not attend the latest video call, so I may not understand you correctly.
Prior to the meeting it had been suggested that we should pick an experiment that can be represented by an .eln file and imported/exported without significant loss of information. It should be similar when exported by all ELNs, which also has advantages such as making a comparison of the generated .eln files easier. We did not discuss much more than clarifying that this is a future goal. Storing flexible metadata about an experiment is a step in that direction.
But isn’t the actual problem for the receiving ELN to understand the semantics of those fields?
This may be splitting hairs, but I do not think the receiving ELN has to understand the semantics, rather the users of the receiving ELN have to understand them. This can be achieved by utilizing the propertyID property, which can make use of whatever ontology exists in the specific field and whatever names or labels are associated with a specific value. In addition, the measurementTechnique and measurementMethod properties might help give some context, where applicable.
If you don’t care about semantics, I don’t see the problem in just displaying the "data structure specific to that ELN" that most ELNs export on the receiving side.
I think the user experience is improved quite a bit if the ELN can display properties with identifiers the users can understand and values of a type the ELN knows how to display, rather than requiring that users read through various files in the hope of finding what they're looking for in a syntax and structure they know how to understand.
Ah, now I see the problem … thanks for the explanation!
(Note: I tried to post this earlier, but the comment seems to have been lost. Apologies if it ends up as a duplicate.)
Here is the current ELN example with added flexible metadata: sampledb_export_with_flexible_metadata.zip (renamed to zip so I can upload it in a GitHub comment)
I've implemented the export fairly directly. Here are a few examples for data types exported as PropertyValue, along with some thoughts on those:
{
"value": "OMBE-1",
"propertyID": "name",
"name": "Sample Name"
}
Text is directly supported for value, so this is straightforward and can also serve as the fallback datatype as long as the data has a text representation.
{
"value": false,
"propertyID": "checkbox",
"name": "Checkbox"
}
This is also directly supported and fairly simple to implement.
{
"value": 5.0,
"unitText": "\u00c5",
"propertyID": "multilayer.0.films.0.thickness",
"name": "Multilayers \u2192 0 \u2192 Films \u2192 0 \u2192 Film Thickness",
"unitCode": "A11"
}
While this is directly supported, I found UN/CEFACT codes a bit annoying to work with. Still, for those cases where a unit code does exist, they seem like they could genuinely avoid confusion between unit notations. This example also shows a fairly deeply nested property.
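As a rough sketch of the "unit code only where one exists" idea, an exporter could keep a small lookup table and fall back to unitText alone. The table and the helper name quantity_property are hypothetical; only the code A11 for ångström comes from the example above, the other two entries are added for illustration.

```python
# Hypothetical lookup from unit notation to UN/CEFACT common codes.
# Only "A11" appears in this thread; "MMT" and "CEL" are illustrative additions.
UNIT_CODES = {"\u00c5": "A11", "mm": "MMT", "\u00b0C": "CEL"}


def quantity_property(property_id, name, value, unit_text):
    """Build a PropertyValue dict, adding unitCode only when a code is known."""
    prop = {
        "@type": "PropertyValue",
        "propertyID": property_id,
        "name": name,
        "value": value,
        "unitText": unit_text,
    }
    code = UNIT_CODES.get(unit_text)
    if code is not None:
        prop["unitCode"] = code
    return prop


thickness = quantity_property(
    "multilayer.0.films.0.thickness", "Film Thickness", 5.0, "\u00c5"
)
# No code is known for a compound unit such as angstrom per second,
# so this property carries unitText only.
rate = quantity_property(
    "multilayer.0.films.0.elements.0.rate", "Rate", 0.1, "\u00c5/s"
)
```

Units without a known code simply keep their unitText, so the importer always has a readable fallback.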
{
"value": "2017-02-24 11:56:00",
"propertyID": "created",
"name": "Creation Datetime"
}
This is just the date and time in UTC as it is used in SampleDB, though it might make sense to use ISO 8601 notation with time zone offset and possibly microsecond precision, just in case either of those is needed. In that case, this would be:
{
"value": "2017-02-24T11:56:00.000000+00:00",
"propertyID": "created",
"name": "Creation Datetime"
}
There is no clear indication that this is a datetime instead of a text beyond the format used, so a suggestion for how to clearly denote this to be a datetime would be welcome. Then again, this is quite the unlikely format to be fulfilled by accident, so a regular expression (or just an attempt to parse it as a date) should suffice.
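The "just attempt to parse it" approach could look roughly like the sketch below. The helper name parse_datetime_value is made up for illustration, and it relies on Python's datetime.fromisoformat, which is more lenient than a strict regular expression would be.

```python
from datetime import datetime


def parse_datetime_value(value):
    """Return a datetime if value is an ISO 8601 string, otherwise None."""
    if not isinstance(value, str):
        # Numbers and booleans are never treated as datetimes.
        return None
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        # Plain text such as a sample name ends up here.
        return None


with_offset = parse_datetime_value("2017-02-24T11:56:00.000000+00:00")
without_offset = parse_datetime_value("2017-02-24 11:56:00")
plain_text = parse_datetime_value("OMBE-1")
```

One caveat of this heuristic: a text field that happens to contain only a date, such as "2024-07-14", would also be detected as a datetime, so an explicit type hint would still be more robust.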
{
"value": "./objects/1",
"propertyID": "sample",
"name": "Sample"
}
This uses the .eln internal @id value for the referenced object. As value cannot be a Dataset, it isn't a direct reference. The dataset's URL might be an alternative, which would also allow easily referencing objects that are not included in the .eln export. In this case, this might be:
{
"value": "http://localhost:5000/objects/1",
"propertyID": "sample",
"name": "Sample"
}
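On the importing side, the two variants behave differently: an internal @id can be looked up in the export, while a URL may point at something that is not included. A minimal sketch, assuming a heavily shortened and hypothetical @graph and a made-up helper name resolve_reference:

```python
def resolve_reference(graph, value):
    """Return the entity whose @id equals value, or None if it is not in the export."""
    by_id = {entity["@id"]: entity for entity in graph if "@id" in entity}
    return by_id.get(value)


# Hypothetical, heavily shortened @graph of an ro-crate-metadata.json
graph = [
    {"@id": "./objects/1", "@type": "Dataset", "name": "OMBE-1"},
    {"@id": "./objects/2", "@type": "Dataset", "name": "Measurement"},
]

internal = {"value": "./objects/1", "propertyID": "sample", "name": "Sample"}
external = {"value": "http://localhost:5000/objects/1", "propertyID": "sample", "name": "Sample"}

target = resolve_reference(graph, internal["value"])
missing = resolve_reference(graph, external["value"])
```

When the lookup returns None, an importer could still show the value as a plain link or text.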
I think those examples above all lack "@type": "PropertyValue". I can re-generate an example file with that fixed, if anyone wants to try importing the data.
How do we want to flatten the JSON and unflatten it afterwards? Do we want to use the right arrow that SampleDB is using (see the example below), or do we want to use '/', which I find more common?
I paste here the example of SampleDB so you don't have to search:
{
"value": "Seed Layer",
"propertyID": "multilayer.0.films.0.name",
"name": "Multilayers \u2192 0 \u2192 Films \u2192 0 \u2192 Film Name"
},
{
"value": "Fe",
"propertyID": "multilayer.0.films.0.elements.0.name",
"name": "Multilayers \u2192 0 \u2192 Films \u2192 0 \u2192 Elements \u2192 0 \u2192 Element Name"
},
{
"value": 0.09999999999999999,
"unitText": "\u00c5/s",
"propertyID": "multilayer.0.films.0.elements.0.rate",
"name": "Multilayers \u2192 0 \u2192 Films \u2192 0 \u2192 Elements \u2192 0 \u2192 Rate"
},
{
"value": 5.0,
"unitText": "\u00c5",
"propertyID": "multilayer.0.films.0.thickness",
"name": "Multilayers \u2192 0 \u2192 Films \u2192 0 \u2192 Film Thickness",
"unitCode": "A11"
},
For that, we also have to differentiate between propertyID, the internal identifier, and name, the display text. In that example, name uses the right arrow as a user-readable version, and propertyID uses a dot. The propertyID values in that example match the way these nested properties are also used for search in SampleDB; the name generation was simply copied from the Dataverse export for SampleDB. I think / would be fine, too. As long as it is clear for the users, I don't have a strong opinion there.
Generally, I don't like double properties: storing propertyID and name, which are almost identical (except for the capitalization and the separator symbol). Inconsistencies between both entries might then lead to strange behavior for the user. Uniqueness is important to me. I would suggest making propertyID required and the key that determines the hierarchy when unflattening. "name" is optional and just for display. Would that make sense?
I think that is how it is intended, propertyID as the "commonly used identifier for the characteristic represented by the property" and name just being the generic name of a Thing.
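Under that reading, an importer would key everything on propertyID and use name only as a label. A tiny sketch, with display_label as a hypothetical helper name:

```python
def display_label(prop):
    """Prefer the optional human-readable name, fall back to the required propertyID."""
    return prop.get("name") or prop["propertyID"]


with_name = {
    "propertyID": "multilayer.0.films.0.thickness",
    "name": "Multilayers \u2192 0 \u2192 Films \u2192 0 \u2192 Film Thickness",
    "value": 5.0,
}
without_name = {"propertyID": "multilayer.0.films.0.thickness", "value": 5.0}

label_a = display_label(with_name)
label_b = display_label(without_name)
```

Because the hierarchy is never derived from name, an inconsistent or missing name only affects what is displayed.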
I started implementing metadata in the ro-crate file.
I added an elabftw_all property that contains everything that cannot fit in the normal properties. This will allow me to reimport it without loss of information if the source is an eLab instance.
I'll probably remove that, and instead add elabftw_metadata to the Dataset as a string that I can import directly with no hassle.
/end edit
This introduces the concept of namespaced custom properties. Other ELNs must ignore them, and they should be avoided as much as possible.
OK, so instead I'll use elabftw_metadata as the propertyID of a PropertyValue.
Currently (WIP) this is what it looks like for this input form:
"variableMeasured": [
{
"propertyID": "elabftw_metadata",
"description": "eLabFTW metadata JSON as string",
"value": "{\"extra_fields\": {\"multi select\": {\"type\": \"select\", \"value\": \"Paris\", \"options\": [\"Paris\", \"Londres\", \"Tokyo\", \"Madrid\"], \"position\": 1, \"allow_multi_values\": true}, \"with comment\": {\"type\": \"text\", \"value\": \"yep\", \"position\": 0, \"readonly\": true, \"required\": true, \"description\": \"this is the description\", \"blank_value_on_duplicate\": true}, \"num with unit\": {\"type\": \"number\", \"unit\": \"unit 2\", \"units\": [\"unit 1\", \"unit 2\", \"unit 3\"], \"value\": \"23\", \"position\": 2, \"description\": \"yep\"}, \"a dropdown menu\": {\"type\": \"select\", \"value\": \"choice 2\", \"options\": [\"choice 1\", \"choice 2\", \"choice 3\"], \"position\": 3, \"required\": true, \"description\": \"this one does not allow multiple selection\", \"blank_value_on_duplicate\": true}, \"a straightforward text input\": {\"type\": \"text\", \"value\": \"it contains a text value\", \"position\": 4, \"description\": \"this is the default input\"}}}"
},
{
"propertyID": "multi select",
"value": "Paris",
"description": null,
"unitText": null,
"valueReference": "select"
},
{
"propertyID": "with comment",
"value": "yep",
"description": "this is the description",
"unitText": null,
"valueReference": "text"
},
{
"propertyID": "num with unit",
"value": "23",
"description": "yep",
"unitText": "unit 2",
"valueReference": "number"
},
{
"propertyID": "a dropdown menu",
"value": "choice 2",
"description": "this one does not allow multiple selection",
"unitText": null,
"valueReference": "select"
},
{
"propertyID": "a straightforward text input",
"value": "it contains a text value",
"description": "this is the default input",
"unitText": null,
"valueReference": "text"
}
]
},
Here is what it looks like currently:
"variableMeasured": [
{
"propertyID": "elabftw_metadata",
"description": "eLabFTW metadata JSON as string",
"value": "{\"elabftw\": {\"display_m...[skipped for brevity]..."
},
{
"propertyID": "Number",
"valueReference": "number",
"value": "",
"description": "no units"
},
{
"propertyID": "Type URL",
"valueReference": "url",
"value": "https://www.elabftw.net",
"description": "a link (readonly)"
},
{
"propertyID": "Just time",
"valueReference": "time",
"value": "17:00",
"description": "tea time"
},
{
"propertyID": "Some date",
"valueReference": "date",
"value": "2024-07-14",
"description": "is a date"
},
{
"propertyID": "Type user",
"valueReference": "users",
"value": 1,
"description": "this is a link to a user"
},
{
"propertyID": "A checkbox",
"valueReference": "checkbox",
"value": "on",
"description": "is checked"
},
{
"propertyID": "Email input",
"valueReference": "email",
"value": "louis@example.com",
"description": "type email"
},
{
"propertyID": "Date and time",
"valueReference": "datetime-local",
"value": "2024-07-14T13:37",
"description": "datetime description"
},
{
"propertyID": "Radio buttons",
"valueReference": "radio",
"value": "Oui",
"description": "radio description"
},
{
"propertyID": "Type resource",
"valueReference": "items",
"value": 208,
"description": "This is a link to a resource"
},
{
"propertyID": "A dropdown menu",
"valueReference": "select",
"value": "Choice 1",
"description": "Single select"
},
{
"propertyID": "Text input name",
"valueReference": "text",
"value": "some text",
"description": "type text + all attributes"
},
{
"propertyID": "Type experiment",
"valueReference": "experiments",
"value": 373,
"description": "This is a link to an experiment"
},
{
"propertyID": "Number with units",
"valueReference": "number",
"value": "",
"description": "this one has units",
"unitText": "mM"
},
{
"propertyID": "Unchecked checkbox",
"valueReference": "checkbox",
"value": "",
"description": "this one is not checked"
},
{
"propertyID": "Multi dropdown menu",
"valueReference": "select",
"value": "Option 1",
"description": "Allows multiple selection"
}
]
}
]
}
edit: realizing now that dropdown menus lose their other options...
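For an importer that does know the eLabFTW format, the lost options are still recoverable from the elabftw_metadata string. The sketch below assumes the extra_fields layout shown in the first example above; select_options is a hypothetical helper name and the data is shortened.

```python
import json


def select_options(variable_measured, field_name):
    """Look up the options of a select field in the elabftw_metadata JSON string."""
    for prop in variable_measured:
        if prop.get("propertyID") == "elabftw_metadata":
            metadata = json.loads(prop["value"])
            field = metadata.get("extra_fields", {}).get(field_name, {})
            return field.get("options", [])
    # No elabftw_metadata entry: the export came from another ELN.
    return []


# Shortened, hypothetical variableMeasured array
variable_measured = [
    {
        "propertyID": "elabftw_metadata",
        "value": json.dumps({
            "extra_fields": {
                "a dropdown menu": {
                    "type": "select",
                    "value": "choice 2",
                    "options": ["choice 1", "choice 2", "choice 3"],
                }
            }
        }),
    },
    {"propertyID": "a dropdown menu", "value": "choice 2", "valueReference": "select"},
]

options = select_options(variable_measured, "a dropdown menu")
```

Other ELNs would ignore the elabftw_metadata entry and only see the selected value.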
Here is an example of deeply nested metadata from Kadi4Mat.
In Kadi4Mat, the metadata can be organized using nested types (along with primitive data-types). The following nested value types are available:
Dictionary: A nested value that combines multiple metadata entries under a single key. In the example below, Instrument.manufacturer is a dictionary containing manufacturerName as a nested key.
List: A nested value similar to dictionaries, but without keys for the values. In the example below, Instrument.Detector is a list where each item is an entry without a key, referenced by its index.
[
{
"@type": "PropertyValue",
"additionalType": "str",
"description": "Name of the instrument",
"identifier": "https://schema.org/name",
"propertyID": "Instrument.name",
"value": "SEM"
},
{
"@type": "PropertyValue",
"additionalType": "str",
"propertyID": "Instrument.manufacturer.manufacturerName",
"value": null
},
{
"@type": "PropertyValue",
"additionalType": "float",
"propertyID": "Instrument.Settings.beam spot size",
"value": 1.2,
"unitText":"mm"
},
{
"@type": "PropertyValue",
"additionalType": "str",
"propertyID": "Instrument.Detector.0",
"value": "EDT"
},
{
"@type": "PropertyValue",
"additionalType": "str",
"propertyID": "Instrument.Detector.1",
"value": "CDEM"
}
]
propertyID: The name of the property, indicating the flattened dictionary with dot (.) as separator for keys. For list types, the index of the list item is appended after the separator. For example:
Instrument.name: Refers to the name key under the Instrument dictionary.
Instrument.Detector.0: Refers to the first item (index 0) in the Detector list within the Instrument dictionary.
additionalType: The data type of the value.
identifier: An IRI specifying an existing term that the metadata should represent.
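The flattening described above can be sketched as follows. This is not Kadi4Mat's actual export code; flatten and to_property_values are hypothetical names, and the nested input is reconstructed from the example for illustration.

```python
def flatten(metadata, prefix=""):
    """Yield (propertyID, value) pairs for a nested dict/list structure."""
    if isinstance(metadata, dict):
        items = metadata.items()
    elif isinstance(metadata, list):
        # List items have no key, so their index is used instead.
        items = enumerate(metadata)
    else:
        yield prefix, metadata
        return
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        yield from flatten(value, path)


def to_property_values(metadata):
    """Turn nested metadata into a flat list of PropertyValue dicts."""
    return [
        {"@type": "PropertyValue", "propertyID": property_id, "value": value}
        for property_id, value in flatten(metadata)
    ]


nested = {
    "Instrument": {
        "name": "SEM",
        "manufacturer": {"manufacturerName": None},
        "Settings": {"beam spot size": 1.2},
        "Detector": ["EDT", "CDEM"],
    }
}
properties = to_property_values(nested)
property_ids = [prop["propertyID"] for prop in properties]
```

Note that keys which themselves contain a dot would make the resulting propertyID ambiguous; this sketch does not handle that case.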
The receiving ELN will display to the user “Instrument.Settings.beam spot size: 1.2mm”?
If the receiving ELN doesn't support nested entries, the propertyID or name in the property values should be split at the preferred separator, for example like in the SampleDB metadata.
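Splitting at the separator also works in the other direction, for an ELN that does support nesting. A minimal sketch, assuming that no key contains the separator, that no propertyID is a prefix of another, and that purely numeric segments denote list indices (unflatten is a hypothetical helper name):

```python
def unflatten(property_values, separator="."):
    """Rebuild a nested structure from flat propertyID values."""
    root = {}
    for prop in property_values:
        *parents, leaf = prop["propertyID"].split(separator)
        node = root
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = prop["value"]
    return _lists_from_index_keys(root)


def _lists_from_index_keys(node):
    """Convert dicts whose keys are exactly "0".."n-1" into lists, recursively."""
    if not isinstance(node, dict):
        return node
    converted = {key: _lists_from_index_keys(value) for key, value in node.items()}
    indices = [str(i) for i in range(len(converted))]
    if converted and set(converted) == set(indices):
        return [converted[i] for i in indices]
    return converted


flat = [
    {"propertyID": "Instrument.name", "value": "SEM"},
    {"propertyID": "Instrument.Settings.beam spot size", "value": 1.2},
    {"propertyID": "Instrument.Detector.0", "value": "EDT"},
    {"propertyID": "Instrument.Detector.1", "value": "CDEM"},
]
nested = unflatten(flat)
```

An ELN without nested entries can skip the rebuild and just show the split segments as a path.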
If an ELN finds the graph triple
http://institute.com/samples/3276
http://institute.com/methods/PDS/settings#beam%20spot%20size
1.2
it can also display that nicely to users.
I see the need for PropertyValue if you are forced to use only schema.org, but in RO-Crates, arbitrary vocabularies are allowed next to schema.org.
@jmanideep can you provide a .eln with such metadata so we can test that easily?
Here is the ELN file instrument-used-in-experiment.zip from Kadi4Mat.
@jmanideep don't you have sha256 sum for attached files? Also, there is no Author node, is this expected?
I accidentally filtered out the author node during export, but in general, it will be there. Here is the updated file instrument-used-in-experiment.zip with the author node.
And regarding sha256 checksum, we don't include it currently, as is the case in our regular example.
For future reference: During yesterday's meeting, we agreed on using . as the separator for nested properties' propertyID values, as shown in the Kadi4Mat example above.
@FlorianRhiem do you wish to take a stab at adding a section in the SPECIFICATION about how we handle arbitrary metadata in a .eln? Mainly the point about the . separator, and that we must use the variableMeasured attribute with PropertyValue objects.
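To make those two points concrete, an exporter could attach its flat properties to a Dataset roughly like this. The helper name attach_properties and the shortened Dataset are hypothetical, and the PropertyValue objects are embedded inline as in the examples in this thread.

```python
def attach_properties(dataset, properties):
    """Return a copy of a Dataset entity with the properties under variableMeasured."""
    # Make sure every entry is marked as a PropertyValue.
    typed = [{"@type": "PropertyValue", **prop} for prop in properties]
    existing = dataset.get("variableMeasured", [])
    return {**dataset, "variableMeasured": existing + typed}


# Hypothetical, shortened Dataset entity
dataset = {"@id": "./objects/1", "@type": "Dataset", "name": "OMBE-1"}
properties = [
    {"propertyID": "multilayer.0.films.0.name", "value": "Seed Layer"},
    {"propertyID": "multilayer.0.films.0.thickness", "value": 5.0, "unitText": "\u00c5"},
]
exported = attach_properties(dataset, properties)
```

The original Dataset dict is left unchanged, which keeps the sketch easy to test.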
Currently, the .eln format does not have a unified way of exporting flexible metadata. Instead, most ELNs export a data structure specific to that ELN in JSON format which contains various information about a dataset, including some flexible metadata. As they are (mostly) representations of internal models, they vary quite a bit.
Motivation
While it is already useful to be able to reference samples, measurements and other objects from other ELNs with some generic metadata such as the creation and modification times and the author, it would be even better if we could exchange flexible metadata about these. In the last meeting, we briefly discussed the goal of a "gold standard" experiment that can be represented as an .eln file, imported and exported by the various ELNs. For this, we should be able to exchange information such as instrument or process parameters. We cannot expect to strictly define these, instead they should map a (textual) identifier to data of some type.
Ideas / Suggestions
For mapping identifiers to values, the PropertyValue should be useful, as it can map its propertyID to its value, which can be a boolean, text, a number or the generic StructuredValue type, and also supports units and a human-readable version as a fallback. So, if we had to store a temperature with the identifier target_temperature, it could be represented as a PropertyValue with a numeric value and a unit, while a boolean instrument setting could be represented as a PropertyValue with a boolean value.
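As an illustration of those two cases, the sketch below builds one PropertyValue for a temperature and one for a boolean setting and serializes them to JSON. Apart from the identifier target_temperature, all names and values (the 300.0 K setpoint, the shutter_open setting) are made up for this example.

```python
import json

# Hypothetical temperature setpoint with unit and human-readable name
target_temperature = {
    "@type": "PropertyValue",
    "propertyID": "target_temperature",
    "name": "Target Temperature",
    "value": 300.0,
    "unitText": "K",
}

# Hypothetical boolean instrument setting
shutter_open = {
    "@type": "PropertyValue",
    "propertyID": "shutter_open",
    "name": "Shutter Open",
    "value": True,
}

serialized = json.dumps([target_temperature, shutter_open], indent=2)
round_trip = json.loads(serialized)
```

Both the number and the boolean survive the JSON round trip with their types intact, so the receiving ELN can pick a suitable display for each.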
If we provided an array of such property values, we could support a flat mapping of identifiers to values. Such an array is part of the Dataset type we use for datasets in the ro-crate-metadata.json, in the property variableMeasured. However, this use case goes beyond the "variables that are measured in some dataset". So we could either use that property and "stretch" its definition by a fair bit, use another existing property, or branch off there and define a custom property.
As PropertyValue objects can contain a value of StructuredValue type, of which PropertyValue is a sub-type, it might also be possible to implement nested data structures like this. Alternatively, the structure could be represented in the identifier.
What are your thoughts on storing flexible metadata in PropertyValue objects? Which solution for attaching these to the datasets do you prefer? How should we deal with nested metadata? Would you prefer to store flexible metadata outside the ro-crate-metadata.json entirely or in a custom format instead?