Open ramonawalls opened 3 years ago
is this really particular to mixs-rdf or a general mixs question?
For anything other than 3 there would need to be a migration plan for existing sample metadata, e.g. in INSDC.
Having a fixed unit is only feasible for some fields that are usually highly standardized, like depth in meters. But there will inevitably be fields where freedom on units will be required (like gram per liter vs. gram per gram soil, which cannot be converted one into the other).
For option 3: this carries a high chance of typos in the files provided by users, and a lot of effort and difficulty parsing the files for people who want to use the data.
Option 4 sounds better, but what about solving this problem the way the DarwinCore archive structure does: by using multiple files? A core file could be the MIxS file with the (meta)data, and an extension could be a list of all the fields in the core file with their unit or a variable that describes the value of that field (e.g. boolean, alphanumeric, ...).
> Having a fixed unit is only feasible for some fields that are usually highly standardized, like depth in meters. But there will inevitably be fields where freedom on units will be required (like gram per liter vs. gram per gram soil, which cannot be converted one into the other).
good point. So 1 or 2 could not be adapted across all fields. A possibility is a hybrid, with some fields like depth following 1 or 2, and others following 3 or 4. But I think this would be confusing.
> For option 3: this carries a high chance of typos in the files provided by users, and a lot of effort and difficulty parsing the files for people who want to use the data.
Option 3 is the status quo, and indeed we see lots of junk in submitted data. But I'm not sure this is solved by any of the other options. If people are going to enter junk, they will enter junk.
If it is entered correctly, it is not so hard to parse {float} {unit} into a normalized representation, and this could be done centrally.
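As a rough illustration of how little code the well-formed case needs, here is a minimal sketch of a strict parser for {float} {unit}. The names `Measurement` and `parse_measurement` are made up for this example, and the regex is only one possible reading of the {float} {unit} syntax (a number, exactly one space, a unit token), not something taken from MIxS itself.

```python
import re
from typing import NamedTuple


class Measurement(NamedTuple):
    value: float
    unit: str


# Assumed strict reading of "{float} {unit}":
# a decimal number, exactly one space, then a unit token without whitespace.
_FLOAT_UNIT = re.compile(r"([+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?) (\S+)")


def parse_measurement(raw: str) -> Measurement:
    """Split a raw field value into a numeric value and a unit string."""
    match = _FLOAT_UNIT.fullmatch(raw)
    if match is None:
        raise ValueError(f"not in '{{float}} {{unit}}' form: {raw!r}")
    return Measurement(float(match.group(1)), match.group(2))
```

With this sketch, `parse_measurement("100 m")` gives a value of 100.0 with unit "m", while anything that deviates from the pattern (such as "100m") raises a `ValueError` instead of being silently guessed at.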
> what about solving this problem the way the DarwinCore archive structure does: by using multiple files? A core file could be the MIxS file with the (meta)data, and an extension could be a list of all the fields in the core file with their unit or a variable that describes the value of that field (e.g. boolean, alphanumeric, ...).
Sorry, I'm not totally following what you mean. Can you give a specific example, e.g. a water sample taken at 100m?
> If it is entered correctly it is not so hard to parse {float} {unit} into a normalized representation, and this could be done centrally
I'm a bit afraid that with this option the separator between {float} and {unit} may be a problem: some will use a space, others may use a tab, and some people don't put a separator in between. Also, having special characters (e.g. "\" and "-" when people abbreviate units) in between numeric measurements can cause trouble in software commonly used to store or analyze the data locally (e.g. Excel, R, ...).
> Sorry, I'm not totally following what you mean. Can you give a specific example, e.g. a water sample taken at 100m?
In DarwinCore, there is a central data file with an ID per sample, and extension files with additional data for that sample (e.g. the depth measurement). Combined, these files are called a DarwinCore archive. I was thinking along these lines for the units in MIxS: have your central data file with the water sample and the measurement values linked to MIxS terms (e.g. project_name= prj_1, lat_lon= 66.4 123.7 and depth= 100). Associated with that central file, have a 2-column extension file with the MIxS terms that were used (so: project_name, lat_lon and depth) and their respective units (in this example: alphanumeric, decimal degree and meter).
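To make the core-plus-extension idea concrete, here is a small sketch that joins the two files from the example above. The TSV layout, the `sample_id`, `term` and `unit` column names, and the `attach_units` helper are all assumptions made for illustration; they are not part of MIxS or of the DarwinCore archive specification.

```python
import csv
import io

# Core file: one row per sample, values only (hypothetical layout).
CORE_TSV = (
    "sample_id\tproject_name\tlat_lon\tdepth\n"
    "SAM1\tprj_1\t66.4 123.7\t100\n"
)

# Extension file: one row per MIxS term used in the core file, with its unit.
UNITS_TSV = (
    "term\tunit\n"
    "project_name\talphanumeric\n"
    "lat_lon\tdecimal degree\n"
    "depth\tmeter\n"
)


def read_tsv(text):
    """Read tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def attach_units(core_rows, unit_rows, id_column="sample_id"):
    """Pair every core value with the unit declared in the extension file."""
    units = {row["term"]: row["unit"] for row in unit_rows}
    annotated = []
    for row in core_rows:
        record = {id_column: row[id_column]}
        for term, value in row.items():
            if term != id_column:
                # Terms missing from the extension file get unit None.
                record[term] = {"value": value, "unit": units.get(term)}
        annotated.append(record)
    return annotated


samples = attach_units(read_tsv(CORE_TSV), read_tsv(UNITS_TSV))
```

Here `samples[0]["depth"]` ends up as the value "100" paired with the unit "meter", so the unit never has to appear inside the measurement column itself.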
> I'm a bit afraid for this option the separator between {float} and {unit} may be a problem: some will use a space, others may use a tab, some people don't put a separator in between...
Indeed! If you look at what is currently in NCBI/EBI BioSamples for the 'depth' field you will find things like:
These are examples of the actual raw value for the depth field
In principle it is possible to detect these errors upstream, at time of submission, and suggest repairs, etc. MIxS provides regex-like structures for every field. In practice this has not been done.
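A sketch of what such a submission-time check could look like follows. It assumes the value syntax string (e.g. "{float} {unit}") can be turned into a regex by substituting a fragment for each placeholder. The fragments, the function names, and the single repair rule (normalize the whitespace between number and unit) are illustrative assumptions, not existing MIxS or BioSamples tooling.

```python
import re

# Assumed regex fragments for two value-syntax placeholders.
PLACEHOLDERS = {
    "{float}": r"[+-]?(?:\d+\.?\d*|\.\d+)",
    "{unit}": r"[^\s\d]\S*",  # a unit token must not start with a digit
}

# Used only to propose a repair: number, any whitespace, then the rest.
_LOOSE = re.compile(r"^\s*([+-]?(?:\d+\.?\d*|\.\d+))\s*(\S+)\s*$")


def syntax_to_regex(value_syntax):
    """Compile a value-syntax template such as '{float} {unit}' to a regex."""
    pattern = re.escape(value_syntax)
    for placeholder, fragment in PLACEHOLDERS.items():
        pattern = pattern.replace(re.escape(placeholder), fragment)
    return re.compile(pattern)


def check_value(raw, value_syntax="{float} {unit}"):
    """Return (is_valid, suggested_repair); the repair is None if none is found."""
    regex = syntax_to_regex(value_syntax)
    if regex.fullmatch(raw):
        return True, None
    repaired = _LOOSE.sub(r"\1 \2", raw)
    if regex.fullmatch(repaired):
        return False, repaired
    return False, None
```

A submission form could call `check_value` on each field and show the suggested repair to the submitter, rather than silently accepting or rewriting the value.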
But I am afraid that we'd still get junk if we used a different schema too
> Also having special characters (e.g. "\" and "-" when people abbreviate units) in between numeric measurements is something that can cause trouble in commonly used software to locally store or analyze the data (e.g. excel, R,...)
True, but perhaps we should treat exchange format and analysis format as separate concerns.
IF (and this is a big if) we can have field values adhere to standards such as {float} {unit}, then we can have easy to use, fast, simple tools that will convert depth to depth_in_meters (where meter is the preferred unit; alternatively to a sane unit of choice).
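For instance, a tool of that kind could be as small as the sketch below. It assumes the value really is in {float} {unit} form with a single space, and the unit table is a hypothetical, deliberately short list of length units rather than anything defined by MIxS.

```python
# Hypothetical conversion factors from a unit symbol to meters.
TO_METERS = {"m": 1.0, "cm": 0.01, "mm": 0.001, "km": 1000.0}


def depth_in_meters(raw):
    """Convert a well-formed '{float} {unit}' depth string to meters."""
    value_text, unit = raw.split(" ")  # ValueError unless exactly two parts
    if unit not in TO_METERS:
        raise ValueError(f"unknown length unit: {unit!r}")
    return float(value_text) * TO_METERS[unit]
```

An analysis-side column such as depth_in_meters can then be derived from the exchange-format depth column in one pass, which keeps the two concerns separate.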
> In DarwinCore, there is a central data file with an ID per sample, and extension files with additional data for that sample (e.g. the depth measurement). Combined, these files are called a DarwinCore archive. I was thinking along these lines for the units in MIxS: have your central data file with the water sample and the measurement values linked to MIxS terms (e.g. project_name= prj_1, lat_lon= 66.4 123.7 and depth= 100). Associated with that central file, have a 2-column extension file with the MIxS terms that were used (so: project_name, lat_lon and depth) and their respective units (in this example: alphanumeric, decimal degree and meter).
I see. I think this makes sense for DarwinCore, where you have more control over the information ecosystem.
My impression for MIxS (and I am new and not an authority) is that we have less control, and that the majority of users work with Excel files and no separate data dictionary, so our best bet is to standardize the columns. But others with more history on the project and experience of MIxS, DarwinCore and other systems may have other perspectives.
I think ideally we would have a native json representation where measurements are fully normalized, and there are optional additional fields for precision, provenance, and so on. This is actually the approach we take for NMDC, where we may have
{id: SAM123,
depth: {raw_value: "1000 m",
unit: "m", ### this would map to UO meters via a JSON-LD context
value: 1000,
precision: ..., ## we don't actually do this but it would be easy to extend this way
was_generated_by: .... ## prov process, e.g. using ORNL Identify tool, or assigned by a metadata curator, ...
},
...
}
Of course, we still live in a world of spreadsheets, and this is what users like, but we would have a well-defined mapping between a spreadsheet/TSV representation and the structured representation.
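A minimal sketch of such a mapping, going in both directions, might look like this. It reuses the id, raw_value, unit and value keys from the example record above; the function names and the choice to treat `depth` as the only measurement column are assumptions for illustration, not the actual NMDC implementation.

```python
def row_to_record(row, measurement_columns=("depth",)):
    """Map a flat spreadsheet row (a dict of strings) to a structured record."""
    record = {"id": row["id"]}
    for column in measurement_columns:
        raw = row[column]
        # Assumes a well-formed '{float} {unit}' cell.
        value_text, _, unit = raw.partition(" ")
        record[column] = {
            "raw_value": raw,  # keep what the submitter typed
            "unit": unit,
            "value": float(value_text),
        }
    return record


def record_to_row(record, measurement_columns=("depth",)):
    """Map a structured record back to a flat spreadsheet row."""
    row = {"id": record["id"]}
    for column in measurement_columns:
        row[column] = record[column]["raw_value"]
    return row
```

Because the raw value is carried along in the structured form, converting a row to a record and back returns the original spreadsheet cell unchanged.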
Moving this issue to the main MIxS repo.
Insofar as a user supplies a unit, rather than it being baked in, can we give better guidance, formalize this more in the schema, and make better use of international standards?
For example, right now we recommend "meter" as the unit for depth.
e.g
air | alt | altitude | Altitude is a term used to identify heights of objects such as airplanes, space shuttles, rockets, atmospheric balloons and heights of places such as atmospheric layers and clouds. It is used to measure the height of an object which is above the earth’s surface. In this context, the altitude measurement is the vertical distance between the earth's surface above sea level and the sampled position in the air | measurement value | {float} {unit} | 100 meter | M | meter | 1 | 0 | MIXS:0000094 |
---|
Why is this? Why not use SI or UCUM standard units such as "m"?
Recommending 'meter' and the like will lead to orthographic errors e.g. meters, metres, metre, meter. It is not very international as other languages will have different labels.
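To illustrate the clean-up burden that spelled-out unit names create, here is a sketch that folds the spelling variants listed above onto a single symbol. The lookup tables are hypothetical and cover only a couple of length units; "m" and "cm" are the UCUM codes for meter and centimeter, but the helper `to_ucum` is made up for this example.

```python
# Symbols accepted as-is. UCUM codes are case-sensitive, so no lowercasing here.
UCUM_SYMBOLS = {"m", "cm"}

# Hypothetical table of spelled-out English names (singular, lowercase).
NAME_TO_UCUM = {
    "meter": "m",
    "metre": "m",
    "centimeter": "cm",
    "centimetre": "cm",
}


def to_ucum(label):
    """Return the unit symbol for a label, or None if it is not recognized."""
    label = label.strip()
    if label in UCUM_SYMBOLS:
        return label
    name = label.lower()
    if name.endswith("s"):  # fold plurals: "meters" -> "meter"
        name = name[:-1]
    return NAME_TO_UCUM.get(name)
```

Every language and spelling convention would need its own entries in a table like this, which is the maintenance cost that a symbol-based recommendation avoids.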
It seems that most people have ignored MIxS here anyway. When I look at depth in INSDC, I see ~3k samples have a depth field of the form "number m"; ~1k of the form "number cm". In contrast there are ~300 samples using some variant of (meter|metre)[s].
@kaiam should we include a reference to the unit URIs you have been working on?
Sorry, I didn't see this before; I wasn't tagged properly. See my comments in https://github.com/GenomicsStandardsConsortium/mixs/issues/154.
Some options:
Would be great to be consistent across all terms.