How to include units in MIxS terms

ramonawalls commented 3 years ago

Some options:

1. Specify the unit and include it in the term label.
2. Specify the unit but don't include it in the term label; include it in the definition.
3. Allow different units and include the unit in the value.
4. Allow different units and have a separate unit field.

Would be great to be consistent across all terms.

cmungall commented 3 years ago

is this really particular to mixs-rdf or a general mixs question?

1 or 2: I floated this in https://github.com/GenomicsStandardsConsortium/mixs/issues/56 but it was closed.
- For 1 I think you mean include in label AND definition
- I think 2 would be incredibly dangerous
3: this is the status quo, correct?

For anything other than 3 there would need to be a migration plan for existing sample metadata, e.g. in INSDC.

msweetlove commented 3 years ago

Having a fixed unit is only feasible for some fields that are usually highly standardized, like depth in meters. But there will inevitably be fields where freedom on units will be required (like gram per liter vs. gram per gram soil, which cannot be converted one into the other).

for option 3 => this may entail a high chance of typos in the files provided by users, and a lot of effort/difficulty in parsing the files for people who want to use the data.

option 4 sounds better, but what about solving this problem like they did with the DarwinCore archive structure: by using multiple files. A core file could be the MIxS file with the (meta-) data, and an extension could be a list of all the fields in the core file with their unit or a variable that describes the value of that field (e.g. boolean, alphanumerical,...).

cmungall commented 3 years ago

Having a fixed unit is only feasible for some fields that are usually highly standardized, like depth in meters. But there will inevitably be fields where freedom on units will be required (like gram per liter vs. gram per gram soil, which cannot be converted one into the other).

Good point. So 1 or 2 could not be adopted across all fields. A possibility is a hybrid, with some fields like depth following 1 or 2, and others following 3 or 4. But I think this would be confusing.

for option 3 => this may entail a high chance of typos in the files provided by users, and a lot of effort/difficulty in parsing the files for people who want to use the data.

Option 3 is the status quo, and indeed we see lots of junk in submitted data. But I'm not sure this is solved by any of the other options. If people are going to enter junk, they will enter junk.

If it is entered correctly it is not so hard to parse {float} {unit} into a normalized representation, and this could be done centrally

what about solving this problem like they did with the DarwinCore archive structure: by using multiple files. A core file could be the MIxS file with the (meta-) data, and an extension could be a list of all the fields in the core file with their unit or a variable that describes the value of that field (e.g. boolean, alphanumerical,...).

Sorry, I'm not totally following what you mean. Can you give a specific example, e.g. a water sample taken at 100 m?

msweetlove commented 3 years ago

If it is entered correctly it is not so hard to parse {float} {unit} into a normalized representation, and this could be done centrally

I'm a bit afraid that with this option the separator between {float} and {unit} may be a problem: some will use a space, others may use a tab, and some people don't put a separator in between at all. Also, having special characters (e.g. "\" and "-" when people abbreviate units) in between numeric measurements can cause trouble in software commonly used to store or analyze the data locally (e.g. Excel, R, ...).

sorry, I'm not totally following what you mean. Can you give a specific example, e.g a water sample taken at 100m?

In DarwinCore, there is a central data file with an ID per sample, and extension files with additional data for that sample (e.g. the depth measurement). Combined, these files are called a DarwinCore archive. I was thinking along these lines for the units in MIxS: have a central data file with the water sample and the measurement values linked to MIxS terms (e.g. project_name= prj_1, lat_lon= 66.4 123.7 and depth= 100). Associated with that central file, have a 2-column extension file with the MIxS terms that were used (so: project_name, lat_lon and depth) and their respective units (in this example: alphanumeric, decimal degree and meter).
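
To make the core-plus-extension idea concrete, here is a minimal Python sketch. The file contents, the column names (`sample_id`, `term`, `unit`) and the `annotate` helper are all assumptions made up for illustration; neither MIxS nor DarwinCore defines them in this form.

```python
import csv
import io

# Hypothetical core file: one row per sample, bare values only.
CORE_TSV = (
    "sample_id\tproject_name\tlat_lon\tdepth\n"
    "SAM1\tprj_1\t66.4 123.7\t100\n"
)

# Hypothetical 2-column extension file: MIxS term -> unit or value type.
UNITS_TSV = (
    "term\tunit\n"
    "project_name\talphanumeric\n"
    "lat_lon\tdecimal degree\n"
    "depth\tmeter\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def annotate(core_text, units_text):
    """Attach the unit declared in the extension file to every core value."""
    units = {row["term"]: row["unit"] for row in read_tsv(units_text)}
    return [
        # Terms missing from the extension file get unit None.
        {term: {"value": value, "unit": units.get(term)} for term, value in row.items()}
        for row in read_tsv(core_text)
    ]


samples = annotate(CORE_TSV, UNITS_TSV)
print(samples[0]["depth"])  # {'value': '100', 'unit': 'meter'}
```

The values in the core file stay purely numeric or textual, so they load cleanly into a spreadsheet or R, while the unit is looked up once per column rather than repeated per cell.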

cmungall commented 3 years ago

I'm a bit afraid for this option the separator between {float} and {unit} may be a problem: some will use a space, others may use a tab, some people don't put a separator in between...

Indeed! If you look at what is currently in NCBI/EBI BioSamples for the 'depth' field you will find things like:

N40.1164_W88.2543
25 santimeters
0 – 20 cm
3.149
30-60cm replicate6
1800, 1800
30ft
5m, 32m, 70m, 110m, 200m, 320m, 1000m
Surface soil from deep water
0 m water depth
Metamorph4 (19dpf) biological replicate 3

These are examples of the actual raw value for the depth field

In principle it is possible to detect these errors upstream, at time of submission, and suggest repairs, etc. MIxS provides regex-like structures for every field. In practice this has not been done.
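
As a rough illustration of what such an upstream check could look like, the Python sketch below tests raw values against one possible reading of the `{float} {unit}` structure. The regular expression is an assumption of mine, not the official MIxS pattern, and `looks_like_float_unit` is a hypothetical helper name.

```python
import re

# Assumed reading of "{float} {unit}": a number, exactly one space, then a
# unit token that does not start with a digit. Not the official MIxS regex.
FLOAT_UNIT = re.compile(r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)? [^\s\d]\S*")


def looks_like_float_unit(raw):
    """Return True if the whole raw value has the shape '<number> <unit>'."""
    return FLOAT_UNIT.fullmatch(raw) is not None


# A few of the raw depth values quoted above, plus one well-formed value.
for raw in ["100 m", "30ft", "3.149", "0 m water depth", "25 santimeters"]:
    print(repr(raw), looks_like_float_unit(raw))
```

Note that "25 santimeters" passes: a purely syntactic check catches missing separators and trailing free text, but not misspelled units. Catching those would need a controlled list of unit strings.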

But I am afraid that we'd still get junk if we used a different schema too

Also having special characters (e.g. "\" and "-" when people abbreviate units) in between numeric measurements is something that can cause trouble in commonly used software to locally store or analyze the data (e.g. excel, R,...)

True, but perhaps we should treat exchange format and analysis format as separate concerns.

IF (and this is a big if) we can have field values adhere to standards such as {float} {unit} then we can have easy to use, fast, simple tools that will

replace a field such as depth with depth_in_meters (where meter is the preferred unit; alternatively, a sane unit of choice)
- e.g. "1000 m" -> 1000
- e.g. "1 km" -> 1000
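
A minimal Python sketch of such a tool, assuming values already follow `{float} {unit}`. The conversion table and the function name `depth_in_meters` are illustrative assumptions; a real tool would take its unit list from a standard rather than a hand-written dict.

```python
import re

# Assumed conversion factors to meters, for illustration only.
TO_METERS = {"m": 1.0, "km": 1000.0, "cm": 0.01, "mm": 0.001}

# A number, one space, then a unit token.
VALUE_UNIT = re.compile(r"([-+]?\d+(?:\.\d+)?) (\S+)")


def depth_in_meters(raw):
    """Convert a '{float} {unit}' string to a plain number of meters."""
    match = VALUE_UNIT.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"not of the form '<number> <unit>': {raw!r}")
    number, unit = match.groups()
    if unit not in TO_METERS:
        raise ValueError(f"unknown unit {unit!r} in {raw!r}")
    return float(number) * TO_METERS[unit]


print(depth_in_meters("1000 m"))  # 1000.0
print(depth_in_meters("1 km"))    # 1000.0
```

Values that do not fit the pattern raise an error instead of being guessed at, so junk is surfaced rather than silently converted.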

In DarwinCore, there is a central data file with an ID per sample, and extension files with additional data for that sample (e.g. the depth measurement). Combined, these files are called a DarwinCore archive. I was thinking along these lines for the units in MIxS: have a central data file with the water sample and the measurement values linked to MIxS terms (e.g. project_name= prj_1, lat_lon= 66.4 123.7 and depth= 100). Associated with that central file, have a 2-column extension file with the MIxS terms that were used (so: project_name, lat_lon and depth) and their respective units (in this example: alphanumeric, decimal degree and meter).

I see. I think this makes sense for DarwinCore, where you have more control over the information ecosystem.

My impression for MIxS (and I am new and not an authority) is that we have less control, and that the majority of users work with Excel files and no separate data dictionary, so our best bet is to standardize the columns. But others with more history on the project and experience of MIxS, DarwinCore and other systems may have other perspectives.

I think ideally we would have a native JSON representation where measurements are fully normalized, and there are optional additional fields for precision, provenance, and so on. This is actually the approach we take for NMDC, where we may have

{id: SAM123,
 depth: {raw_value: "1000 m",
         unit: "m",               # this would map to UO meters via a JSON-LD context
         value: 1000,
         precision: ...,          # we don't actually do this but it would be easy to extend this way
         was_generated_by: ....   # prov process, e.g. using ORNL Identify tool, or assigned by a metadata curator, ...
        },
 ...
}

of course, we still live in a world of spreadsheets, this is what users like, but we would have a well-defined mapping between a spreadsheet/tsv representation and the structured representation.
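
One way such a mapping could look, as a hedged Python sketch: the field names `raw_value`, `unit` and `value` follow the example above, while the helpers `to_structured` and `row_to_sample` and the sample TSV are hypothetical. This is not the actual NMDC code.

```python
import csv
import io
import json
import re

# A number, one space, then a unit token.
QUANTITY = re.compile(r"([-+]?\d+(?:\.\d+)?) (\S+)")


def to_structured(raw):
    """Wrap a spreadsheet cell; add value/unit only when it parses cleanly."""
    record = {"raw_value": raw}
    match = QUANTITY.fullmatch(raw.strip())
    if match:
        record["value"] = float(match.group(1))
        record["unit"] = match.group(2)
    return record


def row_to_sample(row, quantity_fields):
    """Map one spreadsheet row to a structured sample record."""
    sample = {"id": row["id"]}
    for field in quantity_fields:
        if row.get(field):
            sample[field] = to_structured(row[field])
    return sample


TSV = "id\tdepth\nSAM123\t1000 m\n"
rows = csv.DictReader(io.StringIO(TSV), delimiter="\t")
samples = [row_to_sample(row, ["depth"]) for row in rows]
print(json.dumps(samples[0]))
```

Keeping `raw_value` alongside the parsed fields means nothing the submitter typed is lost, even when the cell cannot be normalized.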

See MIxS ObjectProperties proposal

ramonawalls commented 3 years ago

Moving this issue to the main MIxS repo.

cmungall commented 3 years ago

Insofar as a user supplies a unit, rather than it being baked in, can we give better guidance, formalize this more in the schema, and make better use of international standards?

For example, right now we recommend "meter" as the unit for depth.

e.g.

air	alt	altitude	Altitude is a term used to identify heights of objects such as airplanes, space shuttles, rockets, atmospheric balloons and heights of places such as atmospheric layers and clouds. It is used to measure the height of an object which is above the earth's surface. In this context, the altitude measurement is the vertical distance between the earth's surface above sea level and the sampled position in the air	measurement value	{float} {unit}	100 meter	M	meter	1	0	MIXS:0000094

Why is this? Why not use SI or UCUM standard units such as "m"?

Recommending 'meter' and the like will lead to orthographic errors e.g. meters, metres, metre, meter. It is not very international as other languages will have different labels.

It seems that most people have ignored MIxS here anyway. When I look at depth in INSDC, I see ~3k samples have a depth field of the form number m; ~1k of the form number cm. In contrast there are ~300 samples using some variant of (meter|metre)[s].

@kaiam should we include a reference to the unit URIs you have been working on?

kaiiam commented 3 years ago

Sorry, I didn't see this before because I wasn't tagged properly; see my comments in https://github.com/GenomicsStandardsConsortium/mixs/issues/154.

GenomicsStandardsConsortium / mixs

How to include units in MIxS terms #93