GenomicsStandardsConsortium / mixs

“Minimum Information about any (X) Sequence” (MIxS) specification
https://w3id.org/mixs
Creative Commons Zero v1.0 Universal

Include units in the column header #14

Closed cuttlefishh closed 2 months ago

cuttlefishh commented 6 years ago

Units are expected to be included in the same field as the values, according to the “Value Syntax” field of the MIxS standard. If this were actually followed, it would make processing of spreadsheets tricky. In the majority of records, this instruction is not followed, and no units are provided at all. I (Luke) would like to see units in the column header, e.g. 'latitude_deg', 'temperature_deg_c', 'phosphate_umol_per_l', 'salinity_psu'. This marries units to the datum so they persist across data processing, including in plots of values. It also avoids inappropriate combination of data from different studies reported with different units, a peril of meta-analysis. Pier points out that separate unit and datum fields would allow units to be controlled by the Unit Ontology. Both approaches could be implemented.

cuttlefishh commented 6 years ago

If there is interest in changing this standard, perhaps there are only certain columns where our community can agree on standard units. I would suggest the following columns would be good candidates:

only1chunts commented 6 years ago

As a user of the current standards, I tend not to include units within the value field; in fact, in GigaDB we have a separate column for the unit, to be specified using UO (it rarely gets used, but it's there!). I think appending the unit name to the term name is not particularly in keeping with how terms should be named, but it is a practical solution to a problem for many consumers of the metadata. One thing to discuss: do we make multiple terms available for those wishing to use different units, or do we force everyone to use one particular unit (i.e. make the user convert from the measured value into the reporting value)? For us (GigaDB), the intention was to implement a conversion tool for the most common fields within the submission system that would allow users to input values in whatever units they like (specifying them), and we would auto-convert to the preferred storage units, but we haven't got around to that yet! Finally, with lat and long, just defining "deg" is almost as ambiguous as having nothing: degrees can be given in minutes and seconds, or in decimal degrees. I would suggest we append dd (for decimal degrees) instead of deg.
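To make the conversion idea concrete, here is a minimal Python sketch of what such a tool might do. It is not GigaDB's implementation (the comment above says none exists yet); the `TO_PREFERRED` table, the function names, and the choice of meters as the preferred depth unit are all assumptions made for illustration. The second function shows why "decimal degrees" is unambiguous where "degrees" is not: a degrees/minutes/seconds reading has to be converted before it can be compared with a decimal one.

```python
# Hypothetical conversion table: (field, reported unit) -> (preferred unit, factor).
# Fields, units, and factors are illustrative only, not taken from MIxS or GigaDB.
TO_PREFERRED = {
    ("depth", "centimeter"): ("meter", 0.01),
    ("depth", "meter"): ("meter", 1.0),
    ("depth", "kilometer"): ("meter", 1000.0),
}


def to_preferred(field, value, unit):
    """Return (converted value, preferred unit) for a reported measurement."""
    try:
        preferred_unit, factor = TO_PREFERRED[(field, unit)]
    except KeyError:
        raise ValueError(f"no conversion known for {field!r} in {unit!r}") from None
    return value * factor, preferred_unit


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus N/S/E/W into signed decimal degrees."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere {hemisphere!r}")
    decimal = degrees + minutes / 60 + seconds / 3600
    # Southern and western coordinates are negative in decimal notation.
    return -decimal if hemisphere in ("S", "W") else decimal
```

A submitter could then report a depth as 250 centimeter and have it stored in meters, while an unrecognized unit is rejected rather than silently stored.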

lschriml commented 6 years ago

Good morning, As we specify the units in another column, I don’t think we need them in the metadata term as well. Also, this would break the submissions for BioSample. I would vote not to append the units.

Cheers, Lynn

GSC President

In reply to Chris Hunter's comment of May 11, 2018, 5:04 AM.

pyilmaz commented 6 years ago

We don't specify a units column; I believe that's just the implementation in GigaDB. MIxS suggests preferred units for a metadata term value, and the value and unit are supposed to be all in one line, e.g. 30 degree Celsius, 5 meter...
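For consumers of such combined fields, splitting the number from the unit is the first processing step. The following Python sketch shows one way to do it; the regular expression and the function name `split_value_and_unit` are assumptions for illustration and are not the MIxS "Value Syntax" itself.

```python
import re

# A number (optional sign, decimals, exponent), whitespace, then free-text unit.
VALUE_WITH_UNIT = re.compile(
    r"^\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s+(\S.*?)\s*$"
)


def split_value_and_unit(raw):
    """Split '30 degree Celsius' into (30.0, 'degree Celsius').

    Returns None when the field has no unit or does not start with a number,
    so callers can flag the record instead of guessing.
    """
    match = VALUE_WITH_UNIT.match(raw)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)
```

Records that come back as `None` are the ones the thread describes as the common case: a bare number with no unit supplied.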

lschriml commented 6 years ago

Would it help users for us to specify the preferred unit in the label, e.g. air temperature (degree Celsius)?

Cheers, Lynn

In reply to pyilmaz's comment of May 11, 2018, 8:16 AM.

jdeck88 commented 6 years ago

A few suggestions, based mainly on aligning with Darwin Core conventions for naming:

1) Spell out the units in names and insert "in" to clarify intention, so "depth_m" becomes "depth_in_meters" and "altitude_m" becomes "altitude_in_meters" (e.g., see http://rs.tdwg.org/dwc/terms/#maximumElevationInMeters).
2) Instead of "latitude_deg", I would suggest "decimal_latitude" (and the same for longitude). This is a bit different from the previous example, which has the units following, but it aligns with Darwin Core's use of decimalLatitude and decimalLongitude. Also, using "deg" or "degrees" is vague, as degrees could be expressed in many different forms, whereas "decimal" signifies the one form that is the most computable. Honestly, if folks were to insert all potential variants of degrees, the field would likely be impossible to parse.
3) Instead of "temperature_deg_c", I would suggest "temperature_in_celsius", without "degrees" in this case.
4) Finally, I prefer camel case because it keeps names more readable (at least to me) and shorter, but I understand keeping with a particular convention!
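The two naming styles in the list above are mechanically interchangeable, which may lower the stakes of point 4. As a small illustration, this Python sketch maps the snake_case names proposed here to lowerCamelCase; the function name is hypothetical, and the outputs are not claimed to be existing Darwin Core terms unless the thread names them as such.

```python
def snake_to_lower_camel(name):
    """Convert a snake_case column name to lowerCamelCase.

    The first word stays lowercase; each later word is capitalized.
    """
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)
```

For example, "decimal_latitude" maps to "decimalLatitude", the Darwin Core term cited in point 2.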

ramonawalls commented 6 years ago

Ditto to everything John said. To be more bold, for latitude and longitude, we really should just reuse the Darwin Core terms.

While in principle, I agree that having the units as a separate field is better, I understand that it may not be practical at this point. The solution proposed here does leave open the possibility we will need to add new terms in the future for other units, but we should stick to asking people to convert their data to the recommended units. If we go with the combined value/unit field, those can always be easily parsed in future ontological or data-based implementations.

tucotuco commented 6 years ago

Indeed, is there any reason not to adopt the Darwin Core terms exactly when they correspond? That would bring the communities closer and provide vetted, managed definitions and potentially also corresponding tools.

Darwin Core was based on Dublin Core, the convention for which is to have term names for properties in lowerCamelCase and term names for classes in UpperCamelCase. Their namespace policy (http://www.dublincore.org/documents/dcmi-namespace/) doesn't state this explicitly, but it does show examples, and the terms definitely follow this pattern.

lschriml commented 6 years ago

Let's discuss this at our June working group meeting.

Also, include ENA and NCBI in the discussion.

Cheers, Lynn

In reply to John Wieczorek's comment of May 11, 2018, 1:19 PM.

ikostadi commented 6 years ago

Hi, first of all, apologies for not being present at the development telecons. My perspective is that of a data broker - users come to us seeking help to deposit sequence data into ENA. Therefore, we are mostly using the ENA implementation of MIxS. I am sure the ENA Team will comment on this as well. I recognize the practical problem described by @cuttlefishh. However, I am definitely against including units in the label. Some parameters allow several different units. Applying different label rules to a subset of parameters will cause an extra implementation effort for existing infrastructures. Also, this would impose a certain limitation to alternative implementations of the checklists. Instead, I believe we should investigate the reasons why the units are not supplied in the first place. I can imagine a solution where the unit may be included in the header of a table (e.g. after a specified separator character) but is not part of the parameter label (e.g. 'depth#meter'); if a unit is also specified as part of the value (and is acceptable) it could/should take precedence. I hope I can join the live discussion in June.

Best, Ivo

Edit: What I wanted to say above was basically that, (IMHO) the implementation(s) should follow the specification and not the other way around.
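The header-with-separator idea in Ivo's comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `#` separator comes from the 'depth#meter' example, but the function names are hypothetical, and the check that a unit found in the value "is acceptable" is deliberately left out because the thread does not define it.

```python
def parse_header(header, separator="#"):
    """Split 'depth#meter' into ('depth', 'meter'); the unit is None if absent."""
    label, _, unit = header.partition(separator)
    return label, unit or None


def resolve_unit(header, raw_value, separator="#"):
    """Return (label, numeric value, unit) for one table cell.

    A unit written in the value takes precedence over the unit in the header,
    as suggested in the comment above. No acceptability check is performed.
    """
    label, header_unit = parse_header(header, separator)
    number, _, value_unit = raw_value.strip().partition(" ")
    value_unit = value_unit.strip() or None
    return label, float(number), value_unit or header_unit
```

Because the unit is stripped from the header before the label is used, the parameter label itself stays 'depth', which is the property Ivo wants to preserve for existing infrastructures.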

ramonawalls commented 3 years ago

See also https://github.com/GenomicsStandardsConsortium/mixs-rdf/issues/20. We would like to move to a single required unit per term.

mslarae13 commented 2 months ago

I think this can be closed as OBE / Not planned.

The LinkML implementation has provided the ability to structure slots/terms as a QuantityValue with a value and unit. We have ranges and patterns to enforce having a value & unit. When a specific unit is required, it's specified either in the description or in the regex.
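As a rough picture of what pattern-based enforcement looks like, here is a Python sketch. It does not reproduce the actual MIxS LinkML schema: the `QuantityValue` dataclass, its field names, the slot names, and the regular expressions in `SLOT_PATTERNS` are all hypothetical stand-ins for the ranges and patterns mentioned above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantityValue:
    """Hypothetical stand-in for a structured value-plus-unit record."""
    value: float
    unit: str


NUMBER = r"[-+]?\d+(?:\.\d+)?"

# Hypothetical per-slot patterns; each one hard-codes the single required unit.
SLOT_PATTERNS = {
    "depth": re.compile(rf"^(?P<value>{NUMBER}) (?P<unit>meter)$"),
    "temp": re.compile(rf"^(?P<value>{NUMBER}) (?P<unit>degree Celsius)$"),
}


def validate(slot, raw):
    """Parse raw text for a slot, raising ValueError if value or unit is wrong."""
    match = SLOT_PATTERNS[slot].match(raw)
    if match is None:
        raise ValueError(f"{raw!r} does not match the pattern for {slot!r}")
    return QuantityValue(float(match["value"]), match["unit"])
```

With the unit fixed inside the pattern, a value such as "5 feet" or a bare "5" fails validation for the depth slot instead of being stored ambiguously.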

The MIxS Excel file will be structured using the mixs.yaml source file and, as Ramona indicated, there are preferred units where applicable.

Implementation of the standard via NCBI (INSDC) vs how MIxS validates may be variable.

@lschriml can you provide any additional insight here about whether this is something GSC would implement, or whether, considering the advances we've made in LinkML, this is OBE?