GenomicsStandardsConsortium / mixs

“Minimum Information about any (X) Sequence” (MIxS) specification
https://w3id.org/mixs
Creative Commons Zero v1.0 Universal

measurement value ranges #166

Open wdduncan opened 3 years ago

wdduncan commented 3 years ago

For the NMDC, we have a number of samples for which the value of the depth is given as a range. We should discuss how best to handle this. For example, one approach is to adopt the DWC min depth, max depth, and verbatim depth terms. But we need to discuss other approaches.

Similar range needs also apply to fields like temperature and elevation.

cc @ramonawalls

wdduncan commented 3 years ago

The need for a depth range comes about when collecting soil samples. The soil being analyzed is taken from a homogenized mix of the soil between two depths of the core sample. Note that in other samples (such as water), you can specify a single depth.

Here are some options:

  1. Have 3 terms for depth: depth, minimum_depth, maximum_depth.
  2. Just have two terms for depth: minimum_depth, maximum_depth. If the depth is a single value, then minimum_depth is set to be the same as maximum_depth. This is how time ranges (e.g., start and end date) are sometimes handled in other systems.
  3. Since we are migrating towards managing mixs in linkml, we can create specialized min value and max value properties to modify a depth value. In JSON, it would look something like this:
    "depth": {
      "min value": 1,
      "max value": 5
    }

    Of the three options, I'm a bit partial to 2, although it comes with some headaches. E.g., what do we do with the current depth term? Do we keep it around (which in effect amounts to using option 1)? Do we drop it and provide guidance on how to migrate?

Option 3 is interesting, but it may be problematic for folks used to filling out spreadsheets.

cc @only1chunts @cmungall @dehays @lschriml

wdduncan commented 3 years ago

Another note:

Darwin Core also has min/max depth values that we may be able to make use of: minimumDepthInMeters, maximumDepthInMeters

cmungall commented 3 years ago

@wdduncan - great summary

Regarding your option 3: regardless of how the schema is implemented, the majority of instance data is in spreadsheets, not JSON. So I would frame your example in terms of what the string serialization would be, which would then be modeled in a normalized database in the appropriate way.

So I would state your options as

  1. 3 terms
  2. 2 terms: min and max
  3. keep the existing field and allow/force ranges
    • 3a. Syntax "NUMBER-NUMBER UNIT"
    • 3b. Syntax "NUMBER[-NUMBER] UNIT"

cmungall commented 3 years ago

The proposal also has to address backwards and forwards compatibility.

If 2 or 3a is chosen, what do we do with existing data that is a single value? Do we create depth=min=max?

wdduncan commented 3 years ago

It may be best to go with option 3b "NUMBER[-NUMBER] UNIT" and let the vendors implement the field as they see fit, e.g., having min and max fields in the sample database table.
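
As a rough illustration of how a vendor might split an option 3b string into min and max fields, here is a minimal Python sketch. The regular expression, the `DepthRange` type, and the `parse_3b` function are hypothetical names, not part of MIxS. The sketch assumes non-negative decimal numbers, tolerates a missing space before the unit, and sets max equal to min for a single value.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern for "NUMBER[-NUMBER] UNIT" (non-negative numbers only).
RANGE_3B = re.compile(
    r"^\s*(?P<min>\d+(?:\.\d+)?)"          # first number
    r"(?:\s*-\s*(?P<max>\d+(?:\.\d+)?))?"  # optional "-NUMBER"
    r"\s*(?P<unit>[A-Za-z]+)\s*$"          # unit, e.g. cm or m
)


class DepthRange(NamedTuple):
    minimum: float
    maximum: float
    unit: str


def parse_3b(value: str) -> Optional[DepthRange]:
    """Return (minimum, maximum, unit), or None if the value does not match."""
    m = RANGE_3B.match(value)
    if m is None:
        return None
    lo = float(m.group("min"))
    # A single value is treated as a range whose two ends are equal.
    hi = float(m.group("max")) if m.group("max") is not None else lo
    if hi < lo:
        return None  # reject reversed ranges in this sketch
    return DepthRange(lo, hi, m.group("unit"))
```

Under these assumptions, `parse_3b("0-10 cm")` and `parse_3b("10 cm")` both yield a range, while values without a unit (such as `5`) or free text (such as `surface`) return `None`.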

only1chunts commented 3 years ago

There is already a great deal of variation in the usage of the term depth. I just did a quick and dirty search: out of ~125k soil samples with a depth field in BioSamples, ~62k include a hyphen "-" within the value, suggesting there is already fairly widespread use of "NUMBER-[NUMBER]" type values. So, for backward compatibility, I think option 3b looks most reasonable.

cmungall commented 3 years ago

OK, seems like we are in agreement on 3b.

Based on discussion with some of our scientists, I would also like language that ranges are preferred over unitary values. I would use ISO language here, e.g.

"range SHOULD be specified as a range delimited by a hyphen. However, in cases where the range is not known, this MAY be specified as a unitary value"

ramonawalls commented 3 years ago

I am on board with option 3b as well. Let's finalize on Monday.

@raissameyer you should be aware of this as it may impact your mapping.

cmungall commented 3 years ago

FWIW, here are the top values for depth in INSDC

count value
61393 0
14776 not applicable
9890 missing
9572 0.1
8397 0.01
5501 0-10 cm
5066 surface
4280 0-20 cm
4151 10 cm
4107 5
3985 0-10cm
3920 NA
3900 not collected
3725 0-20cm
3509 0.0
3323 20 cm
3201 0-15cm
3193 1
3079 10
2944 1m
2877 0 m
2632 0-5 cm
2631 1-10cm
2601 15cm
2443 0.05
2307 5-1000m
2238 0.2
2180 5cm
2163 20
2097 0.5
2096 5 cm
2037 0.1 m
2022 10cm
1981 0.05m
1901 20cm
1883 0-0.1
1867 0m
1666 Unknown
1506 0.3
1486 0.5m
1439 1 m
1433 0.025
1412 50fsw
1395 2 m
1352 3
1342 15 cm
1336 0.01 m
1318 0-15 cm
1141 [0m-40m]
1114 5m
1084 30
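
To get a feel for how much of this legacy data already resembles the 3b syntax, one could bucket the values by shape. The sketch below is only illustrative: the category names and the `classify` and `tally` helpers are hypothetical, and `SAMPLE` holds just a handful of rows copied from the listing above rather than the full INSDC data.

```python
import re
from collections import Counter

# A few (value, count) rows copied from the listing above.
SAMPLE = [
    ("0", 61393), ("not applicable", 14776), ("0-10 cm", 5501),
    ("surface", 5066), ("10 cm", 4151), ("0-10cm", 3985),
    ("1m", 2944), ("5-1000m", 2307), ("[0m-40m]", 1141),
]

NUM = r"\d+(?:\.\d+)?"
RANGE_UNIT = re.compile(rf"^{NUM}\s*-\s*{NUM}\s*[A-Za-z]+$")
SINGLE_UNIT = re.compile(rf"^{NUM}\s*[A-Za-z]+$")
BARE_NUMBER = re.compile(rf"^{NUM}$")


def classify(value: str) -> str:
    """Assign a raw depth string to one of four hypothetical shape categories."""
    v = value.strip()
    if RANGE_UNIT.match(v):
        return "range with unit"
    if SINGLE_UNIT.match(v):
        return "single value with unit"
    if BARE_NUMBER.match(v):
        return "bare number"
    return "other"


def tally(rows):
    """Sum the sample counts per category."""
    totals = Counter()
    for value, count in rows:
        totals[classify(value)] += count
    return totals


if __name__ == "__main__":
    for category, total in tally(SAMPLE).most_common():
        print(f"{total:>7} {category}")
```

Values such as `[0m-40m]` land in "other" here, which hints at the kind of cleanup a migration to 3b would need.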

ramonawalls commented 3 years ago

Discussed on the call on Aug. 9 and agreed on 3b.

ramonawalls commented 3 years ago

Update other similar terms.

ramonawalls commented 3 years ago

TODO: change syntax to "NUMBER[-NUMBER] UNIT" for depth.

Leave this issue open for MIxS 7. Some fields should have a range, whereas some should have errors.

wdduncan commented 3 years ago

We need to consider cases in which negative numbers are used (e.g., temps below freezing).

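Use cases I can think of:

  • A single negative number (e.g., -10 C). The - needs to be interpreted correctly.
  • Two negative numbers (e.g., -10 to -20 C). Should we require the second number to be in parens (e.g., -10-(-20) C)?
  • A positive and a negative number (e.g., 5 to -10 C or -10 to 5 C). Using parens, the first would look like 5-(-10) C. The second would simply be -10-5 C.

Are these examples clear? Are the parens too confusing?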

only1chunts commented 2 years ago

All great stuff, but too much to implement in v6, so I am removing this ticket from the v6 project and labelling it with the v7 discussion label.

mslarae13 commented 2 years ago

We need to consider cases in which negative numbers are used (e.g., temps below freezing).

Use cases I can think of:

  • A single negative number (e.g., -10 C). The - needs to be interpreted correctly.
  • Two negative numbers (e.g., -10 to -20 C). Should we require the second number to be in parens (e.g., -10-(-20) C)?
  • A positive and a negative number (e.g., 5 to -10 C or -10 to 5 C). Using parens, the first would look like 5-(-10) C. The second would simply be -10-5 C.

Are these examples clear? Are the parens too confusing?
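
One way to test whether the parenthesized form is unambiguous is to write a parser for it. The following Python sketch covers only the three use cases listed above. The pattern and the `parse_signed` function are hypothetical, and the sketch assumes that a negative second number must be wrapped in parens, which is the open question here rather than an agreed rule.

```python
import re
from typing import Optional, Tuple

SIGNED = r"[+-]?\d+(?:\.\d+)?"

# start, then optionally "-" followed by an unsigned number or "(signed number)".
SIGNED_RANGE = re.compile(
    rf"^\s*(?P<start>{SIGNED})"
    rf"(?:-(?P<end>\d+(?:\.\d+)?|\({SIGNED}\)))?"
    r"\s*(?P<unit>[A-Za-z]+)\s*$"
)


def parse_signed(value: str) -> Optional[Tuple[float, float, str]]:
    """Return (start, end, unit) in the order entered, or None if no match."""
    m = SIGNED_RANGE.match(value)
    if m is None:
        return None
    start = float(m.group("start"))
    end_text = m.group("end")
    # Strip the parens around a signed second number; reuse start for single values.
    end = float(end_text.strip("()")) if end_text is not None else start
    return start, end, m.group("unit")
```

With this pattern, `-10 C`, `-10-(-20) C`, `5-(-10) C`, and `-10-5 C` all parse, while an unparenthesized `-10--20 C` is rejected.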

To add some context to Bill's recommendation for negative values: one use case happens in peatland. In this ecosystem there's undulation, with "lower" sections called hollows and "raised" sections called hummocks. When sampling soil, "distance from the surface" isn't always relative. So, 0-10cm from the surface of the hollow is the same parallel depth as 10-20cm from the surface of the hummock. In the research I've been involved in, to work around this "like depth, different location" issue, we added -0-10 as "distance below the surface of the hollow", and +0+10 as distance above the surface of the hollow and into the hummock. This also keeps all subsequent depths aligned. Here's an image to hopefully help illustrate this: https://drive.google.com/file/d/1Tbwadh1hvLQqtGEFKOVZAY1iZESPXtQx/view?usp=sharing

Also, sometimes, even if not relevant or needed, people will include -0-10 vs 0-10, even if it's the same thing.

wdduncan commented 2 years ago

Proposal on 2022-07-26: Break up the value into atomic fields, e.g., one field each for:

How does this affect user experience and tools to parse data?

We need to consider if it is best to simply have start and end fields, with those being equal for single point cases.

mslarae13 commented 2 years ago

I'm not sure I understand "start begin" & "depth end". Are you saying to separate the depth values when there are 2 (soil, sediment) and use begin and end, and to use depth when there's only 1 (water)?

For user experience, it's another column in an already wide sheet, BUT it might bring their attention to "this should be a range" and make validation easier.

Note for NMDC, unit isn't needed. We will require meters.

wdduncan commented 2 years ago

start begin

Sorry, that was a typo. The approach advocated by @pbuttigieg would be to have generic fields such as:

For non-range measurements, the range start and range end values would be the same.
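
A minimal Python sketch of that convention follows. The class name `RangeMeasurement`, its field names, and the `from_point` helper are hypothetical, since the exact field names are not given here. A point measurement is stored as a range whose start and end are equal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeMeasurement:
    """Hypothetical record with generic range start/end fields."""
    range_start: float
    range_end: float

    @property
    def is_point(self) -> bool:
        # A non-range measurement has identical start and end values.
        return self.range_start == self.range_end

    @classmethod
    def from_point(cls, value: float) -> "RangeMeasurement":
        """Build a record from a single (non-range) measurement."""
        return cls(value, value)
```

For example, `RangeMeasurement.from_point(5.0)` gives a record with both fields set to 5.0, while `RangeMeasurement(0.0, 0.1)` represents a true range.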

pbuttigieg commented 2 years ago

Thanks @wdduncan

Recalling the overall goal is to avoid having to write custom code to parse syntax in a data standard (values should be as simple as possible):

I'm actually impartial to whether there are range fields alone or accompanied by a point measurement field. The concern that this would be confusing for some prompted the suggestion of using only range fields and instructing users to enter identical begin/end values.

DwC's verbatim fields are handy for legacy data or data gathered in non-machine-friendly ways (scrawlings in a field notebook, "...the creature was retrieved from about half an arm's length deep")

Further:

As discussed in previous CIG calls on atomisation and improved actionability, as well as at the last board meeting, I would leave out the "unit" field, instead requiring standard units (e.g. meters) in each field.

There is too much variation in the units used, no validation of what's entered, and no stable way to autoconvert between units.

wdduncan commented 2 years ago

In some of the software systems I've worked with, the software would automatically set the range end value equal to the range start in cases where only a single value was required. I don't think this is a major impetus to having both range start/end fields, but the guidance for how to use them needs to be clear.

I think it is reasonable to have the unit field. Not everyone works in units of meters. We may require that the unit come from a standardized source, though.
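
To make the trade-off concrete, here is a small Python sketch of a unit field restricted to a controlled list, with conversion to meters. The `TO_METERS` table and the `to_meters` function are hypothetical. A real implementation would presumably draw its units from whatever standardized source is agreed on, which is not decided in this thread.

```python
# Hypothetical controlled list: unit symbol -> factor that converts to meters.
TO_METERS = {"mm": 0.001, "cm": 0.01, "m": 1.0, "km": 1000.0}


def to_meters(value: float, unit: str) -> float:
    """Convert a length to meters, rejecting units outside the controlled list."""
    try:
        factor = TO_METERS[unit]
    except KeyError:
        raise ValueError(f"unit {unit!r} is not in the controlled list") from None
    return value * factor
```

Rejecting unknown units outright (for instance `fsw`) keeps validation simple, at the cost of pushing conversion work back onto submitters.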