Open wdduncan opened 3 years ago
The need for a depth range comes about when collecting soil samples. The soil being analyzed is taken from a homogenized mix of the soil between two ranges of the core sample. Note, in other samples (such as water), you can specify a specific depth.
Here are some options:
1. Use three terms: `depth`, `minimum_depth`, `maximum_depth`.
2. Use two terms: `minimum_depth`, `maximum_depth`. If the depth is a single value, then `minimum_depth` is set to be the same as `maximum_depth`. This is how time ranges (e.g., start and end date) are sometimes handled in other systems.
3. Use `min value` and `max value` properties to modify a depth value. In JSON, it would look something like this:

        "depth":
        {
          "min value": 1,
          "max value": 5
        }

Of the three options, I'm a bit partial to 2, although it comes with some headaches. E.g., what do we do with the current `depth` term? Do we keep it around (which in effect amounts to using option 1)? Do we drop it and provide guidance on how to migrate?
Option 3 is interesting, but this may be problematic for folks used to filling out spreadsheets.
cc @only1chunts @cmungall @dehays @lschriml
Another note:
Darwin Core also has min/max depth values that we may be able to make use of: `minimumDepthInMeters`, `maximumDepthInMeters`
@wdduncan - great summary
Regarding your option 3: regardless of how the schema is implemented, the majority of instance data is in spreadsheets, not JSON. So I would frame your example in terms of what the string serialization would be, which would then be modeled in a normalized database in the appropriate way.
So I would state your options as
The proposal also has to address backwards and forwards compatibility.
If 2 or 3a is chosen, what do we do with existing data that is a single value? Do we create depth=min=max?
It may be best to go with option 3b, "NUMBER[-NUMBER] UNIT", and let the vendors implement the field as they see fit, e.g., having min and max fields in the sample database table.
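To make the "NUMBER[-NUMBER] UNIT" syntax concrete, here is a minimal Python sketch of how a consumer might split such a string into min and max fields. The names `DEPTH_RE`, `DepthRange`, and `parse_depth` are hypothetical, and the sketch assumes non-negative decimal numbers, an alphabetic unit, and optional whitespace around the hyphen and before the unit. None of that is specified in this thread.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern for "NUMBER[-NUMBER] UNIT": non-negative decimals only,
# alphabetic unit, whitespace around "-" and before the unit is tolerated.
_NUM = r"\d+(?:\.\d+)?"
DEPTH_RE = re.compile(rf"^\s*({_NUM})(?:\s*-\s*({_NUM}))?\s*([A-Za-z]+)\s*$")


class DepthRange(NamedTuple):
    minimum: float
    maximum: float
    unit: str


def parse_depth(text: str) -> Optional[DepthRange]:
    """Return a DepthRange, or None when the text does not fit the pattern."""
    m = DEPTH_RE.match(text)
    if m is None:
        return None
    low = float(m.group(1))
    # A single value is stored as a range whose two ends coincide.
    high = float(m.group(2)) if m.group(2) is not None else low
    if high < low:
        # Assumption: a range is written shallowest first.
        return None
    return DepthRange(low, high, m.group(3))
```

With this sketch, `parse_depth("0-10 cm")` and `parse_depth("0-10cm")` give the same range, while values such as `surface` or a bare `5` are rejected because they carry no number or no unit.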
There is already a great deal of variation in the usage of the term depth. I just did a quick and dirty search: out of ~125k soil samples with a depth field in BioSamples, ~62k include a hyphen "-" within the value, suggesting there is already fairly widespread usage of "NUMBER-[NUMBER]" type values. So, for backward compatibility, I think option 3b looks most reasonable.
OK, seems like we are in agreement on 3b.
Based on discussion with some of our scientists, I would also like language stating that ranges are preferred over unitary values. I would use ISO language here, e.g.
"range SHOULD be specified as a range delimited by a hyphen. However, in cases where the range is not known, this MAY be specified as a unitary value"
I am on board with option 3b as well. Let's finalize on Monday.
@raissameyer you should be aware of this as it may impact your mapping.
FWIW, here's the top values for depth in INSDC
count | value |
---|---|
61393 | 0 |
14776 | not applicable |
9890 | missing |
9572 | 0.1 |
8397 | 0.01 |
5501 | 0-10 cm |
5066 | surface |
4280 | 0-20 cm |
4151 | 10 cm |
4107 | 5 |
3985 | 0-10cm |
3920 | NA |
3900 | not collected |
3725 | 0-20cm |
3509 | 0.0 |
3323 | 20 cm |
3201 | 0-15cm |
3193 | 1 |
3079 | 10 |
2944 | 1m |
2877 | 0 m |
2632 | 0-5 cm |
2631 | 1-10cm |
2601 | 15cm |
2443 | 0.05 |
2307 | 5-1000m |
2238 | 0.2 |
2180 | 5cm |
2163 | 20 |
2097 | 0.5 |
2096 | 5 cm |
2037 | 0.1 m |
2022 | 10cm |
1981 | 0.05m |
1901 | 20cm |
1883 | 0-0.1 |
1867 | 0m |
1666 | Unknown |
1506 | 0.3 |
1486 | 0.5m |
1439 | 1 m |
1433 | 0.025 |
1412 | 50fsw |
1395 | 2 m |
1352 | 3 |
1342 | 15 cm |
1336 | 0.01 m |
1318 | 0-15 cm |
1141 | [0m-40m] |
1114 | 5m |
1084 | 30 |
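
The variety in the table above can be bucketed roughly by shape. The following Python sketch is only an illustration: the category names, the `MISSING` set, and the `classify` helper are assumptions made here, and the `sample` list is a handful of values copied from the table, not the full INSDC data.

```python
import re
from collections import Counter

NUM = r"\d+(?:\.\d+)?"
# Assumed set of missing-value terms, taken from the strings seen in the table.
MISSING = {"not applicable", "missing", "na", "not collected", "unknown"}


def classify(value: str) -> str:
    """Assign a depth string to a coarse, hypothetical shape category."""
    v = value.strip()
    if v.lower() in MISSING:
        return "missing-value term"
    if re.fullmatch(rf"{NUM}\s*-\s*{NUM}\s*[A-Za-z]+", v):
        return "range with unit"
    if re.fullmatch(rf"{NUM}\s*-\s*{NUM}", v):
        return "range without unit"
    if re.fullmatch(rf"{NUM}\s*[A-Za-z]+", v):
        return "single value with unit"
    if re.fullmatch(NUM, v):
        return "single value without unit"
    return "other"


# A few values copied from the table above.
sample = ["0", "not applicable", "0-10 cm", "surface", "10 cm",
          "0-10cm", "0-0.1", "50fsw", "[0m-40m]", "5-1000m"]
counts = Counter(classify(v) for v in sample)
```

Only the "range with unit" and "single value with unit" buckets would fit a "NUMBER[-NUMBER] UNIT" pattern without further cleanup.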
Discussed on call on Aug. 9 and agreed on 3B
Update other similar terms.
TODO: change syntax to "NUMBER[-NUMBER] UNIT" for depth.
Leave this issue open for MIxS 7. Some fields should have a range, whereas some should have errors.
all great stuff, but too much to implement in v6, so I am removing this ticket from the v6 project and labelling with v7 discussion label.
We need to consider cases in which negative numbers are used (e.g., temps below freezing).
Use cases I can think of:
- A single negative number (e.g., `-10 C`). The `-` needs to be interpreted correctly.
- Two negative numbers (e.g., `-10 to -20 C`). Should we require the second number to be in parens (e.g., `-10-(-20) C`)?
- A positive and a negative number (e.g., `5 to -10 C` or `-10 to 5 C`). Using parens, the first would look like `5-(-10) C`. The second would simply be `-10-5 C`.

Are these examples clear? Are the parens too confusing?
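
One way to check whether the paren convention is unambiguous is to try parsing it. The Python sketch below is an assumption-laden illustration, not an agreed syntax: `SIGNED_RANGE_RE` and `parse_signed_range` are hypothetical names, the unit is assumed to be alphabetic and separated by whitespace, and a negative second number is accepted only when wrapped in parens.

```python
import re
from typing import Optional, Tuple

_UNSIGNED = r"\d+(?:\.\d+)?"
SIGNED_RANGE_RE = re.compile(
    rf"^\s*(-?{_UNSIGNED})"                          # first number, optional sign
    rf"(?:-(?:\((-{_UNSIGNED})\)|({_UNSIGNED})))?"   # "-(-N)" or "-N" as second number
    rf"\s+([A-Za-z]+)\s*$"                           # unit, assumed alphabetic
)


def parse_signed_range(text: str) -> Optional[Tuple[float, float, str]]:
    """Return (first, second, unit), or None if the text does not match."""
    m = SIGNED_RANGE_RE.match(text)
    if m is None:
        return None
    first = float(m.group(1))
    # Group 2 holds a parenthesized negative number, group 3 an unsigned one.
    second_text = m.group(2) or m.group(3)
    second = float(second_text) if second_text is not None else first
    return first, second, m.group(4)
```

Under these assumptions, `-10 C`, `-10-(-20) C`, `5-(-10) C`, and `-10-5 C` all parse, while `-10--20 C` is rejected because the second negative number lacks parens. The numbers are returned in the order written, so ordering rules would still need to be decided separately.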
To add some context to Bill's recommendation for negative values: one use case happens in peatland. In this ecosystem there's undulation, with "lower" sections called hollows and "raised" sections called hummocks. When sampling soil, "distance from the surface" isn't always relative. So, 0-10cm from the surface of the hollow is the parallel depth to 10-20cm from the surface of the hummock. In the research I've been involved in, to work around this "like depth, different location" issue, we added -0-10 as "distance below the surface of the hollow" and +0+10 as distance above the surface of the hollow and into the hummock. This also keeps all subsequent depths aligned. Here's an image to hopefully help detail this: https://drive.google.com/file/d/1Tbwadh1hvLQqtGEFKOVZAY1iZESPXtQx/view?usp=sharing
Also, sometimes, even if not relevant or needed, people will include -0-10 vs 0-10, even if it's the same thing.
Proposal on 2022-07-26: Break up the value into atomic fields, e.g., one field each for:
How does this affect user experience and tools to parse data?
We need to consider if it is best to simply have start and end fields, with those being equal for single point cases.
I'm not sure I understand "start begin" & "depth end". Are you saying to separate the depth values when there are 2 (soil, sediment) and use begin and end, and to use depth when there's only 1 (water)?
For user experience, it's another column in an already wide sheet, BUT it might bring their attention to "this should be a range" and make validation easier.
Note for NMDC, unit isn't needed. We will require meters.
> start begin
Sorry, that was a typo. The approach advocated by @pbuttigieg would be to have generic fields such as:
For non-range measurements, the `range start` and `range end` values would be the same.
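
As a rough illustration of the atomic-field approach, here is a Python sketch in which a measurement is just two numeric fields and a point measurement repeats the same value in both. The class name `RangeMeasurement`, its field names, and the example values are hypothetical and chosen only to mirror the `range start` / `range end` wording above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeMeasurement:
    """Hypothetical atomic representation: two numeric fields, no embedded syntax."""
    range_start: float
    range_end: float

    @classmethod
    def point(cls, value: float) -> "RangeMeasurement":
        # A non-range measurement stores the same value in both fields.
        return cls(range_start=value, range_end=value)

    @property
    def is_point(self) -> bool:
        return self.range_start == self.range_end


# Illustrative values only: a homogenized soil core interval and a water sample.
soil_core = RangeMeasurement(range_start=0.1, range_end=0.2)
water_sample = RangeMeasurement.point(5.0)
```

Nothing here needs string parsing, which is the main appeal of this option. The cost is one extra column per range-capable field in the spreadsheet.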
Thanks @wdduncan
Recalling the overall goal is to avoid having to write custom code to parse syntax in a data standard (values should be as simple as possible):
I'm actually impartial to whether there are range fields alone or accompanied by a point measurement field. The concern that this would be confusing for some prompted the suggestion of using only range fields and instructing users to enter identical begin/end values.
DwC's verbatim fields are handy for legacy data or data gathered in non-machine-friendly ways (scrawlings in a field notebook, "...the creature was retrieved from about half an arm's length deep")
Further:
As discussed in previous CIG calls on atomisation and improved actionability, as well as at the last board meeting, I would leave out the "unit" field, instead requiring standard units (e.g. meters) in each field.
There is too much variation in the units used, no validation of what's entered, and no stable way to autoconvert between units.
In some of the software systems I've worked with, the software would automatically set the range end value equal to the range start in cases where only a single value was required. I don't think this is a major impetus to having both range start/end fields, but the guidance for how to use them needs to be clear.
I think it is reasonable to have the unit field. Not everyone works in units of meters. We may require that the unit come from a standardized source, though.
For the NMDC, we have a number of samples for which the value of the depth is given as a range. We should discuss how to best handle this. For example, one approach is to adopt the DwC min depth, max depth, and verbatim depth terms. But we need to discuss other approaches.
Similar range needs also apply to fields like temperature and elevation.
cc @ramonawalls