Closed schristley closed 4 years ago
The other option would be to have a "controlled" vocabulary for this in the sense that for the string field to be valid (according to the standard) it needs to be in a specific format with a number and a unit. From an API perspective, one could still describe range queries on the object (give me all ages > "40 yr"), but this would mean that the implementation of the combined API service and repository would need to handle the comparison. From a repository perspective, we would almost certainly split this up into two fields as Scott suggests, but that doesn't necessarily mean that the API does.
FWIW, the other topic raised in the other thread is age ranges. Studies often provide age information in this way and there is no way to get an exact age. Internally, iReceptor Repositories already have an extension that stores age range (age in years represented by a floating point number in our case) so we can search on this.
The other option would be to have a "controlled" vocabulary for this in the sense that for the string field to be valid (according to the standard) it needs to be in a specific format with a number and a unit.
I would consider that more like a new "data type" versus a controlled vocabulary because all programs end-to-end would have to parse and interpret that field in a special way differently from the base data types like integers, strings and numbers. OpenAPI doesn't have a mechanism to specify custom formats like that, so we would have to define (and somehow enforce) this ourselves, which seems like added complexity. The two field proposal allows standard data types to be used.
MiniStd's take on this one was always that the MiAIRR "Age" field holds a physical quantity and everything else is considered formating (we discussed this at least twice). However, I understand that there should be an simple way to query on this information. Also, we did not think too much about ranges.
Let's tackle the representation in the schema first: Would it be sufficient if we had three keys age_min
, age_max
and age_unit
?
It's perfect reasonable that UIs or data submission forms can combine the multiple fields into a formatted physical quantity age
field for MiAIRR.
For ranges, I was thinking of being able to query with a range, but I suppose it is also relevant to specify a range? If yes, In the case when a single age is specified versus a range, it needs to be clear which field to use to hold the value, or both age_min
and age_max
must have the same value.
Doing a range query against a single value, versus doing a range query against another range, are two different queries. They require a different logical expression. Users would have to know ahead of time which one to use, or we have to define the fields in a way so that a single logical expression works for all cases.
Let's tackle the representation in the schema first: Would it be sufficient if we had 3 keys age_min, age_max and age_unit?
Should work. If we assume that age_min
is used when specifying a single age value and age_max
should be null, if age_max
is not null then the age is assumed to be a range. The following expression should work for both when querying for a single age value (query_age).
((age_max is missing) and (age_min = query_age))
or ((age_max is not missing) and (age_min <= query_age) and (age_max >= query_age))
If querying for a range (query_age_min, query_age_max) then this expression can be used.
((age_max is missing) and (age_min >= query_age_min) and (age_min <= query_age_max))
or ((age_max is not missing) and (age_min <= query_age_max) and (age_max >= query_age_min))
The age
field can be kept as a MiAIRR reporting field.
look good? @bcorrie @bussec
Yes, this seems reasonable to us... This is essentially what we are already doing... The only difference is that it might be a good idea to suggest (require?) that if an explicit age is given that age_min == age_max. Then you can simply use (age_min <= query_age) and (age_max >= query_age) and don't have to check the special case of one of them being missing...
The only difference is that it might be a good idea to suggest (require?) that if an explicit age is given that age_min == age_max.
That seems reasonable. I initially avoided because I wasn't sure how code would know if the data represented a single age or a range, but I guess they can check if age_min = age_max
.
However, if we do, we need to "require" it, otherwise data entry errors will cause queries to fail, that is, all code everything which sets the age must remember to set both fields. Unfortunately there isn't a way to specify this in the schema so it would need to be documented, seems fragile...
Maybe instead of age_min
and age_max
, we name the fields age_magnitude
and age_range
. Then age_magnitude
is the single value and becomes the lower bound for a range with age_range
being the upper bound. Suggestions on other names?
But I tend not to like having fields which can be interpreted in multiple ways. Maybe we should be more explicit and separate the range and magnitude fields with age_magnitude
, age_min
, age_max
.
age_magnitude
and age_range
idea in general, IMO it makes more sense to define
age_magnitude
as the mean of age_min
and age_max
and age_range
as (age_max
-age_min
)/2age_unit
should hold an SI compliant unit symbol (i.e. "a", not "y" or "yr").I would like to propose that age_unit should hold an SI compliant unit symbol (i.e. "a", not "y" or "yr").
I'm not familiar with "a" as a unit of time. I'm fine with restricting to standard unit symbols. Is there a table that we should incorporate as a controlled vocabulary?
@bcorrie @bussec Have we come to a consensus on this? It would be nice to have this specified more clearly for ADC API V1
No consensus yet that I am aware of but I was thinking along the same lines - that we should try and resolve for the ADC API.
I find the use of age_magnitude a bit confusing. It doesn't really sound like a magnitude to me - magnitude usually means size... Are we talking about size? Why not just "age" and "age_range"? But then that is confusing because if you take age alone, it is misleading. Maybe age_average and age_range as that is what they truly are. But I find that a bit confusing as well...
I think it might be best to lean towards using terminology and fields that most closely map to how a an experimenter would describe this. I think they would say something like "... Subject A was female between 10 and 20 years old" => age_min, age_max, age_unit? Correct me if I am wrong...
The underlying issue is how to represent either a point value or an interval, and then allow queries which may either be a point value or interval.
Query with a point value "age":30
could have two interpretations:
Query with an interval requires
Technically, you could specify how the intervals overlap, like one interval completely encloses the other...
So those are the queries we need to support, what is the best way to structure the data so they can be easily supported.
Let's assume we have fields age
(which represents the point value, or the range minimum) and age_range_max
, the queries (in pigdin json) would be:
For query (1), except this doesn't exclude records which have a range where the lower bound is exactly the age, e.g. [30,50]
"age":30
For query (2)
"age", ">=", 30, "and", "age_range_max", "<=", 30
For query (3) with interval [20,40], here is the proper query if the record has a point value
"age", ">=", 20, "and", "age", "<=", 40
For query (4) with interval [20,40], here is the proper query (interval overlap) if the record has an interval
"age", ">=", 20, "and", "age", "<=", 40
"or"
"age_range_max", ">=", 20, "and", "age_range_max", "<=", 40
so the queries almost work out except for (1).
If we don't want to have that query (1) handled improperly, then we need to have separate fields for a point value versus an interval.
Let's assume we have fields age
(which represents the point value) and age_range_min
and age_range_max
(which represent an interval), the queries (in pigdin json) would be:
For query (1), this properly
"age":30
For query (2)
"age":30
"or"
"age_range_min", ">=", 30, "and", "age_range_max", "<=", 30
For query (3) with interval [20,40], here is the proper query if the record has a point value
"age", ">=", 20, "and", "age", "<=", 40
For query (4) with interval [20,40], here is the proper query (interval overlap) if the record has an interval
"age", ">=", 20, "and", "age", "<=", 40
"or"
"age_range_min", ">=", 20, "and", "age_range_min", "<=", 40
"or"
"age_range_max", ">=", 20, "and", "age_range_max", "<=", 40
So my conclusion is that we need three fields.
Sounds good to me, just to make sure that I understood the implication for annotators:
age
while age_range_*
stay NULLage_range_*
keys but not age
Meaning that if all three keys are NOT NULL
something is wrong.
And coming back to your earlier suggestion, I agree that would would make sense to have a controlled vocabulary for units and their abbreviations.
OK, my head hurts consider all of those options. 8-) Using my suggestion of age_min and age_max, where for a point value we require age_min = age_max, I think the concept is more intuitive (you are describing a min and a max for the age range) and the logic is easier... The only down side is the case where you have to have both min and max set. If you wanted to support one of them being null (indicating a point age), then the calculation of the other can be done when you test for equality.
For query (1), want all data that is exactly 30
30 >= age_min && 30 <= age_max
For query (2), want all data where the exact query number (30) is within the range
30 >= age_min && 30 <= age_max (same as above)
For query (3) with interval [20,40], assuming record has an point age (e.g. age_min == age_max == 30), here is the proper query
40 >= age_min && 20 <= age_max
For query (4) with query interval [20,40], assuming record has a range (e.g. age_min == 10 age_max == 30), here is the proper query
40 >= age_min && 20 <= age_max (same as above)
If I look at the logic, I think this works - given the possibilities below:
Age range possibilities
age_min age_max
10 30 (interval overlap low == True)
30 45 (interval overlap high == True)
30 35 (query interval contains age interval == True)
10 50 (query interval contained by age interval == True)
10 15 (query interval outside age interval low == False)
45 50 (query interval outside age interval high == False)
So the logic is much simpler. There are two cases, if the query is a point query you do this:
30 >= age_min && 30 <= age_max
and if the query is a range query you do this:
40 >= age_min && 20 <= age_max
If you want closed or open intervals you add/drop the = in the equality. If you want to support one of them being null rather than enforcing both, before calculating you can do:
if age_min == NULL age_min = age_max else if age_max == NULL age_max = age_min
If both are NULL you don't have an age...
Actually, I think it is simpler than that. I think a simple:
query_max >= age_min && query_min <= age_max
works for all cases if you promote a point age or point query to an interval where min==max.
Actually, I think it is simpler than that.
Brilliant!
So does that mean we want to go with age_min, age_max, and age_unit
"... Subject A was female between 10 and 20 years old" => age_min = 10, age_max = 20, age_unit = "y" or "a" (yet to be agreed on unit terminology).
With the main "complication" being exactly how to specify a point age. We could use:
Being equivalent and all implying a point value of 10 units of age. We could alternatively enforce one of these being the correct way of specifying a point age and make the others error conditions.
age_min = null, age_max = null means there is no age.
age_min != null and age_max != null and age_unit = "value" is an error (we have no age but an age_unit - this doesn't make sense)
age_min != null or age_max != null and age_unit = null is an error (we have an age but no unit - this doesn't make sense).
I suppose the only question I had is do we keep age
to be a combined display value, and the three new fields are essentially just for the API for doing query? Or are we deprecating age
?
Dare I say deprecate age? I think that makes sense for the DataRep part (the actual YAML spec) for sure.
I agree with deprecating age
and already put this change on the agenda for this week's MiniStd call.
The proposed changes were accepted by MiniStd. As this is a MiAIRR field, should we follow the (proposed) deprecation process in #248 ?
The proposed changes were accepted by MiniStd. As this is a MiAIRR field, should we follow the (proposed) deprecation process in #248 ?
I agree.
Any suggestions on how we document this well? The MiAIRR section of the doc under AIRR Standards is pretty sparse, while the data submission section is heavily weighted on NCBI implementation. Should we expand the MiAIRR section, or should we put all this under the AIRR Data Model and the schema definitions?
Do we want to suggest or specify a default for age_unit? If not, what does it mean to have an age_min or age_max without an age_unit?
Controlled vocabulary for units of age here: https://www.iau.org/publications/proceedings_rules/units/
Quantity | Unit: Name | Symbol | Value |
---|---|---|---|
time (1) | minute | min or " | 60 s |
time | hour | h | 3600 s = 60 min |
time | day | d | 86 400 s = 24 h |
time | year (Julian) | a | 31.5576 Ms = 365.25 d |
If not, what does it mean to have an age_min or age_max without an age_unit?
Wild guess: That the annotator skipped physics classes in Junior High?
Funny, but not helpful 8-)
Darn, I got it wrong again! So lets try helpful, but not funny instead:
A physical quantity without a unit is formally wrong (unless it is dimensionless, which measure of time are not). Therefore IMO the question should not be "what do we default to if people fail to provide it" but rather "how do we keep people from enter invalid data".
Looking at this, there are three more things that crossed my mind:
In #254 we specify age_min
and age_max
as integer
. Should this be number
instead (AFAIK there is no separate float data type in OpenAPI, right @schristley ? ).
We now have for the first time the situation that a single MiAIRR field ("Age") is composed of several keywords (age_min
, age_max
, age_unit
). Would this create any problem?
A very simple way of dealing with our unit discussion would be to always encode time in seconds (by definition) and leave it to the UI to sort out the conversion. We should just avoid using int32
as data type ;-). The downside of this is that it would not be human-readable anymore, so from a MiniStd perspective it is not the preferred solution, but I wanted to hear the DataRep / ComRepo ideas on this.
Controlled vocabulary for units of age here:
I believe we also want weeks (common for mice) and months. My preference is that we don't use abbreviations but have the controlled_vocabulary be the full words.
Do we want to suggest or specify a default for age_unit?
I think having a default, which is potentially wrong, is worse than leaving it blank, though if the default was something like "unspecified" then that might be okay. But I'm still not clear how that is any different from null.
If not, what does it mean to have an age_min or age_max without an age_unit?
It's a validation error. If you want, you can create some new x-airr
attributes that defines the relationship, and the validation routines can check for them. Or you can just add these checks into your repository. Or you can just document it. Or do all three.
In #254 we specify age_min and age_max as integer. Should this be number instead (AFAIK there is no separate float data type in OpenAPI
yes, they should be number
We now have for the first time the situation that a single MiAIRR field ("Age") is composed of several keywords (age_min, age_max, age_unit ). Would this create any problem?
not yet! Conceptually I don't see it as a problem as MiAIRR is (abstractly) about information while the three fields are the implementation. Technically, we might have to alter some scripts, and etc., but that's usually no big deal. We might find in the future that we want to do this for other physical quantities as well, age was just the most obvious one.
A very simple way of dealing with our unit discussion would be to always encode time in seconds
yuck!
There was earlier discussion in #27 about the age fields in MiAIRR. The actual
age
field is a physical quantity, which means it has both the magnitude and unit combined together into a text field. This makes it problematic to perform range queries on the age which is desirable for the CRWG API. Should we consider separating this field into two fields:age
andage_unit
whereage
is number andage_unit
is an enum (yr, wk, day)?