airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
35 stars 23 forks source link

separate magnitude and unit for age? #192

Closed schristley closed 4 years ago

schristley commented 5 years ago

There was earlier discussion in #27 about the age fields in MiAIRR. The actual age field is a physical quantity, which means it has both the magnitude and unit combined together into a text field. This makes it problematic to perform range queries on the age which is desirable for the CRWG API. Should we consider separating this field into two fields: age and age_unit where age is number and age_unit is an enum (yr, wk, day)?

bcorrie commented 5 years ago

The other option would be to have a "controlled" vocabulary for this in the sense that for the string field to be valid (according to the standard) it needs to be in a specific format with a number and a unit. From an API perspective, one could still describe range queries on the object (give me all ages > "40 yr"), but this would mean that the implementation of the combined API service and repository would need to handle the comparison. From a repository perspective, we would almost certainly split this up into two fields as Scott suggests, but that doesn't necessarily mean that the API does.

FWIW, the other topic raised in the other thread is age ranges. Studies often provide age information in this way and there is no way to get an exact age. Internally, iReceptor Repositories already have an extension that stores age range (age in years represented by a floating point number in our case) so we can search on this.

schristley commented 5 years ago

The other option would be to have a "controlled" vocabulary for this in the sense that for the string field to be valid (according to the standard) it needs to be in a specific format with a number and a unit.

I would consider that more like a new "data type" versus a controlled vocabulary because all programs end-to-end would have to parse and interpret that field in a special way differently from the base data types like integers, strings and numbers. OpenAPI doesn't have a mechanism to specify custom formats like that, so we would have to define (and somehow enforce) this ourselves, which seems like added complexity. The two field proposal allows standard data types to be used.

bussec commented 5 years ago

MiniStd's take on this one was always that the MiAIRR "Age" field holds a physical quantity and everything else is considered formating (we discussed this at least twice). However, I understand that there should be an simple way to query on this information. Also, we did not think too much about ranges.

Let's tackle the representation in the schema first: Would it be sufficient if we had three keys age_min, age_max and age_unit?

schristley commented 5 years ago

It's perfect reasonable that UIs or data submission forms can combine the multiple fields into a formatted physical quantity age field for MiAIRR.

For ranges, I was thinking of being able to query with a range, but I suppose it is also relevant to specify a range? If yes, In the case when a single age is specified versus a range, it needs to be clear which field to use to hold the value, or both age_min and age_max must have the same value.

Doing a range query against a single value, versus doing a range query against another range, are two different queries. They require a different logical expression. Users would have to know ahead of time which one to use, or we have to define the fields in a way so that a single logical expression works for all cases.

schristley commented 5 years ago

Let's tackle the representation in the schema first: Would it be sufficient if we had 3 keys age_min, age_max and age_unit?

Should work. If we assume that age_min is used when specifying a single age value and age_max should be null, if age_max is not null then the age is assumed to be a range. The following expression should work for both when querying for a single age value (query_age).

((age_max is missing) and (age_min = query_age))
or ((age_max is not missing) and (age_min <= query_age) and (age_max >= query_age))

If querying for a range (query_age_min, query_age_max) then this expression can be used.

((age_max is missing) and (age_min >= query_age_min) and (age_min <= query_age_max))
or ((age_max is not missing) and (age_min <= query_age_max) and (age_max >= query_age_min))

The age field can be kept as a MiAIRR reporting field.

look good? @bcorrie @bussec

bcorrie commented 5 years ago

Yes, this seems reasonable to us... This is essentially what we are already doing... The only difference is that it might be a good idea to suggest (require?) that if an explicit age is given that age_min == age_max. Then you can simply use (age_min <= query_age) and (age_max >= query_age) and don't have to check the special case of one of them being missing...

schristley commented 5 years ago

The only difference is that it might be a good idea to suggest (require?) that if an explicit age is given that age_min == age_max.

That seems reasonable. I initially avoided because I wasn't sure how code would know if the data represented a single age or a range, but I guess they can check if age_min = age_max.

However, if we do, we need to "require" it, otherwise data entry errors will cause queries to fail, that is, all code everything which sets the age must remember to set both fields. Unfortunately there isn't a way to specify this in the schema so it would need to be documented, seems fragile...

Maybe instead of age_min and age_max, we name the fields age_magnitude and age_range. Then age_magnitude is the single value and becomes the lower bound for a range with age_range being the upper bound. Suggestions on other names?

But I tend not to like having fields which can be interpreted in multiple ways. Maybe we should be more explicit and separate the range and magnitude fields with age_magnitude, age_min, age_max.

bussec commented 5 years ago
  1. While I like the age_magnitude and age_range idea in general, IMO it makes more sense to define
    • age_magnitude as the mean of age_min and age_max and
    • age_range as (age_max-age_min)/2
  2. I would like to propose that age_unit should hold an SI compliant unit symbol (i.e. "a", not "y" or "yr").
schristley commented 5 years ago

I would like to propose that age_unit should hold an SI compliant unit symbol (i.e. "a", not "y" or "yr").

I'm not familiar with "a" as a unit of time. I'm fine with restricting to standard unit symbols. Is there a table that we should incorporate as a controlled vocabulary?

schristley commented 5 years ago

@bcorrie @bussec Have we come to a consensus on this? It would be nice to have this specified more clearly for ADC API V1

bcorrie commented 5 years ago

No consensus yet that I am aware of but I was thinking along the same lines - that we should try and resolve for the ADC API.

I find the use of age_magnitude a bit confusing. It doesn't really sound like a magnitude to me - magnitude usually means size... Are we talking about size? Why not just "age" and "age_range"? But then that is confusing because if you take age alone, it is misleading. Maybe age_average and age_range as that is what they truly are. But I find that a bit confusing as well...

I think it might be best to lean towards using terminology and fields that most closely map to how a an experimenter would describe this. I think they would say something like "... Subject A was female between 10 and 20 years old" => age_min, age_max, age_unit? Correct me if I am wrong...

schristley commented 5 years ago

The underlying issue is how to represent either a point value or an interval, and then allow queries which may either be a point value or interval.

Query with a point value "age":30 could have two interpretations:

  1. want only records with exactly age=30 (so no records with intervals)
  2. I want records where age is exact or falls in the interval (so a record with interval [20,40]) would be returned.

Query with an interval requires

  1. If the record has a point value, does it lie within the interval
  2. If the record has an interval, do the intervals overlap

Technically, you could specify how the intervals overlap, like one interval completely encloses the other...

So those are the queries we need to support, what is the best way to structure the data so they can be easily supported.

Let's assume we have fields age (which represents the point value, or the range minimum) and age_range_max, the queries (in pigdin json) would be:

For query (1), except this doesn't exclude records which have a range where the lower bound is exactly the age, e.g. [30,50]

"age":30

For query (2)

"age", ">=", 30, "and", "age_range_max", "<=", 30

For query (3) with interval [20,40], here is the proper query if the record has a point value

"age", ">=", 20, "and", "age", "<=", 40

For query (4) with interval [20,40], here is the proper query (interval overlap) if the record has an interval

"age", ">=", 20, "and", "age", "<=", 40

"or"

"age_range_max", ">=", 20, "and", "age_range_max", "<=", 40

so the queries almost work out except for (1).

schristley commented 5 years ago

If we don't want to have that query (1) handled improperly, then we need to have separate fields for a point value versus an interval.

Let's assume we have fields age (which represents the point value) and age_range_min and age_range_max (which represent an interval), the queries (in pigdin json) would be:

For query (1), this properly

"age":30

For query (2)

"age":30

"or"

"age_range_min", ">=", 30, "and", "age_range_max", "<=", 30

For query (3) with interval [20,40], here is the proper query if the record has a point value

"age", ">=", 20, "and", "age", "<=", 40

For query (4) with interval [20,40], here is the proper query (interval overlap) if the record has an interval

"age", ">=", 20, "and", "age", "<=", 40

"or"

"age_range_min", ">=", 20, "and", "age_range_min", "<=", 40

"or"

"age_range_max", ">=", 20, "and", "age_range_max", "<=", 40

So my conclusion is that we need three fields.

bussec commented 5 years ago

Sounds good to me, just to make sure that I understood the implication for annotators:

  1. If you have an exact age, you only annotate age while age_range_* stay NULL
  2. If you have an age range, you annotate the age_range_* keys but not age

Meaning that if all three keys are NOT NULL something is wrong.

And coming back to your earlier suggestion, I agree that would would make sense to have a controlled vocabulary for units and their abbreviations.

bcorrie commented 5 years ago

OK, my head hurts consider all of those options. 8-) Using my suggestion of age_min and age_max, where for a point value we require age_min = age_max, I think the concept is more intuitive (you are describing a min and a max for the age range) and the logic is easier... The only down side is the case where you have to have both min and max set. If you wanted to support one of them being null (indicating a point age), then the calculation of the other can be done when you test for equality.

For query (1), want all data that is exactly 30

30 >= age_min && 30 <= age_max

For query (2), want all data where the exact query number (30) is within the range

30 >= age_min && 30 <= age_max (same as above)

For query (3) with interval [20,40], assuming record has an point age (e.g. age_min == age_max == 30), here is the proper query

40 >= age_min && 20 <= age_max 

For query (4) with query interval [20,40], assuming record has a range (e.g. age_min == 10 age_max == 30), here is the proper query

40 >= age_min && 20 <= age_max (same as above)

If I look at the logic, I think this works - given the possibilities below:

Age range possibilities
        age_min                      age_max
              10                             30 (interval overlap low == True)
              30                             45 (interval overlap high == True)
              30                             35 (query interval contains age interval == True)
              10                             50 (query interval contained by age interval == True)
              10                             15 (query interval outside age interval low == False)
              45                             50 (query interval outside age interval high == False)

So the logic is much simpler. There are two cases, if the query is a point query you do this:

30 >= age_min && 30 <= age_max

and if the query is a range query you do this:

40 >= age_min && 20 <= age_max 

If you want closed or open intervals you add/drop the = in the equality. If you want to support one of them being null rather than enforcing both, before calculating you can do:

if age_min == NULL age_min = age_max else if age_max == NULL age_max = age_min

If both are NULL you don't have an age...

bcorrie commented 5 years ago

Actually, I think it is simpler than that. I think a simple:

query_max >= age_min && query_min <= age_max

works for all cases if you promote a point age or point query to an interval where min==max.

schristley commented 5 years ago

Actually, I think it is simpler than that.

Brilliant!

bcorrie commented 5 years ago

So does that mean we want to go with age_min, age_max, and age_unit

"... Subject A was female between 10 and 20 years old" => age_min = 10, age_max = 20, age_unit = "y" or "a" (yet to be agreed on unit terminology).

With the main "complication" being exactly how to specify a point age. We could use:

Being equivalent and all implying a point value of 10 units of age. We could alternatively enforce one of these being the correct way of specifying a point age and make the others error conditions.

age_min = null, age_max = null means there is no age.

age_min != null and age_max != null and age_unit = "value" is an error (we have no age but an age_unit - this doesn't make sense)

age_min != null or age_max != null and age_unit = null is an error (we have an age but no unit - this doesn't make sense).

schristley commented 5 years ago

I suppose the only question I had is do we keep age to be a combined display value, and the three new fields are essentially just for the API for doing query? Or are we deprecating age?

bcorrie commented 5 years ago

Dare I say deprecate age? I think that makes sense for the DataRep part (the actual YAML spec) for sure.

bussec commented 5 years ago

I agree with deprecating age and already put this change on the agenda for this week's MiniStd call.

bussec commented 5 years ago

The proposed changes were accepted by MiniStd. As this is a MiAIRR field, should we follow the (proposed) deprecation process in #248 ?

schristley commented 5 years ago

The proposed changes were accepted by MiniStd. As this is a MiAIRR field, should we follow the (proposed) deprecation process in #248 ?

I agree.

Any suggestions on how we document this well? The MiAIRR section of the doc under AIRR Standards is pretty sparse, while the data submission section is heavily weighted on NCBI implementation. Should we expand the MiAIRR section, or should we put all this under the AIRR Data Model and the schema definitions?

bcorrie commented 5 years ago

Do we want to suggest or specify a default for age_unit? If not, what does it mean to have an age_min or age_max without an age_unit?

bcorrie commented 5 years ago

Controlled vocabulary for units of age here: https://www.iau.org/publications/proceedings_rules/units/

Quantity Unit: Name Symbol Value
time (1) minute min or " 60 s
time hour h 3600 s = 60 min
time day d 86 400 s = 24 h
time year (Julian) a 31.5576 Ms = 365.25 d
bussec commented 5 years ago

If not, what does it mean to have an age_min or age_max without an age_unit?

Wild guess: That the annotator skipped physics classes in Junior High?

bcorrie commented 5 years ago

Funny, but not helpful 8-)

bussec commented 5 years ago

Darn, I got it wrong again! So lets try helpful, but not funny instead:

A physical quantity without a unit is formally wrong (unless it is dimensionless, which measure of time are not). Therefore IMO the question should not be "what do we default to if people fail to provide it" but rather "how do we keep people from enter invalid data".

bussec commented 5 years ago

Looking at this, there are three more things that crossed my mind:

  1. In #254 we specify age_min and age_max as integer. Should this be number instead (AFAIK there is no separate float data type in OpenAPI, right @schristley ? ).

  2. We now have for the first time the situation that a single MiAIRR field ("Age") is composed of several keywords (age_min, age_max, age_unit ). Would this create any problem?

  3. A very simple way of dealing with our unit discussion would be to always encode time in seconds (by definition) and leave it to the UI to sort out the conversion. We should just avoid using int32 as data type ;-). The downside of this is that it would not be human-readable anymore, so from a MiniStd perspective it is not the preferred solution, but I wanted to hear the DataRep / ComRepo ideas on this.

schristley commented 5 years ago

Controlled vocabulary for units of age here:

I believe we also want weeks (common for mice) and months. My preference is that we don't use abbreviations but have the controlled_vocabulary be the full words.

schristley commented 5 years ago

Do we want to suggest or specify a default for age_unit?

I think having a default, which is potentially wrong, is worse than leaving it blank, though if the default was something like "unspecified" then that might be okay. But I'm still not clear how that is any different from null.

If not, what does it mean to have an age_min or age_max without an age_unit?

It's a validation error. If you want, you can create some new x-airr attributes that defines the relationship, and the validation routines can check for them. Or you can just add these checks into your repository. Or you can just document it. Or do all three.

schristley commented 5 years ago

In #254 we specify age_min and age_max as integer. Should this be number instead (AFAIK there is no separate float data type in OpenAPI

yes, they should be number

We now have for the first time the situation that a single MiAIRR field ("Age") is composed of several keywords (age_min, age_max, age_unit ). Would this create any problem?

not yet! Conceptually I don't see it as a problem as MiAIRR is (abstractly) about information while the three fields are the implementation. Technically, we might have to alter some scripts, and etc., but that's usually no big deal. We might find in the future that we want to do this for other physical quantities as well, age was just the most obvious one.

A very simple way of dealing with our unit discussion would be to always encode time in seconds

yuck!