airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International
35 stars 23 forks source link

Add adc-api-optional to AIRR extension attributes #747

Closed bcorrie closed 4 months ago

bcorrie commented 4 months ago

Closes #658

bcorrie commented 4 months ago

This is a minor definition change, the bulk of the other changes (removing adc-api-optional from the AIRR Spec files) are already in master.

I will merge and close this shortly unless anyone raises a concern.

schristley commented 4 months ago

Hi @bcorrie , I might have missed some discussion about this, but I notice that the fields are names adc_v_subgroup, adc_v_gene and so on with the adc prefix. Was there a particular reason for that, like it conflicts with other names, or was it just to differentiate?

In our "gentleman's agreement" we used names without the adc. Is it possible to remove the adc? The reason I ask is because I believe it will cause issues down the road, or at least make interoperability a little harder.

There are two situations, one is with queries, so ADC queries on those fields need to use those names. For VDJServer that means either 1) reload all the rearrangement data with the proper names, or 2) put in a hack in that translates those field names when constructing the query. So 2) is not too bad, if I need to then I will.

The other situation is with the output. This means that the Rearrangement TSV should have column names adc_v_subgroup and etc. I'm thinking this makes it somewhat confusing for downstream analysis and tools. Because those become the official names, code is written to use them. There's a lot of downstream analysis code that might need to be rewritten to use the names, especially seeing that the convention for many years has been without the adc. For example, the iR+ Stats API has v_subgroup_exists, shouldn't that be adc_v_subgroup_exists to be more accurate? Analysis code that uses those fields would have know to use v_call for the allele but adc_v_subgroup and adc_v_gene for the others.

bcorrie commented 4 months ago

Hi @bcorrie , I might have missed some discussion about this, but I notice that the fields are names adc_v_subgroup, adc_v_gene and so on with the adc prefix. Was there a particular reason for that, like it conflicts with other names, or was it just to differentiate?

Don't your remember #258 and #295 - just those issue and pull request numbers give me nightmares 8-)

They were named that way to differentiate the fields as specific to the ADC API and that it was clear they were not part of the AIRR file standard. They are "convenience" fields to help us query repositories efficiently.

In our "gentleman's agreement" we used names without the adc. Is it possible to remove the adc? The reason I ask is because I believe it will cause issues down the road, or at least make interoperability a little harder.

In looking at it, possibly. The reason for the naming was at least partially driven by the time the fields were actually in the AIRR schemas (e.g. airr_schema.yaml). Although they had the adc-api-optional AIRR attribute, it was considered that these terms were "polluting" the AIRR spec itself. Personally I didn't agree - see the issues as to why - but we eventually agreed and we moved them to the ADC API spec instead.

Since they are now in the ADC API spec, we can change them back to the basic names if we want. The one benefit I see is having them named slightly differently makes it clear that they are NOT AIRR Standard fields, they are ADC API fields.

There are two situations, one is with queries, so ADC queries on those fields need to use those names. For VDJServer that means either 1) reload all the rearrangement data with the proper names, or 2) put in a hack in that translates those field names when constructing the query. So 2) is not too bad, if I need to then I will.

We have such a mapping built in to all of our components, so this is easy for us to do. That is how we deal with spec changes over time. We have mappings for all fields - and yes it is horribly ugly and cumbersome to maintain, but it works (https://github.com/sfu-ireceptor/config). It is a good thing the AIRR Spec changes so slowly 8-)

The other situation is with the output. This means that the Rearrangement TSV should have column names adc_v_subgroup and etc. I'm thinking this makes it somewhat confusing for downstream analysis and tools. Because those become the official names, code is written to use them. There's a lot of downstream analysis code that might need to be rewritten to use the names, especially seeing that the convention for many years has been without the adc. For example, the iR+ Stats API has v_subgroup_exists, shouldn't that be adc_v_subgroup_exists to be more accurate? Analysis code that uses those fields would have know to use v_call for the allele but adc_v_subgroup and adc_v_gene for the others.

Actually according to the Rearrangement TSV spec, these fields are not part of the spec. They are in the ADC API only. Since there is no v_subgroup or no adc_v_subgroup field in the AIRR Rearrangement spec, you can return these fields using either of these column names and it does not break the spec. Downstream tools should never expect these fields to be present, because they are not part of the AIRR Rearrangement spec.

bcorrie commented 4 months ago

Hmm, I just checked and the iReceptor Gateway is querying repositories for v_subgroup, not adc_v_subgroup. That is because there is an error in our config for this field, but this means that the Gateway is still adhering to our "agreement" - which is why it never broke when we went to ADC API v1..2 (which has adc_v_subgroup) 8-)

https://github.com/sfu-ireceptor/config/blob/c677d0ed2cfb4233108d0c141b910009af075c77/AIRR-iReceptorMapping.txt#L316

So I would vote for reverting back to v_subgroup rather than adc_v_subgroup - since it is only in the ADC API spec, and those fields have adc-api-optional listed so it is possible to see which fields in the spec are "helper query fields". It seems awkward to encode this in the field name, I agree.

bcorrie commented 4 months ago

@schristley made the change - if no objections I will merge...

schristley commented 4 months ago

@schristley made the change - if no objections I will merge...

Great! Yes, that definitely makes my life easier. No objections, looks good.