airr-community / airr-formats

PLEASE SEE airr-standards FOR FURTHER DEVELOPMENT: https://github.com/airr-community/airr-standards
MIT License

[vdjc]_call VS [vdjc]_allele #45

Closed bcorrie closed 6 years ago

bcorrie commented 7 years ago

Hello All,

Any reason why the format group has a different name for the VDJC information than the Minimum Standards group? Correct me if I am wrong, but these are reporting the same thing, no? The mapping I have is:

MiAIRR           Formats
cell_index       cell_index
v_allele         v_call
d_allele         d_call
j_allele         j_call
c_allele         c_call
junction_nt      junction_nt
junction_aa      junction_aa
duplicate_count  duplicate_count

Everything is there, but the names are different. Does that imply they are different things or is that an oversight?
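
For illustration only, the mapping in the table above can be written as a simple lookup. This is a minimal sketch: the field names come from the table, while the names `MIAIRR_TO_FORMATS` and `to_formats_record`, and the sample record values, are hypothetical and not part of either spec.

```python
# Field names are copied from the mapping table; everything else is illustrative.
MIAIRR_TO_FORMATS = {
    "cell_index": "cell_index",
    "v_allele": "v_call",
    "d_allele": "d_call",
    "j_allele": "j_call",
    "c_allele": "c_call",
    "junction_nt": "junction_nt",
    "junction_aa": "junction_aa",
    "duplicate_count": "duplicate_count",
}


def to_formats_record(record):
    """Rename MiAIRR-style keys to Formats-style keys.

    Keys that are not in the mapping pass through unchanged.
    """
    return {MIAIRR_TO_FORMATS.get(key, key): value for key, value in record.items()}


# Made-up sample record, only to show the renaming.
example = {"v_allele": "IGHV4-39*01", "duplicate_count": 3}
print(to_formats_record(example))
```

Only the four [vdjc] fields actually change name; the remaining entries are identity mappings and are listed so the lookup mirrors the table.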

schristley commented 7 years ago

They are the same; the names are just out of sync. Formats used [vdjc]_call because those were the field names used by Change-O, so they became the initial design. If we change the field names in the formats spec, we need to file an issue with Change-O to update its field mapping. @javh

javh commented 7 years ago

I don't think these are necessarily the same. For the GenBank submission standards, if that's what we are talking about, it would look something like this:

V_segment       93..388
                /gene="IGHV4-39"
                /allele="01"
                /db_xref="IMGT/GENE-DB:IGHV4-39"
                /inference="similar to DNA sequence:IMGT/HighV-QUEST:1.5.5"

So the V gene and V allele are sub-components of the V inference call. I.e., v_call maps to both v_gene and v_allele.

schristley commented 7 years ago

Okay, so if GenBank needs them separate then we likely also need to separate these as two different fields. Relying upon some parsing rules to extract the gene and allele will bite us down the road IMO.

javh commented 7 years ago

My preference would be to leave the inference as a single [vdjc]_call field.

Not every aligner makes allele-level calls. Plus, there's already a fair amount of parsing that needs to happen to go from the alignment data to the GenBank submission. E.g., you might start with something like Homsap IGHV6-1*01 F,Homsap IGHV6-1*02 F as your v_call, but then need to extract the genes/alleles from that and resolve any ambiguity.
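
As a rough sketch of the kind of parsing involved, the snippet below pulls gene and allele names out of the example v_call string. It assumes IMGT-style names that begin with IG or TR and carry an optional `*allele` suffix; the regular expression and the helper names `split_call` and `unique_in_order` are illustrative assumptions, not part of Change-O or any spec, and other nomenclatures would need different rules.

```python
import re

# Assumption: IMGT-style gene names starting with IG or TR,
# optionally followed by "*" and an allele designation.
GENE_ALLELE = re.compile(r"\b((?:IG|TR)[A-Za-z0-9/-]+)(?:\*(\w+))?")


def split_call(call):
    """Return (gene, allele) pairs found in a call string.

    The allele is None when the call is only gene-level.
    """
    return [(m.group(1), m.group(2)) for m in GENE_ALLELE.finditer(call)]


def unique_in_order(values):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(values))


v_call = "Homsap IGHV6-1*01 F,Homsap IGHV6-1*02 F"
pairs = split_call(v_call)
genes = unique_in_order(gene for gene, _ in pairs)
alleles = unique_in_order(allele for _, allele in pairs if allele is not None)

print(genes)    # ['IGHV6-1']
print(alleles)  # ['01', '02']
```

Here the gene is unambiguous but the allele is not, so a GenBank submission would still need a policy for choosing or omitting the /allele qualifier; the sketch deliberately leaves that decision open.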

bussec commented 7 years ago

If you want to go with a single field (along the lines of MiAIRR), *_call is good as it is generic (in contrast to *_allele, *_gene, etc.), thus I would be in favor of using it instead of the other alternatives.

The issue, as already noted, arises if some downstream process requires the individual components (locus, type, family, number, allele), as it tends to be difficult to parse the information from the string. Also note that the typical IMGT format only applies to humans and is at variance with standard mouse nomenclature.

bcorrie commented 7 years ago

We should ensure that it is very clear which fields in the Formats spec are identical to which fields in the MiAIRR spec. The obvious way to do this is to ensure the field names are the same, but I think we want to make sure that this is explicitly pointed out in the Formats documentation. That is, we should be clear to the outside world that there is (at least I think there is 8-) a direct link between the Formats field X (e.g. v_call) and the MiAIRR field Y (e.g. vgene_allele), assuming that happens to be the mapping. Obviously, this link is clearer if the field names are the same once we have an agreed-on definition.

In fact, it probably makes sense for the Formats YAML definition to include the "6 / data (proc. seq.)" field definitions from the MiAIRR YAML file to ensure that they are indeed the same. This is what we are doing with the iReceptor API YAML/Swagger definition.

schristley commented 7 years ago

When airr-formats is merged into airr-standards then they will be the exact same fields, so don't confuse the temporary situation of there being two specs. There is only one.

laserson commented 7 years ago

Seems like consensus is for _call. I'm closing this with airr-community/airr-standards#33.

javh commented 7 years ago

Reopening just so we remember to change this:

c_call and constant are redundant.

schristley commented 6 years ago

constant was removed.