airr-community / airr-formats

PLEASE SEE airr-standards FOR FURTHER DEVELOPMENT: https://github.com/airr-community/airr-standards
MIT License

Finalize mandatory/non-mandatory #8

Closed laserson closed 7 years ago

schristley commented 7 years ago

Currently the "constant" column is marked as mandatory. How do we know what the constant region is? I think a lot of data doesn't sequence far enough into the constant region to know.

schristley commented 7 years ago

The minimal standards group lists the germline database, the V,D,J gene calls, the CDR3 aa and na sequences, and the read count as their required fields. We have the germline database in the metadata, and the V,D,J gene calls are in our mandatory list. The other 3 fields are not in our mandatory list, I think we should add them.

javh commented 7 years ago

I don't think the CDR3 amino acid sequence, or any amino acid sequence, should be mandatory, but we should define a standard naming for them in the optional fields. (See #7)

schristley commented 7 years ago

The minimal standards WG is making the CDR3 aa and na sequences mandatory, and they are about to enshrine that in an AIRR manuscript. So we might need to have a joint meeting if this needs to be hashed out. My guess is that their viewpoint is CDR3 is such a commonly studied sequence, that it is provided for convenience. Otherwise, they need to interpret the CIGAR strings, which requires coding, and somebody who understands CIGAR...

laserson commented 7 years ago

And also to clarify, "mandatory" fields simply mean that when writing an AIRR-formatted file, that column must be present in the TSV file. But the values in that column are free to be null.

I'm not sure about requiring translations, and I don't think we should have to follow the minimal standards group's lead here. They are mandating reporting of those annotations as a convenience. And I'm guessing you could report null values for them as well. But I'd hate to have a situation where we mandate both DNA and AA seqs and an aligner feels free to just fill one of them out. (i.e., if the TSV file contains a cdr3_aa column, it should be there because the tool is using it, not just because it's required.)
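A minimal sketch of that reading of "mandatory", where only the presence of the column in the TSV header is checked and empty values are accepted as null. The column names in `MANDATORY_COLUMNS` and the helper `missing_mandatory_columns` are hypothetical and only for illustration; they are not the list agreed in this thread.

```python
import csv
import io

# Hypothetical mandatory column names, for illustration only.
MANDATORY_COLUMNS = ["sequence_id", "sequence", "v_call", "d_call", "j_call"]


def missing_mandatory_columns(tsv_text, mandatory=MANDATORY_COLUMNS):
    """Return the mandatory columns that are absent from the TSV header.

    Only the header is inspected: a column that is present but empty
    in every row still counts as satisfying the requirement.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    return [name for name in mandatory if name not in header]


# d_call is empty (null) in this row, yet the file passes the check.
example = (
    "sequence_id\tsequence\tv_call\td_call\tj_call\n"
    "read1\tACGT\tIGHV1-2*01\t\tIGHJ4*02\n"
)

if __name__ == "__main__":
    print(missing_mandatory_columns(example))
```

A stricter validator could additionally flag columns that are null in every row, which is the situation the second paragraph above is worried about.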

schristley commented 7 years ago

This got me thinking, is it actually possible to get the CDR3 sequence given the mandatory fields? Taking the cigar fields, and the query sequence, should reconstruct the alignment with germline genes, but then don't we need positional information?

laserson commented 7 years ago

That's a good point...strictly speaking it does not seem possible, since you'd need access to the germline database in order to liftover the CDR3 start/end positions.

scharch commented 7 years ago

@laserson Aren't the [cdr3_start, cdr3_end] coordinates on the read itself (as given in the sequence column)?

schristley commented 7 years ago

@scharch yes, and it needs to determine those coordinates from the germline database (or some other way?). Jason can probably describe how ChangeO does it, but for our repsum tool, we pull out the annotated CYS (for the start) and TRP/PHE (for the end) from the IMGT database, then translate those positions from the germline sequence onto the query sequence given the alignment info. Onto the V alignment for start and onto the J alignment for end. Then there are various checks, like does the query sequence actually cover that whole region, and do the CYS and TRP/PHE actually exist. So our standard will need to mandate position information, otherwise every tool would need to do that above calculation itself to extract the CDR3 sequence.
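The position translation described above can be sketched roughly as follows. This is not the repsum or ChangeO code; it assumes a SAM-style CIGAR in which the germline is the reference, 0-based coordinates, and a hypothetical helper name `germline_to_query`. It returns `None` when the germline position is deleted in the query or not covered by the alignment, which corresponds to the coverage checks mentioned.

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def germline_to_query(cigar, germline_pos, germline_start=0, query_start=0):
    """Map a 0-based germline position onto the query sequence.

    germline_start: germline position where the alignment begins.
    query_start: query position where the CIGAR begins.
    Returns the 0-based query position, or None if the germline position
    is deleted in the query or lies outside the aligned region.
    """
    g, q = germline_start, query_start
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":  # consumes both germline and query
            if g <= germline_pos < g + n:
                return q + (germline_pos - g)
            g += n
            q += n
        elif op in "IS":  # insertion or soft clip: consumes query only
            q += n
        elif op in "DN":  # deletion or skip: consumes germline only
            if g <= germline_pos < g + n:
                return None
            g += n
        # H and P consume neither sequence
    return None


if __name__ == "__main__":
    # Germline position 7 sits after a 2-base insertion in the query.
    print(germline_to_query("5M2I5M", 7))
```

With a helper like this, the CDR3 start would come from lifting the annotated CYS position through the V alignment and the end from lifting the TRP/PHE position through the J alignment, followed by a check that the expected residues are actually present in the query.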

javh commented 7 years ago

We use the same basic process to redefine the CDR3 from IgBLAST output. Find conserved residue motifs in the germline sequence, then find the matching positions in the query sequence using the alignment start/end positions, and define the CDR3 boundaries from that info.

CDR1/2 are defined by converting the output to IMGT-gapped sequences through matching against the IMGT-gapped V-segment germlines.

schristley commented 7 years ago

So it sounds like based upon the discussion here, plus in #10 that we need to add these fields and make them mandatory?

- germline_v_sequence
- germline_d_sequence
- germline_j_sequence
- start and end positions for framework and cdr regions (on the query)

Should we also add the start/end positions for framework and cdr regions on the germline? I'm not sure if they can be calculated purely from the CIGAR. Maybe add them just for convenience?

scharch commented 7 years ago

> it needs to determine those coordinates from the germline database... ...need to mandate position information, otherwise every tool would need to do that above calculation itself

I'm still confused. Do you just mean that we need to specify, say, the IMGT definition vs the kabat definition? Or just that it might be worth making cdr3_start and cdr3_end into mandatory fields? As long as those are present, the strategy used by the original tool shouldn't matter, right?


> add these fields and make them mandatory...

I am not a fan of this, as it will clutter the file and make it much more difficult to inspect manually. It also ends up being enormously redundant (for a file with an entire repertoire), but that's probably less of a concern. It sounds like we're perhaps moving back toward a multiple-file format anyway (#13), so maybe the spec could recommend/require that a germline sequence file be carried together with the annotation file?

schristley commented 7 years ago

> Or just that it might be worth making cdr3_start and cdr3_end into mandatory fields?

Yes, and not just cdr3_start and cdr3_end but the other position fields as well, because some tool might want to analyze (say) fwr2 mutations.
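As a rough illustration of why position fields help downstream tools: with `<region>_start` and `<region>_end` present, a region can be pulled straight from the sequence column without touching the CIGAR or the germline database. The sketch assumes 1-based inclusive coordinates on the query and a hypothetical helper `extract_region`; the coordinate convention is an assumption and is not settled in this thread.

```python
def extract_region(record, region):
    """Return the subsequence for a region such as 'cdr3' or 'fwr2'.

    record: one row of the annotation file as a dict of strings.
    Assumes 1-based inclusive <region>_start / <region>_end coordinates
    on the 'sequence' column. Returns None when the positions are null
    or fall outside the sequence.
    """
    start = record.get(f"{region}_start")
    end = record.get(f"{region}_end")
    if start in (None, "") or end in (None, ""):
        return None
    start, end = int(start), int(end)
    seq = record["sequence"]
    if not (1 <= start <= end <= len(seq)):
        return None
    return seq[start - 1:end]


if __name__ == "__main__":
    row = {"sequence": "AAATGTGCGAGATGGGGC", "cdr3_start": "4", "cdr3_end": "12"}
    print(extract_region(row, "cdr3"))
    print(extract_region(row, "fwr2"))  # no fwr2 positions in this row
```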

laserson commented 7 years ago

I am also against replicating the germline sequences into the file. There should be a pointer in the header to the version of the germline database being used.

schristley commented 7 years ago

But how do you get the germline database? IMGT doesn't allow theirs to be redistributed, so for example, VDJServer cannot put its db up on the web for tools to use. Does it make sense to carry some of the information forward, or do we essentially punt on this issue?

I'm on the germline database WG, and it's pretty clear the IMGT license is not going to be changed. The best we've been able to work out with them is that they will work with us to enhance their current db to store additional information, e.g. inferred alleles.

laserson commented 7 years ago

I am assuming that anyone working with this data will have a local copy of the germline database available. So VDJServer will have its local copy, which it can use to reconstruct the germline when relevant. Or if I am doing analysis on my personal laptop, for example, I'd probably have a local copy there.

nishmm commented 7 years ago

I second replicating germline sequences in the file or in a separate file (with pointers in the annotation file) in cases where pointers to standard references like IMGT germline DB versions are unavailable or when those standard versions are not used. This matters especially because the use of a customized set of germline gene sequences (inferred alleles, limiting germline genes/alleles to those from a particular mouse strain, etc.) will become, if it is not already, a more common practice. The sequences would appear once, as part of the header. This works if the annotated sequences in the file are from one instance of annotation analysis and not from two different annotation analyses with two different sets of reference germline gene sequences.

laserson commented 7 years ago

That's an interesting point, @nishmm. Rather than replicating the germlines into each record in the file, we could put supplementary germline seqs into the YAML header without incurring a huge cost. Personally, I don't like this solution either. My impression is that the OGRDB format should make it very easy to specify custom germline sets, so I'd still vote for simply putting a pointer to your particular germline set in the YAML. You can distribute the OGRDB file separately if you want.

laserson commented 7 years ago

Closing for now. Will be addressed in the metadata schema that will be added.