FWR/CDR annotations - Githubissues

airr-community / airr-formats

PLEASE SEE airr-standards FOR FURTHER DEVELOPMENT: https://github.com/airr-community/airr-standards

MIT License

1 stars 2 forks source link

FWR/CDR annotations #10

Closed laserson closed 7 years ago

laserson commented 7 years ago

Do we have absolute coordinates or do we have coordinate + offset

schristley commented 7 years ago

There are two coordinates to think about. The coordinates of the regions for the germline V genes, and where those coordinates lie in the query sequence when aligned with the germline. Technically the coordinates for the germline reside within the germline database, but they may not be easy to extract.

My vote would be absolute coordinates, I think that would be easier for some calculation where indels are involved.

scharch commented 7 years ago

I agree with @schristley - better to have each field stand independently, rather than having to combine, say, v_start and cdr3_start to extract the CDR3.

javh commented 7 years ago

I don't have a strong opinion on whether we use [start, end] vs [start, length].

I'm inclined to think we should include both query (input sequence) and reference (germline) positions. However, we could have the germline positions be fields derived from the CIGAR string.

laserson commented 7 years ago

I am a fan of [start, end] as it's more consistent with Python slice notation, which I personally find more intuitive. It also makes more sense to me in cases where you can't make an alignment. For example, if you annotate a read that is truncated starting in the CDR3 through the 3' end of the VDJ region, it would be strange to me to have [cdr3_start, cdr3_length]. I'd rather just set the cdr3_end value to be a null value.

I'm not sure why the germline coords need to be stored in the annotation, as long as you mark exactly which germline reference set you used. (Which should be clearly specified in the metadata.) In my mind, the annotation performs a "liftover" operation onto the read. And the germline should be reconstructable (approx?) from the CIGAR string.

javh commented 7 years ago

To get the germline start position from the CIGAR string we would have to enforce the hard masking. Maybe we'd also have to enforce soft masking for situations with truncated germline reference sequences? Not sure about that. Which, I think we should probably do anyway, but as I understand it that's less common.

From: http://genome.sph.umich.edu/wiki/SAM#What_is_a_CIGAR.3F

RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  G  C  T

With the alignment above, you get:
POS: 5
CIGAR: 3M1I3M1D5M

We would have to make it 5H3M1I3M1D5M5H instead. (If I understand correctly.)

My only real concern with extracting it from the CIGAR string is compliance - the aligner would have to write a CIGAR string with the optional masking commands. Without it, you couldn't rebuild the full germline from the aligner's output.

schristley commented 7 years ago

Now that I'm reading up on CIGAR, I see a couple other things. When a mismatch occurs, it doesn't indicate which nucleotide changed and where. Also, when a deletion occurs, it loses the nucleotides in the germline sequence. It seems that the germline sequence is required in order to completely reconstruct the alignment, and really do any analysis?

scharch commented 7 years ago

I think that way back we had discussed a more complete format that would include that kind of information, though the details elude me now...

javh commented 7 years ago

Yeah, direct comparison of the full inferred germline sequence against the observed sequence is essential for a lot of analyses. I think we want to make that as easy as possible.

laserson commented 7 years ago

IIRC, there is a variant of CIGAR that lets you specify the mismatched bases?

nishmm commented 7 years ago

I remember from that discussion that the CIGAR was selected because the germline sequences will be made available anyway, either through pointers to a standard dB version or through replicating them in the AIRR format file(s).

laserson commented 7 years ago

k, seems like there isn't a big change of direction here. gonna close this.