Closed laserson closed 7 years ago
There are two coordinates to think about. The coordinates of the regions for the germline V genes, and where those coordinates lie in the query sequence when aligned with the germline. Technically the coordinates for the germline reside within the germline database, but they may not be easy to extract.
My vote would be absolute coordinates, I think that would be easier for some calculation where indels are involved.
I agree with @schristley - better to have each field stand independently, rather than having to combine, say, v_start and cdr3_start to extract the CDR3.
I don't have a strong opinion on whether we use [start, end]
vs [start, length]
.
I'm inclined to think we should include both query (input sequence) and reference (germline) positions. However, we could have the germline positions be fields derived from the CIGAR string.
I am a fan of [start, end]
as it's more consistent with Python slice notation, which I personally find more intuitive. It also makes more sense to me in cases where you can't make an alignment. For example, if you annotate a read that is truncated starting in the CDR3 through the 3' end of the VDJ region, it would be strange to me to have [cdr3_start, cdr3_length]
. I'd rather just set the cdr3_end
value to be a null
value.
I'm not sure why the germline coords need to be stored in the annotation, as long as you mark exactly which germline reference set you used. (Which should be clearly specified in the metadata.) In my mind, the annotation performs a "liftover" operation onto the read. And the germline should be reconstructable (approx?) from the CIGAR string.
To get the germline start position from the CIGAR string we would have to enforce the hard masking. Maybe we'd also have to enforce soft masking for situations with truncated germline reference sequences? Not sure about that. Which, I think we should probably do anyway, but as I understand it that's less common.
From: http://genome.sph.umich.edu/wiki/SAM#What_is_a_CIGAR.3F
RefPos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Reference: C C A T A C T G A A C T G A C T A A C
Read: A C T A G A A T G G C T
With the alignment above, you get:
POS: 5
CIGAR: 3M1I3M1D5M
We would have to make it 5H3M1I3M1D5M5H instead. (If I understand correctly.)
My only real concern with extracting it from the CIGAR string is compliance - the aligner would have to write a CIGAR string with the optional masking commands. Without it, you couldn't rebuild the full germline from the aligner's output.
Now that I'm reading up on CIGAR, I see a couple other things. When a mismatch occurs, it doesn't indicate which nucleotide changed and where. Also, when a deletion occurs, it loses the nucleotides in the germline sequence. It seems that the germline sequence is required in order to completely reconstruct the alignment, and really do any analysis?
I think that way back we had discussed a more complete format that would include that kind of information, though the details elude me now...
Yeah, direct comparison of the full inferred germline sequence against the observed sequence is essential for a lot of analyses. I think we want to make that as easy as possible.
IIRC, there is a variant of CIGAR that lets you specify the mismatched bases?
I remember from that discussion that the CIGAR was selected because the germline sequences will be made available anyway, either through pointers to a standard dB version or through replicating them in the AIRR format file(s).
k, seems like there isn't a big change of direction here. gonna close this.
Do we have absolute coordinates or do we have coordinate + offset