Open hpages opened 4 years ago
Hi Herve,
I don't know if this is related but I've seen some more odd behavior while using the package today. Please see the example below. The first 2 positions correspond to the 2 X
s. mapFromAlignments
reduced the width of the second range to 0.
> mypos <- IRanges(c(52, 67, 68), width = 1, names=rep("read1", 3))
> mapFromAlignments(mypos, GAlignments("chr1", pos=1L, cigar="27S39=1I11=1X14=1X5=", strand("+"), names="read1"))
IRanges object with 3 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
read1 52 52 1
read1 67 66 0
read1 67 67 1
The issue is at https://github.com/Bioconductor/GenomicAlignments/blob/4237abd8fa0086ceab25d07a68eba1c3187c2c43/src/coordinate_mapping_methods.c#L169-L223
Although query_loc
is supposed to have trimmed positions on the query, i.e., after soft clipping, this block of code still consume the clipping on the read.
To demonstrate, if you remove the clipping from the alignment, everything looks fine
mypos <- IRanges(1:7, width=1, names=rep("read1", 7))
alignment <- GAlignments("chr1", pos=1L, cigar="3M2D7M", strand("+"), names="read1")
mapFromAlignments(mypos, alignment)
IRanges object with 7 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
read1 1 1 1
read1 2 2 1
read1 3 3 1
read1 6 6 1
read1 7 7 1
read1 8 8 1
read1 9 9 1
If you increase the clipping size, even the last 2 positions are wrong:
mypos <- IRanges(1:7, width=1, names=rep("read1", 7))
alignment <- GAlignments("chr1", pos=1L, cigar="7S3M2D7M", strand("+"), names="read1")
mapFromAlignments(mypos, alignment)
IRanges object with 7 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
read1 1 1 1
read1 2 2 1
read1 3 3 1
read1 4 4 1
read1 5 5 1
read1 6 6 1
read1 7 7 1
Also, soft clipping on the other side is also an issue:
mypos <- IRanges(10:11, width=1, names=rep("read1", 2))
alignment <- GAlignments("chr1", pos=1L, cigar="3M2D7M2S", strand("+"), names="read1")
mapFromAlignments(mypos, alignment)
IRanges object with 2 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
read1 12 12 1
read1 13 13 1
One way to fix this is to change this function (to_ref
) to not consume the soft clipped bases.
Hi @gaoce ,
Thanks for taking the time to look at the code. Would you be willing to submit a PR? Problem is that I'm not familiar with the implementation of mapFromAlignments()
(the function was implemented many years ago by someone who's left the Bioconductor sphere), and I never really found the time to familiarize myself with it.
Best, H.
Hi @hpages,
Sure. I will find some time to work on it.
Hi @gaoce and @hpages,
I've bumped in the same problem identified by @hpages. I'd like to point out some further subtleties in mapFrom and mapTo with soft/hard clips. While the genome coordinates are uniquely defined for an alignment, the coordinates within the alignement itself can be defined in several ways: within the mapped part of the read without clipped parts, or counting the soft (or hard) clipped parts also. Note, that both options can be useful depending on the question asked. For example, for soft clipped reads the whole sequence (and qualities) is stored in the record, so indexing into it requires the coordinates including the clipped parts. In case of hard clipped alignments usually discarding the clipped parts is sensible, but for liftover like tasks, when alignment is not a read but an alternative shifted genome, it is required to get coordinates in the alignment including hard clips (this is the case with minimap2 assembly to assembly mapping).
For example, for hard clipped alignment GAlignments("chr1", pos=3L, cigar="1H2M1H", strand("+"), names="read1")
The possible mappings can be
Mapped Read portion Pos 1 2
Read seqeuence Pos 1 2
Read position 2 3
Read CIGAR H M M H
chr1 pos 1 2 3 4 5 6
And for the soft clipped case GAlignments("chr1", pos=3L, cigar="1S2M1S", strand("+"), names="read1")
Mapped Read portion Pos 1 2
Read seqeuence Pos 2 3
Read position 2 3
Read CIGAR S M M S
chr1 pos 1 2 3 4 5 6
To add to the confusion, in the current implementation mapFromAlignments
uses the convention "mapped read position", and mapToAlignments
uses "read sequence position"... This makes these two operations not to be each other inverse in case of the soft clipped alignments.
I think that a good option is to add an additional parameter style = mapped|sequence|readpos
to the mapping functions, which would select one of the three options, with the default being the current implementation (mapped for mapFromAlignments
and sequence for mapToAlignments
). Then the existing code should not break, while allowing for all options to be selected for each use case. I am adding a pull request #35 that implements this, and fixes the original bug mentioned in the issue.
So, the suggested behaviour should be like this
readpos <- GRanges("r1",IRanges(1:6, width=1, names=rep("read1",6)))
chrpos <- GRanges("chr1",IRanges(1:6, width=1, names=LETTERS[1:6]))
In this case "mapped" is the default if called without style parameter
alignment <- GAlignments("chr1", pos=3L, cigar="2M", strand("+"), names="read1")
mapFromAlignments(readpos, alignment)
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | xHits alignmentsHits
## <Rle> <IRanges> <Rle> | <integer> <integer>
## read1 chr1 3 * | 1 1
## read1 chr1 4 * | 2 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
alignment <- GAlignments("chr1", pos=3L, cigar="1H2M1H", strand("+"), names="read1")
mapFromAlignments(readpos, alignment, style="mapped")
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | xHits alignmentsHits
## <Rle> <IRanges> <Rle> | <integer> <integer>
## read1 chr1 3 * | 1 1
## read1 chr1 4 * | 2 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
mapFromAlignments(readpos, alignment, style="sequence")
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | xHits alignmentsHits
## <Rle> <IRanges> <Rle> | <integer> <integer>
## read1 chr1 3 * | 1 1
## read1 chr1 4 * | 2 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
mapFromAlignments(readpos, alignment, style="readpos")
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | xHits alignmentsHits
## <Rle> <IRanges> <Rle> | <integer> <integer>
## read1 chr1 3 * | 2 1
## read1 chr1 4 * | 3 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
alignment <- GAlignments("chr1", pos=3L, cigar="1S2M1S", strand("+"), names="read1")
mapFromAlignments(readpos, alignment, style="mapped")
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | xHits alignmentsHits
## <Rle> <IRanges> <Rle> | <integer> <integer>
## read1 chr1 3 * | 1 1
## read1 chr1 4 * | 2 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
mapFromAlignments(readpos, alignment, style="sequence")
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | xHits alignmentsHits
## <Rle> <IRanges> <Rle> | <integer> <integer>
## read1 chr1 3 * | 2 1
## read1 chr1 4 * | 3 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
mapFromAlignments(readpos, alignment, style="readpos")
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | xHits alignmentsHits
## <Rle> <IRanges> <Rle> | <integer> <integer>
## read1 chr1 3 * | 2 1
## read1 chr1 4 * | 3 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
In this case "sequence" is the default if called without style parameter
alignment <- GAlignments("chr1", pos=3L, cigar="2M", strand("+"), names="read1")
mapToAlignments(chrpos, alignment)
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | xHits alignmentsHits
## <Rle> <IRanges> <Rle> | <integer> <integer>
## C read1 1 * | 3 1
## D read1 2 * | 4 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
alignment <- GAlignments("chr1", pos=3L, cigar="1H2M1H", strand("+"), names="read1")
mapToAlignments(chrpos, alignment, style="mapped")
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | xHits alignmentsHits
## <Rle> <IRanges> <Rle> | <integer> <integer>
## C read1 1 * | 3 1
## D read1 2 * | 4 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
mapToAlignments(chrpos, alignment, style="sequence")
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | xHits alignmentsHits
## <Rle> <IRanges> <Rle> | <integer> <integer>
## C read1 1 * | 3 1
## D read1 2 * | 4 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
mapToAlignments(chrpos, alignment, style="readpos")
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | xHits alignmentsHits
## <Rle> <IRanges> <Rle> | <integer> <integer>
## C read1 2 * | 3 1
## D read1 3 * | 4 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
alignment <- GAlignments("chr1", pos=3L, cigar="1S2M1S", strand("+"), names="read1")
mapToAlignments(chrpos, alignment, style="mapped")
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | xHits alignmentsHits
## <Rle> <IRanges> <Rle> | <integer> <integer>
## C read1 1 * | 3 1
## D read1 2 * | 4 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
mapToAlignments(chrpos, alignment, style="sequence")
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | xHits alignmentsHits
## <Rle> <IRanges> <Rle> | <integer> <integer>
## C read1 2 * | 3 1
## D read1 3 * | 4 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
mapToAlignments(chrpos, alignment, style="readpos")
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | xHits alignmentsHits
## <Rle> <IRanges> <Rle> | <integer> <integer>
## C read1 2 * | 3 1
## D read1 3 * | 4 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
This looks wrong:
The reported deletion (ref pos 6 & 7) seems wrong. Should be ref pos 4 & 5.
sessionInfo():