Bioconductor / GenomicAlignments

Representation and manipulation of short genomic alignments
https://bioconductor.org/packages/GenomicAlignments
8 stars 15 forks source link

mapFromAlignments() and soft-clipping #12

Open hpages opened 4 years ago

hpages commented 4 years ago

This looks wrong:

library(GenomicAlignments)
mypos <- IRanges(1:7, width=1, names=rep("read1", 7))
alignment <- GAlignments("chr1", pos=1L, cigar="2S3M2D7M", strand("+"), names="read1")
mapFromAlignments(mypos, alignment)
# IRanges object with 7 ranges and 0 metadata columns:
#             start       end     width
#         <integer> <integer> <integer>
#   read1         1         1         1
#   read1         2         2         1
#   read1         3         3         1
#   read1         4         4         1
#   read1         5         5         1
#   read1         8         8         1
#   read1         9         9         1

The reported deletion (ref pos 6 & 7) seems wrong. Should be ref pos 4 & 5.

sessionInfo():

> sessionInfo()
R Under development (unstable) (2019-10-30 r77336)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-4.0.r77336/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.0.r77336/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] GenomicAlignments_1.23.1    Rsamtools_2.3.4            
 [3] Biostrings_2.55.4           XVector_0.27.0             
 [5] SummarizedExperiment_1.17.2 DelayedArray_0.13.4        
 [7] BiocParallel_1.21.2         matrixStats_0.55.0         
 [9] Biobase_2.47.2              GenomicRanges_1.39.2       
[11] GenomeInfoDb_1.23.13        IRanges_2.21.3             
[13] S4Vectors_0.25.12           BiocGenerics_0.33.0        

loaded via a namespace (and not attached):
 [1] lattice_0.20-38        crayon_1.3.4           bitops_1.0-6          
 [4] grid_4.0.0             zlibbioc_1.33.1        Matrix_1.2-18         
 [7] tools_4.0.0            RCurl_1.98-1.1         compiler_4.0.0        
[10] GenomeInfoDbData_1.2.2
laurenfitch commented 4 years ago

Hi Herve, I don't know if this is related but I've seen some more odd behavior while using the package today. Please see the example below. The first 2 positions correspond to the 2 Xs. mapFromAlignments reduced the width of the second range to 0.

> mypos <- IRanges(c(52, 67, 68), width = 1, names=rep("read1", 3))
> mapFromAlignments(mypos, GAlignments("chr1", pos=1L, cigar="27S39=1I11=1X14=1X5=", strand("+"), names="read1"))
IRanges object with 3 ranges and 0 metadata columns:
            start       end     width
        <integer> <integer> <integer>
  read1        52        52         1
  read1        67        66         0
  read1        67        67         1
gaoce commented 1 year ago

The issue is at https://github.com/Bioconductor/GenomicAlignments/blob/4237abd8fa0086ceab25d07a68eba1c3187c2c43/src/coordinate_mapping_methods.c#L169-L223

In particular https://github.com/Bioconductor/GenomicAlignments/blob/4237abd8fa0086ceab25d07a68eba1c3187c2c43/src/coordinate_mapping_methods.c#L198-L201

Although query_loc is supposed to have trimmed positions on the query, i.e., after soft clipping, this block of code still consume the clipping on the read.

To demonstrate, if you remove the clipping from the alignment, everything looks fine

mypos <- IRanges(1:7, width=1, names=rep("read1", 7))
alignment <- GAlignments("chr1", pos=1L, cigar="3M2D7M", strand("+"), names="read1")
mapFromAlignments(mypos, alignment)
IRanges object with 7 ranges and 0 metadata columns:
            start       end     width
        <integer> <integer> <integer>
  read1         1         1         1
  read1         2         2         1
  read1         3         3         1
  read1         6         6         1
  read1         7         7         1
  read1         8         8         1
  read1         9         9         1

If you increase the clipping size, even the last 2 positions are wrong:

mypos <- IRanges(1:7, width=1, names=rep("read1", 7))
alignment <- GAlignments("chr1", pos=1L, cigar="7S3M2D7M", strand("+"), names="read1")
mapFromAlignments(mypos, alignment)
IRanges object with 7 ranges and 0 metadata columns:
            start       end     width
        <integer> <integer> <integer>
  read1         1         1         1
  read1         2         2         1
  read1         3         3         1
  read1         4         4         1
  read1         5         5         1
  read1         6         6         1
  read1         7         7         1

Also, soft clipping on the other side is also an issue:

mypos <- IRanges(10:11, width=1, names=rep("read1", 2))
alignment <- GAlignments("chr1", pos=1L, cigar="3M2D7M2S", strand("+"), names="read1")
mapFromAlignments(mypos, alignment)
IRanges object with 2 ranges and 0 metadata columns:
            start       end     width
        <integer> <integer> <integer>
  read1        12        12         1
  read1        13        13         1

One way to fix this is to change this function (to_ref) to not consume the soft clipped bases.

hpages commented 1 year ago

Hi @gaoce ,

Thanks for taking the time to look at the code. Would you be willing to submit a PR? Problem is that I'm not familiar with the implementation of mapFromAlignments() (the function was implemented many years ago by someone who's left the Bioconductor sphere), and I never really found the time to familiarize myself with it.

Best, H.

gaoce commented 1 year ago

Hi @hpages,

Sure. I will find some time to work on it.

fedxa commented 9 months ago

Hi @gaoce and @hpages,

I've bumped in the same problem identified by @hpages. I'd like to point out some further subtleties in mapFrom and mapTo with soft/hard clips. While the genome coordinates are uniquely defined for an alignment, the coordinates within the alignement itself can be defined in several ways: within the mapped part of the read without clipped parts, or counting the soft (or hard) clipped parts also. Note, that both options can be useful depending on the question asked. For example, for soft clipped reads the whole sequence (and qualities) is stored in the record, so indexing into it requires the coordinates including the clipped parts. In case of hard clipped alignments usually discarding the clipped parts is sensible, but for liftover like tasks, when alignment is not a read but an alternative shifted genome, it is required to get coordinates in the alignment including hard clips (this is the case with minimap2 assembly to assembly mapping).

For example, for hard clipped alignment GAlignments("chr1", pos=3L, cigar="1H2M1H", strand("+"), names="read1") The possible mappings can be

Mapped Read portion Pos      1 2
Read seqeuence Pos           1 2
Read position                2 3  
Read CIGAR                 H M M H
chr1 pos                 1 2 3 4 5 6

And for the soft clipped case GAlignments("chr1", pos=3L, cigar="1S2M1S", strand("+"), names="read1")

Mapped Read portion Pos      1 2
Read seqeuence Pos           2 3 
Read position                2 3  
Read CIGAR                 S M M S
chr1 pos                 1 2 3 4 5 6

To add to the confusion, in the current implementation mapFromAlignments uses the convention "mapped read position", and mapToAlignments uses "read sequence position"... This makes these two operations not to be each other inverse in case of the soft clipped alignments.

I think that a good option is to add an additional parameter style = mapped|sequence|readpos to the mapping functions, which would select one of the three options, with the default being the current implementation (mapped for mapFromAlignments and sequence for mapToAlignments). Then the existing code should not break, while allowing for all options to be selected for each use case. I am adding a pull request #35 that implements this, and fixes the original bug mentioned in the issue.

fedxa commented 9 months ago

So, the suggested behaviour should be like this

readpos <- GRanges("r1",IRanges(1:6, width=1, names=rep("read1",6)))
chrpos <- GRanges("chr1",IRanges(1:6, width=1, names=LETTERS[1:6]))

Suggested mapFromAlignments behaviour

In this case "mapped" is the default if called without style parameter

alignment <- GAlignments("chr1", pos=3L, cigar="2M", strand("+"), names="read1")
mapFromAlignments(readpos, alignment)

## GRanges object with 2 ranges and 2 metadata columns:
##         seqnames    ranges strand |     xHits alignmentsHits
##            <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   read1     chr1         3      * |         1              1
##   read1     chr1         4      * |         2              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

alignment <- GAlignments("chr1", pos=3L, cigar="1H2M1H", strand("+"), names="read1")
mapFromAlignments(readpos, alignment, style="mapped")

## GRanges object with 2 ranges and 2 metadata columns:
##         seqnames    ranges strand |     xHits alignmentsHits
##            <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   read1     chr1         3      * |         1              1
##   read1     chr1         4      * |         2              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mapFromAlignments(readpos, alignment, style="sequence")

## GRanges object with 2 ranges and 2 metadata columns:
##         seqnames    ranges strand |     xHits alignmentsHits
##            <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   read1     chr1         3      * |         1              1
##   read1     chr1         4      * |         2              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mapFromAlignments(readpos, alignment, style="readpos")

## GRanges object with 2 ranges and 2 metadata columns:
##         seqnames    ranges strand |     xHits alignmentsHits
##            <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   read1     chr1         3      * |         2              1
##   read1     chr1         4      * |         3              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

alignment <- GAlignments("chr1", pos=3L, cigar="1S2M1S", strand("+"), names="read1")
mapFromAlignments(readpos, alignment, style="mapped")

## GRanges object with 2 ranges and 2 metadata columns:
##         seqnames    ranges strand |     xHits alignmentsHits
##            <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   read1     chr1         3      * |         1              1
##   read1     chr1         4      * |         2              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mapFromAlignments(readpos, alignment, style="sequence")

## GRanges object with 2 ranges and 2 metadata columns:
##         seqnames    ranges strand |     xHits alignmentsHits
##            <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   read1     chr1         3      * |         2              1
##   read1     chr1         4      * |         3              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mapFromAlignments(readpos, alignment, style="readpos")

## GRanges object with 2 ranges and 2 metadata columns:
##         seqnames    ranges strand |     xHits alignmentsHits
##            <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   read1     chr1         3      * |         2              1
##   read1     chr1         4      * |         3              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Suggested mapToAlignments behaviour

In this case "sequence" is the default if called without style parameter

alignment <- GAlignments("chr1", pos=3L, cigar="2M", strand("+"), names="read1")
mapToAlignments(chrpos, alignment)

## GRanges object with 2 ranges and 2 metadata columns:
##     seqnames    ranges strand |     xHits alignmentsHits
##        <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   C    read1         1      * |         3              1
##   D    read1         2      * |         4              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

alignment <- GAlignments("chr1", pos=3L, cigar="1H2M1H", strand("+"), names="read1")
mapToAlignments(chrpos, alignment, style="mapped")

## GRanges object with 2 ranges and 2 metadata columns:
##     seqnames    ranges strand |     xHits alignmentsHits
##        <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   C    read1         1      * |         3              1
##   D    read1         2      * |         4              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mapToAlignments(chrpos, alignment, style="sequence")

## GRanges object with 2 ranges and 2 metadata columns:
##     seqnames    ranges strand |     xHits alignmentsHits
##        <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   C    read1         1      * |         3              1
##   D    read1         2      * |         4              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mapToAlignments(chrpos, alignment, style="readpos")

## GRanges object with 2 ranges and 2 metadata columns:
##     seqnames    ranges strand |     xHits alignmentsHits
##        <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   C    read1         2      * |         3              1
##   D    read1         3      * |         4              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

alignment <- GAlignments("chr1", pos=3L, cigar="1S2M1S", strand("+"), names="read1")
mapToAlignments(chrpos, alignment, style="mapped")

## GRanges object with 2 ranges and 2 metadata columns:
##     seqnames    ranges strand |     xHits alignmentsHits
##        <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   C    read1         1      * |         3              1
##   D    read1         2      * |         4              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mapToAlignments(chrpos, alignment, style="sequence")

## GRanges object with 2 ranges and 2 metadata columns:
##     seqnames    ranges strand |     xHits alignmentsHits
##        <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   C    read1         2      * |         3              1
##   D    read1         3      * |         4              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mapToAlignments(chrpos, alignment, style="readpos")

## GRanges object with 2 ranges and 2 metadata columns:
##     seqnames    ranges strand |     xHits alignmentsHits
##        <Rle> <IRanges>  <Rle> | <integer>      <integer>
##   C    read1         2      * |         3              1
##   D    read1         3      * |         4              1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths