Read sequences don't have overlapping with gene, but eachoverlap() reported there're.

Xiao-Zhong commented 1 month ago

thumbnail_image

As shown, the gene "ENSMUSG00000080478.1" wasn't mapped by any read, though there're reads mapping to a nearby multi-exon gene "ENSMUSG00000041560.12". In fact, the first gene is located in one intron of the second.

Please help address the issue. Thanks!

kescobo commented 1 month ago

Copying from Slack

jbieler Today at 3:16 AM Hum, I think the issue is that those reads do overlap the gene (they start before and end after) but do not cover it because of the gaps in the alignement. You need to compute some kind of coverage using the reads alignement.

jbieler Today at 3:34 AM Something like this should do it :
aln = XAM.BAM.alignment(read)

# check if the read match any position in the target interval
for position in ...
    BioAlignments.ref2seq(aln, position)[2] == BioAlignments.OP_MATCH #it covers
end
(edited)

Xiao Zhong Today at 4:35 AM Agree. eachoverlap() might only compare the positions of read and feature ends. It should be ok for most DNA read alignments, but will have the issue above caused by gaps like split reads from RNA sequencing. I got alignment ranges with BAM.position(record) and BAM.rightposition(record), will compare them with features (genes, exons, introns, etc). I'm new with XAM.jl. Thank you!

jbieler Today at 4:58 AM Look into ref2seq & seq2ref too, they allow you to check if a reference position is covered by a read, or where a position within the read is mapping to in the reference. (edited)

Xiao-Zhong commented 1 month ago

Hi Kevin,

Unfortunately, I haven't solved the issue: how to check overlapping between a feature and the sequences aligned (or check if a feature is located within gaps if there're).

Why I'm interested in this, if a feature isn't really aligned by reads, these reads are not from the feature.

I may have to parse CIGAR string to get the coordinates of alignment blocks. Dose XAM.jl (or other package) have a function for this? Thanks!

BYW, I've asked the question on Slack. I commented here for others who will probably encounter the issue as well.

kescobo commented 1 month ago

We have some CIGAR functionality in BioAlignments.jl and some in XAM.jl. @jakobnissen recently started https://github.com/BioJulia/CIGARStrings.jl, but I think that's just a prototype

jonathanBieler commented 1 month ago

BAM.alignment and Alignment's methods should give you all you need, it's already parsing the cigar internally, as you can see in the list of anchors :

aln = BAM.alignment(r)
BioAlignments.Alignment:
  aligned range:
    seq: 0-141
    ref: 198265943-198266049
  CIGAR string: 77M1I2M1I3M2D22M35S

julia> aln.anchors
9-element Vector{BioAlignments.AlignmentAnchor}:
 AlignmentAnchor(0, 198265943, 0, '0')
 AlignmentAnchor(77, 198266020, 77, 'M')
 AlignmentAnchor(78, 198266020, 78, 'I')
 AlignmentAnchor(80, 198266022, 80, 'M')
 AlignmentAnchor(81, 198266022, 81, 'I')
 AlignmentAnchor(84, 198266025, 84, 'M')
 AlignmentAnchor(84, 198266027, 86, 'D')
 AlignmentAnchor(106, 198266049, 108, 'M')
 AlignmentAnchor(141, 198266049, 143, 'S')

Xiao-Zhong commented 1 month ago

OK. Thanks! The Alignment Operation below is what I wanted.

BioAlignments.OP_SKIP — Constant

'N': (typically long) deletion from the reference, e.g. due to RNA splicing

BioJulia / XAM.jl

Read sequences don't have overlapping with gene, but eachoverlap() reported there're. #75