foerstner-lab / READemption

A pipeline for the computational evaluation of RNA-Seq data
https://reademption.readthedocs.io
Other
37 stars 19 forks source link

Constant background value in coverage files #55

Closed gprezza closed 8 months ago

gprezza commented 1 year ago

Running v. 2.0.3

reademption align -p 40 --poly_a_clipping -f analysis_reademption -q -g -S
reademption coverage -f analysis_reademption -p 40

In the coverage files, a constant value seems to be added at each position of the genome, as exemplified here (top two files are fw and rev of raw coverage, bottom two are fw and rev of min normalized):

Screenshot from 2023-09-20 09-55-26

Also another user sees the same with unrelated data and Reademption run.

konrad commented 1 year ago

Thanks for reporting this, @gprezza. Did you have a look at the BAM files in a genome browser and can you exclude that this the actual coverage? It would be helpful if you could provide a screenshot of the same position after loading the BAM files.

Tillsa commented 12 months ago

@gprezza did you find the signal in the BAM files?

gprezza commented 12 months ago

Sorry, kinda slipped out of my mind.

https://github.com/foerstner-lab/READemption/assets/31476234/c1c23ddc-de67-4c8c-a4eb-dc41b696f7ac

It doesn't seem to be coming from actual reads (as also shown by the coverage plot generated by IGV from the bam file). However, there is an enormous amount of what I think are spliced reads (horizontal lines in the bam track). I guess they shouldn't be the cause of the background in the wiggle files, but I also don't think this amount of spliced reads is normal.

Btw this (background in the wig files and spliced reads in the bam file) is not restricted to this locus only and happens throughout the entire genome.

edit: actually, thinking about it, maybe the coverage generating step does take into account spliced reads as actual reads spanning the entire region? That would explain the constant background.

Tillsa commented 12 months ago

You could run the analysis without read splitting and compare with the results with the results that use read splitting. Also can you check if you have split reads, where the segments map far away from each other?

gprezza commented 11 months ago

Sorry for the late reply.

Removing split alignment does indeed solve the issue (In the image: with split alignment on top, without on the bottom. That's a view of the whole genome btw.).

However, turning off this function is not really an option for me, since I have human RNA in these samples and splicing is happening. Just to clarify, what's shown in the image is not the human genome, but a bacterial one.

Screenshot from 2023-10-06 17-25-36

From a quick look, it doesn't seem like the split reads start and end all at the same loci. Rather they seem to simply 1) be a lot 2) have in general a distance between the two halves of tens of thousands of nucleotides.

Unfortunately, I don't have the time right now to further look into this, but I'd be happy to share these data privately if you guys want to.

Tillsa commented 11 months ago

You could use the option --coverage_style when running the coverage subcommand and select either 'first base only' or 'last base only'. That means that only the first base or the last base of an alignment is counted for coverage calculation. However, you will lose some information in either case. Another possible solutions is to filter out reads "manually" from the bam file that stretch too long. You could use pysam for example and extract the start and the end of each alignment to get the total length of the alignment. Then chose a threshold that makes biologically sense and discard all reads that are above this threshold.

gprezza commented 11 months ago

Thanks for the suggestions. However, I feel like these are ways to circumvent the problem rather than solving it. I am confident that this amount of split reads is not due to actual spliced transcripts in the samples. We never see this amount of split reads in similar samples of the same organism (even when analyzed with earlier versions of reademption), so my assumption is that the split reads are somehow not being aligned properly. Discarding them as you suggest would work as a temporary solution, but equates to throwing away reads that normally would be valuable information.

Tillsa commented 11 months ago

You could take a couple of sample split reads that appear to map too far away from each other and align them once with segemehl and once with e.g. star (also a split aware aligner) and see if the mapping results are different.

gprezza commented 11 months ago

Sorry, but I really don't have the time to look at this right now; I'll be able to do it in one or two months. In the meantime, happy to share any data if you want to invistigate.

gprezza commented 9 months ago

I extracted split-aligned reads and re-aligned them with segemehl and bbmap, enabling very long read splitting in both cases. The reads with very long splits are aligned by segemehl, but generally not by bbmap, so it seems to be a difference in how the two aligners operate.

The issue of the coverage files generated by reademption however remains: why are split reads counted as full, very long reads? E.g. in my video above, compare the coverage track produced by reademption (blue track) with the coverage generated by IGV from the bam file (gray track, visible from 0:14).

Tillsa commented 9 months ago

Segemehl considers a split mapped read as one alignment. That means the the start of the alignment is the leftmost mapping position of any of the splits and the end of the alignment is the rightmost position of any of the splits. READemption builds the coverage from these start and the end positions. If the splits are very far away from each other, you will get an alignment that also has a start that is far away from its end. And READemption will produce coverage from this information. Depending on your problem that you want to solve with the coverage it might be worth to change READemptions coverage style when creating coverage files (see: https://reademption.readthedocs.io/en/latest/subcommands.html#coverage and use the option --coverage_style). Another alternative could be discarding the reads where the splits map very far away from each other, if they make no biological sense to you. You could use pysam to retrieve the read names of alignments with very long reference lengths (see: https://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment.reference_length). Afterwards you can delete these reads and from your Fasta files and rerun the mapping.

gprezza commented 9 months ago

Thanks for the quick reply!

I honestly find it hard to imagine a case where it would make sense to consider split reads as full alignments and produce coverage files accordingly. Is this something introduced in the latest reademption versions? As I mentioned before, we never saw this constant background on coverage files with similar samples analyzed with an earlier version of reademption (v0.4.3).

I guess changing the --coverage_style option should solve this, but neither option is ideal. If i may advance a feature request here, another coverage option that uses all aligned base but not the indels would be great :)

Tillsa commented 9 months ago

READemption has always used the start and the end position of an alignment for calculating coverage. However, READemption version 0.4.3. used segemehl version 0.2.0 while newer versions use segemehl version 0.3.4. It might be that the internal split read algorithm has changed. Unfortunately, there is no option for segemehl to indicate how long the insert size between two splits should be. Sometimes segemehl even counts alignments with single-nucleotide mismatches as split alignments. I agree that an option to calculate coverage not only based on the start and stop position but based on actual mapped nucleotides would be beneficial for READemption and I created a feature request issue https://github.com/foerstner-lab/READemption/issues/57

gprezza commented 8 months ago

Happy new year!

I aligned the same sample with READemption version 0.4.3 and 2.0.3 (allowing split reads in both cases) and can confirm that split reads are much more frequent in v 2.0.3. It seems to derive from the different segemehl versions as you said.

I'll close this issue now; let me just stress once again that a fix for the coverage files would be much appreciated :) .