BrooksLabUCSC / flair

Full-Length Alternative Isoform analysis of RNA
Other
207 stars 71 forks source link

Low coverage of reads mapped to isoform #129

Open wigge206 opened 4 years ago

wigge206 commented 4 years ago

I am wanting to visualise reads that map to each isoform by using the --generate_map option from collapse. The reads the supposedly support this novel isoform (225532d7-ec74-4839-9708-8c26168366cb_ENSG00000141510.17, bottom panel in image) only have partly mapped to the isoform, is this to be expected? This isoform was highly expressed and so we are wanting to design assay to validate this isoform, hence the need to visualse reads.

igv_snapshot

In case I have made an error in my code: python flair align -g hg38.fa -r sample1.fastq -v1.3 -o sample1 Repeat for all samples

python flair correct -q sample1.bed -g hg38.fa -f gencode.v29.annotation.gtf -o sample1 Repeat for all samples cat *all_corrected.bed > all_samples_corrected.bed

python flair collapse -q all_samples_corrected.bed -r sample1.fastq sample2.fastq sample3.fastq -f gencode.v29.annotation.gtf --generate_map

To pull out reads and visualise in IGV I matched the reads IDs to the original alignments files for each sample. grep 225532d7-ec74-4839-9708-8c26168366cb_ENSG00000141510.17 allsamples_aligned.isoform.read.map.txt |sed 's/,/\n/g' | sed '1d' > 225532d7-ec74-4839-9708-8c26168366cb_ENSG00000141510.17.reads.map.txt samtools view sample1.bam | grep -f 225532d7-ec74-4839-9708-8c26168366cb_ENSG00000141510.17.reads.map.txt > subset.sam samtools view -H sample1.bam > header cat header subset.sam > subset_sample1.sam samtools view -bS subset_sample1.sam > subset_sample1.bam samtools sort subset_sample1.bam -o subsetSort_sample1.bam samtools index subsetSort_sample1.bam

ENSG00000141510 17_isoforms

ENSG00000141510 17_usage

belgravia commented 4 years ago

Your commands look good and that is what I would expect to happen -- shorter reads can support a longer isoform. The idea is that if the truncated read still covers a part of the isoform that is unique to that isoform, then it can be considered a supporting read for that isoform.

The way we have implemented this is only a simple check of mapping quality -- if the mapq from minimap2 is 0, then read mapped ambiguously to multiple isoforms and is not counted. If a read aligns to an isoform with mapq of 1 or greater, then that read is assigned to that isoform. 1 is not a conservative threshold -- I would recommend that you either try a higher --quality threshold (you could try 10, or 30) or try running with collapse with --stringent. Stringent enforces that the supporting reads need to cover 80% of an isoform's nucleotides, including some bases of the first and last exon.

Let me know if you have further questions and thanks for your patience.

-Alison

wigge206 commented 4 years ago

I had tried --stringent with little change, increasing the quality threshold did changes the number of reads mapped etc but not as expected. The FL known isoform (dark blue) lost the majority of its reads, whilst the novel isoform (orange track in IGV, green in FLAIR isoform.png), which, contains all the same exon boundaries as the FL with exception of the inclusion of the entire intron 4, retained its reads.

image

The concern is that when assessing the alignments the vast majority of reads do not contain bases that overlap the intron inclusion (the only unique region compared to FL). The vast majority of reads mapped to the novel isoform would map equally well to the FL isoform.

wigge206 commented 3 years ago

Hi Alison, I thought I would add more detail as I have attempted a few more things but still are having issues. Firstly, the sequencing was preformed on two PCR amplicons. One amplicon is generated by primers in the first and last exon and represents the FL (and FL splice isoforms). The second amplicon is generated by a primer in a known internal start site and the same reverse primer. These are well described splicing changes and the delta isoform is represented gencode and the gtf file I have been using.

The issue is the reads coming from the delta isoform are being used as evidence for a FL isoform with an intron inclusion event rather then being mapped to the known delta isoforms. I have tried to split these reads out by extracting any read that was entirely contained between the two delta primers (with a small buffer). I was able to split the reads (not perfectly), however, after running flair on these reads they were now mapped to a different FL isoform despite very few reads being mapped to the first three exon. There was no evidence of the delta isoforms.

Is this an issue of flair? Whereby, internal start codons are miscalled/mapped to longer isoforms?