Closed SarahNadeau closed 6 months ago
Dear @SarahNadeau
I was surprised there are only 107234 reads in transcript_model_reads.tsv. Why would some reads not be in this output?
In some cases there is not enough support for the transcript to be reported as reliable. E.g. if a novel transcript is supported only by a few reads, it will be discarded, and those reads will not be used.
I was also surprised that a pivot table of OUT.transcript_model_reads.tsv shows reads are assigned to many different transcripts, but only one is reported in OUT.transcript_model_grouped_counts.tsv. The reported transcript is not even the one with the most read support (see screenshots below).
This is indeed surprising, could you share non-grouped counts?
Best Andrey
Hi @andrewprzh,
Thanks for the reply! The non-grouped counts also only include the one reported transcript. In case it helps, I've also included a copy of the log file.
Just FYI, I redacted some of the paths etc. in the log file; I hope that's not a problem.
Dear @andrewprzh,
I think I found the reason for these weird results. I played around with the source code, commenting out the lines that filter out similar and not-well-supported novel transcripts and uncommenting the lines that log why they would be filtered out.
I can now see that there is an unreported transcript (number 9) that is discarded for being very similar to the transcript with the highest read support. Because of this, the unique-read count for the transcript with the highest read support is quite low, and thus its count values are also low compared to the transcript with the third-highest number of supporting reads.
Based on these observations, would it make sense to re-calculate the unique reads assigned to a transcript after filtering out similar and not-well-supported transcripts? That way the counts are representative of the read support.
Attached are some supporting materials so you can see what I mean. I apologize that I don't have a small reproducible example, and also that these transcript and read/count numbers differ from my original example: I started using only a subset of the reads in order to iterate faster.
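The recalculation proposed above could look roughly like the sketch below: once similar and poorly supported novel transcripts have been discarded, re-derive each surviving transcript's unique-read count from the surviving set only. All names here (recount_unique_reads, the dict of read-id sets) are illustrative, not IsoQuant's actual internals.

```python
from collections import Counter

def recount_unique_reads(transcript_read_ids):
    """transcript_read_ids: dict mapping surviving transcript ids to sets
    of supporting read ids. Returns per-transcript unique-read counts
    computed AFTER filtering, so reads freed up by discarded transcripts
    count as unique again."""
    # How many surviving transcripts does each read support?
    read_multiplicity = Counter(
        read_id
        for reads in transcript_read_ids.values()
        for read_id in reads
    )
    # A read is "unique" if it supports exactly one surviving transcript.
    return {
        t_id: sum(1 for r in reads if read_multiplicity[r] == 1)
        for t_id, reads in transcript_read_ids.items()
    }

# Toy example: r2 supports both t1 and t2, so neither gets it as unique;
# r1 and r3 each remain unique to their transcript.
print(recount_unique_reads({"t1": {"r1", "r2"}, "t2": {"r2", "r3"}}))
```

If a discarded transcript had been the only other holder of some read, that read becomes unique to the surviving transcript under this recount, which is exactly what would make the reported counts reflect the true read support.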
./IsoQuant/isoquant.py \
--bam results/align_reads/OUT/aux/OUT_barcode_01.filtered.bam \
results/align_reads/OUT/aux/OUT_barcode_02.filtered.bam \
results/align_reads/OUT/aux/OUT_barcode_03.filtered.bam \
-r data/references/ref.fa.gz \
-g data/references/ref_coords_corrected.gtf \
-d nanopore \
--transcript_quantification unique_only \
--debug \
-o results/isoquant
2023-12-21 14:58:38,116 - DEBUG - Novel model transcript9.NC_000072.7.nnic has a similar isoform transcript7.NC_000072.7.nnic
2023-12-21 14:58:38,119 - DEBUG - Novel model transcript11.NC_000072.7.nnic has coverage 480 < 534.46, component cov = 26723
2023-12-21 14:58:38,144 - DEBUG - Novel model transcript25.NC_000072.7.nnic has a similar isoform transcript7.NC_000072.7.nnic
2023-12-21 14:58:38,151 - DEBUG - Novel model transcript38.NC_000072.7.nnic has a similar isoform transcript23.NC_000072.7.nnic
2023-12-21 14:58:38,153 - DEBUG - Novel model transcript44.NC_000072.7.nnic has coverage 9 < 534.46, component cov = 26723
2023-12-21 14:58:38,155 - DEBUG - Novel model transcript49.NC_000072.7.nnic has coverage 5 < 534.46, component cov = 26723
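A quick way to summarize debug lines like the ones above is to tally, per novel model, why it was dropped. Note that 534.46 = 0.02 × 26723, i.e. the coverage cutoff in these logs appears to be a relative threshold (presumably min_novel_count_rel) times the component coverage. The parsing below is a sketch matched to the two message formats shown; it is not an official IsoQuant tool.

```python
import re

# Regexes for the two DEBUG message formats seen in the log excerpt.
similar_re = re.compile(r"Novel model (\S+) has a similar isoform (\S+)")
coverage_re = re.compile(r"Novel model (\S+) has coverage (\d+) < ([\d.]+)")

def summarize_filtering(log_lines):
    """Map each discarded novel model to the reason it was filtered out."""
    dropped = {}
    for line in log_lines:
        if m := similar_re.search(line):
            dropped[m.group(1)] = ("similar_to", m.group(2))
        elif m := coverage_re.search(line):
            dropped[m.group(1)] = ("low_coverage", int(m.group(2)))
    return dropped

log = [
    "... - DEBUG - Novel model transcript9.NC_000072.7.nnic has a similar isoform transcript7.NC_000072.7.nnic",
    "... - DEBUG - Novel model transcript44.NC_000072.7.nnic has coverage 9 < 534.46, component cov = 26723",
]
print(summarize_filtering(log))
```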
Dear @SarahNadeau
Wow, thanks a lot for sharing those insights! It's really cool that you went down to look at the code, especially taking into account that the transcript discovery part is not that well structured.
I totally agree about recalculating the counts! Could you provide the exact lines you were modifying? I probably have a clue which ones you are talking about, but there are a few filtering steps, so I'd like to be sure :)
Best Andrey
Hi @andrewprzh,
The changes are in the filter_transcripts function in src/graph_based_model_construction.py: https://github.com/SarahNadeau/IsoQuant/blob/master/src/graph_based_model_construction.py#L201.
Here is the git diff:
diff --git a/src/graph_based_model_construction.py b/src/graph_based_model_construction.py
index 8b7b382..0b9f3ec 100644
--- a/src/graph_based_model_construction.py
+++ b/src/graph_based_model_construction.py
@@ -220,24 +220,24 @@ class GraphBasedModelConstructor:
self.params.min_novel_count_rel * component_coverage)
if model.transcript_id in to_substitute:
- #logger.debug("Novel model %s has a similar isoform %s" % (model.transcript_id, to_substitute[model.transcript_id]))
+ logger.debug("Novel model %s has a similar isoform %s" % (model.transcript_id, to_substitute[model.transcript_id]))
self.transcript_read_ids[to_substitute[model.transcript_id]] += self.transcript_read_ids[model.transcript_id]
- del self.transcript_read_ids[model.transcript_id]
+ # del self.transcript_read_ids[model.transcript_id]
continue
if self.internal_counter[model.transcript_id] < novel_isoform_cutoff:
- #logger.debug("Novel model %s has coverage %d < %.2f, component cov = %d" % (model.transcript_id,
- # self.internal_counter[model.transcript_id],
- # novel_isoform_cutoff, component_coverage))
- del self.transcript_read_ids[model.transcript_id]
+ logger.debug("Novel model %s has coverage %d < %.2f, component cov = %d" % (model.transcript_id,
+ self.internal_counter[model.transcript_id],
+ novel_isoform_cutoff, component_coverage))
+ # del self.transcript_read_ids[model.transcript_id]
continue
if len(model.exon_blocks) <= 2:
mapq = self.mapping_quality(model)
- #logger.debug("Novel model %s has quality %.2f" % (model.transcript_id, mapq))
+ logger.debug("Novel model %s has quality %.2f" % (model.transcript_id, mapq))
if mapq < self.params.simple_models_mapq_cutoff:
- #logger.debug("Novel model %s has poor quality" % model.transcript_id)
- del self.transcript_read_ids[model.transcript_id]
+ logger.debug("Novel model %s has poor quality" % model.transcript_id)
+ # del self.transcript_read_ids[model.transcript_id]
continue
# TODO: correct ends for known
Dear @SarahNadeau
Thanks a lot! Will get my hands on it!
Best Andrey
Should be now fixed in IsoQuant 3.4. Read assignment is completely reworked now. Thank you for the report!
Hello,
I am a bit confused by my isoquant results. I analyzed three BAM files with combined 109932 mappings (109931 primary, 1 supplementary). The log shows that all 109931 reads were classified as intergenic and have polyAs (this is what I expect, we added polyAs to the alignment).
Questions
I was surprised there are only 107234 reads in transcript_model_reads.tsv. Why would some reads not be in this output?
I was also surprised that a pivot table of OUT.transcript_model_reads.tsv shows reads are assigned to many different transcripts, but only one is reported in OUT.transcript_model_grouped_counts.tsv. The reported transcript is not even the one with the most read support (see screenshots below).
Supporting files
isoquant_transcript_model_reads_counts.zip
Pivot table of the 107234 reads in transcript_model_reads.tsv:
OUT.transcript_model_grouped_counts.tsv:
Command line (run as part of a snakemake workflow):