Why are novel transcripts with most reads assigned not reported?

SarahNadeau commented 11 months ago

Hello,

I am a bit confused by my isoquant results. I analyzed three BAM files with combined 109932 mappings (109931 primary, 1 supplementary). The log shows that all 109931 reads were classified as intergenic and have polyAs (this is what I expect, we added polyAs to the alignment).

Questions

I was suprised there are only 107234 reads in transcript_model_reads.tsv. Why would some reads not be in this output?
I was also surprised that a pivot table of OUT.transcript_model_reads.tsv shows reads are assigned to many different transcripts, but only one is reported in OUT.transcript_model_grouped_counts.tsv. The reported transcript this is not even the one with the most read support (see screenshots below).

Supporting files

isoquant_transcript_model_reads_counts.zip

Pivot table of the 107234 reads in transcript_model_reads.tsv: Screenshot 2023-12-14 at 10 33 15

OUT.transcript_model_grouped_counts.tsv: Screenshot 2023-12-14 at 10 34 09

Command line (run as part of a snakemake workflow):

isoquant.py \
    --reference {input.reference} \
    --genedb {input.annotations} \
    --data_type nanopore \
    --transcript_quantification unique_only \
    --model_construction_strategy all \
    --stranded forward \
    --count_exons \
    --bam {input.bam} \
    --data_type nanopore \
    --threads {threads} \
    -o $WORKDIR

andrewprzh commented 11 months ago

Dear @SarahNadeau

I was suprised there are only 107234 reads in transcript_model_reads.tsv. Why would some reads not be in this output?

In some cases there is not enough support for the transcript to be reported as reliable. E.g. if a novel transcript is supported only by a few, it will be discarded, and these reads will not be used.

I was also surprised that a pivot table of OUT.transcript_model_reads.tsv shows reads are assigned to many different transcripts, but only one is reported in OUT.transcript_model_grouped_counts.tsv. The reported transcript this is not even the one with the most read support (see screenshots below).

This is indeed surprising, could you share non-grouped counts?

Best Andrey

SarahNadeau commented 11 months ago

Hi @andrewprzh,

Thanks for the reply! The non-grouped counts also only include the one reported transcript. In case it helps, I've also included a copy of the log file.

isoquant_log_transcript_model_counts.zip

SarahNadeau commented 11 months ago

Just fyi I redacted some of the paths etc. in the log file, I hope that's not a problem.

SarahNadeau commented 11 months ago

Dear @andrewprzh,

I think I found the reason for these weird results. I played around with the source code, commenting out the lines that filter out similar and not-well-supported novel transcripts and uncommenting the lines that log why they would be filtered out.

I can now see that there is an unreported transcript (number 9) that is discarded for being very similar to the transcript with the highest read support. For this reason, the unique reads are quite low for the transcript with the highest read support and thus the count values are also low compared to the transcript with the third highest number of reads supporting it.

Based on these observations, would it make sense to re-calculate the unique reads assigned to a transcript after filtering out similar and not-well-supported transcripts? That way the counts are representative of the read support.

Attached are some supporting materials so you can see what I mean. I apologize that I don't have a small reproducible example and that also these transcript and read/count numbers differ from my original example because I started using only a subset of the reads in order to iterate faster.

Command line:

./IsoQuant/isoquant.py \
   --bam results/align_reads/OUT/aux/OUT_barcode_01.filtered.bam \
         results/align_reads/OUT/aux/OUT_barcode_02.filtered.bam \
         results/align_reads/OUT/aux/OUT_barcode_03.filtered.bam \
   -r data/references/ref.fa.gz \
   -g data/references/ref_coords_corrected.gtf \
   -d nanopore \
   --transcript_quantification unique_only \
   --debug \
   -o results/isoquant

Selected lines from isoquant.log:

2023-12-21 14:58:38,116 - DEBUG - Novel model transcript9.NC_000072.7.nnic has a similar isoform transcript7.NC_000072.7.nnic
2023-12-21 14:58:38,119 - DEBUG - Novel model transcript11.NC_000072.7.nnic has coverage 480 < 534.46, component cov = 26723
2023-12-21 14:58:38,144 - DEBUG - Novel model transcript25.NC_000072.7.nnic has a similar isoform transcript7.NC_000072.7.nnic
2023-12-21 14:58:38,151 - DEBUG - Novel model transcript38.NC_000072.7.nnic has a similar isoform transcript23.NC_000072.7.nnic
2023-12-21 14:58:38,153 - DEBUG - Novel model transcript44.NC_000072.7.nnic has coverage 9 < 534.46, component cov = 26723
2023-12-21 14:58:38,155 - DEBUG - Novel model transcript49.NC_000072.7.nnic has coverage 5 < 534.46, component cov = 26723

Screenshot 2023-12-21 at 15 15 04

transcript_model_counts_and_reads.zip

andrewprzh commented 10 months ago

Dear @SarahNadeau

Wow, thanks a lot for sharing those insights! It's really cool that you wend down to looks at the code, especially taking into account that transcript discovery part is not that well structured.

I totally agree about recalculating the counts! Could you provide the exact lines you were modifying? I probably have a clue which ones are you talking about, but there a few filtering steps so I'd like to be sure :)

Best Andrey

SarahNadeau commented 10 months ago

Hi @andrewprzh,

The changes are in the filter_transcripts function in src/graph_based_model_construction.py: https://github.com/SarahNadeau/IsoQuant/blob/master/src/graph_based_model_construction.py#L201.

Here is the git diff:

diff --git a/src/graph_based_model_construction.py b/src/graph_based_model_construction.py
index 8b7b382..0b9f3ec 100644
--- a/src/graph_based_model_construction.py
+++ b/src/graph_based_model_construction.py
@@ -220,24 +220,24 @@ class GraphBasedModelConstructor:
                                            self.params.min_novel_count_rel * component_coverage)

             if model.transcript_id in to_substitute:
-                #logger.debug("Novel model %s has a similar isoform %s" % (model.transcript_id, to_substitute[model.transcript_id]))
+                logger.debug("Novel model %s has a similar isoform %s" % (model.transcript_id, to_substitute[model.transcript_id]))
                 self.transcript_read_ids[to_substitute[model.transcript_id]] += self.transcript_read_ids[model.transcript_id]
-                del self.transcript_read_ids[model.transcript_id]
+                # del self.transcript_read_ids[model.transcript_id]
                 continue

             if self.internal_counter[model.transcript_id] < novel_isoform_cutoff:
-                #logger.debug("Novel model %s has coverage %d < %.2f, component cov = %d" % (model.transcript_id,
-                #                                                        self.internal_counter[model.transcript_id],
-                #                                                        novel_isoform_cutoff, component_coverage))
-                del self.transcript_read_ids[model.transcript_id]
+                logger.debug("Novel model %s has coverage %d < %.2f, component cov = %d" % (model.transcript_id,
+                                                                       self.internal_counter[model.transcript_id],
+                                                                       novel_isoform_cutoff, component_coverage))
+                # del self.transcript_read_ids[model.transcript_id]
                 continue

             if len(model.exon_blocks) <= 2:
                 mapq = self.mapping_quality(model)
-                #logger.debug("Novel model %s has quality %.2f" % (model.transcript_id, mapq))
+                logger.debug("Novel model %s has quality %.2f" % (model.transcript_id, mapq))
                 if mapq < self.params.simple_models_mapq_cutoff:
-                    #logger.debug("Novel model %s has poor quality" % model.transcript_id)
-                    del self.transcript_read_ids[model.transcript_id]
+                    logger.debug("Novel model %s has poor quality" % model.transcript_id)
+                    # del self.transcript_read_ids[model.transcript_id]
                     continue

             # TODO: correct ends for known

andrewprzh commented 10 months ago

Dear @SarahNadeau

Thanks a lot! Will get my hands on it!

Best Andrey

andrewprzh commented 6 months ago

Should be now fixed in IsoQuant 3.4. Read assignment is completely reworked now. Thank you for the report!

ablab / IsoQuant