ablab / IsoQuant

Transcript discovery and quantification with long RNA reads (Nanopores and PacBio)
https://ablab.github.io/IsoQuant/

inconsistency between transcript_tpm.tsv and transcript_model_tpm.tsv #248


Mangosteen24 commented 1 month ago

Hi, thank you for developing this useful tool!

I would like to ask about the differences between transcript_tpm.tsv and transcript_model_tpm.tsv. The file transcript_model_tpm.tsv contains the expression of discovered transcript models in TPM (and corresponds to transcript_models.gtf), so it should include all expressed transcripts, both novel and known, correct? However, when I search for a specific transcript, such as the canonical transcript ENST00000275493, I find it only in transcript_tpm.tsv, with a high TPM value, while it is absent from transcript_model_tpm.tsv. I understand that these are two different algorithms, reference-based quantification and transcript discovery, but it is odd that ENST00000275493 is completely absent from transcript_model_tpm.tsv. Is there a reason for this inconsistency?

Similarly, ENST00000275493 is absent from transcript_models.gtf. Would you recommend using transcript_models.gtf or extended_annotation.gtf for downstream analyses, such as SQANTI3?
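For reference, a quick way to check whether a given transcript appears in each of these outputs is a plain grep; the file names below assume IsoQuant's default OUT prefix:

grep -c ENST00000275493 OUT.transcript_tpm.tsv OUT.transcript_model_tpm.tsv OUT.transcript_models.gtf OUT.extended_annotation.gtf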

Thank you!

andrewprzh commented 1 month ago

Dear @Mangosteen24

Thank you for the feedback!

Your understanding is correct, and this is indeed a little odd. To understand where this inconsistency stems from, one has to dig deeper into the algorithm and the data.

A few questions. Which version do you use? Some of the inconsistencies were fixed at some point, but I cannot guarantee all of them are eliminated. Which reads are assigned to these isoforms in .read_assignments.tsv.gz, and what are their assignment types? Where do these reads go in .transcript_model_reads.tsv.gz?
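For example, a tally of assignment types for a single transcript could be produced along these lines (the column positions are assumptions based on the usual read_assignments.tsv.gz layout, with isoform_id in column 4 and assignment_type in column 6; please check against the header line):

zcat OUT.read_assignments.tsv.gz | awk -F'\t' '$4 ~ /ENST00000275493/ {print $6}' | sort | uniq -c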

Best, Andrey

Mangosteen24 commented 1 month ago

Hi Andrey, I am using the latest version of IsoQuant, v3.6.1.

First, I checked which reads were assigned to ENST00000275493 and their assignment_type in OUT.read_assignments.tsv.gz:

1165 ambiguous
659 inconsistent
13394 inconsistent_ambiguous
20156 inconsistent_non_intronic
10032 unique
19 unique_minor_difference

Then I checked where those 10032 uniquely assigned reads (read_id) went in OUT.transcript_model_reads.tsv.gz:

9049 *
 219 ENST00000344576
  28 ENST00000450046
   3 ENST00000485503
 236 transcript13328.chr7.nic
  56 transcript13334.chr7.nnic
  21 transcript13336.chr7.nic
 108 transcript13359.chr7.nic
 193 transcript13376.chr7.nic
 101 transcript13394.chr7.nic
   3 transcript13436.chr7.nic
   3 transcript13455.chr7.nic
   1 transcript13579.chr7.nnic
 804 transcript13776.chr7.nnic
   9 transcript13842.chr7.nnic
 621 transcript13858.chr7.nnic
 142 transcript13993.chr7.nnic
   3 transcript13995.chr7.nic

It seems that most of the uniquely assigned reads went to * rather than to ENST00000275493, which means they were not assigned to any transcript model, whether known, NIC, or NNIC. I also noticed that most of these unique reads' assignment events were 'mono_exonic'; maybe that is why IsoQuant cannot tell which isoform they come from?
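For reference, this cross-check can be reproduced roughly as follows (column positions are assumptions: read_id in column 1, isoform_id in column 4 and assignment_type in column 6 of read_assignments.tsv.gz, and the model ID in column 2 of transcript_model_reads.tsv.gz; verify against the file headers):

# collect reads uniquely assigned to the transcript
zcat OUT.read_assignments.tsv.gz | awk -F'\t' '$4 ~ /ENST00000275493/ && $6 == "unique" {print $1}' > unique_reads.txt
# see which transcript models those reads ended up supporting
zcat OUT.transcript_model_reads.tsv.gz | grep -F -w -f unique_reads.txt | cut -f2 | sort | uniq -c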

andrewprzh commented 1 month ago

Thank you for providing the insights!

I think that is what is happening here.

These reads map to parts that are unique to the ENST00000275493 isoform (i.e. exons that are present only in this transcript), and thus are uniquely assigned. However, to report a transcript model during transcript discovery, IsoQuant needs full splice match reads, and I suspect this model was not reported precisely because there are none. As a result, these reads were either not assigned to any model (*) or got assigned to something else.

I can take a further look if you have a chance to send me a BAM file with all the reads from this region, or with the reads assigned to ENST00000275493 (all assignment types).

Best, Andrey

Mangosteen24 commented 3 weeks ago

Thank you @andrewprzh. I have a follow-up question. Since I have a huge dataset, around 40 BAM files, I want to run them all together to obtain a single annotation file; this way, every sample uses the same IDs for NIC/NNIC transcripts and they can be compared. However, the process seems to be stuck after running for 24 hours; the last line in the log file is from a few days ago: '2024-10-26 - INFO - Finished processing chromosome chr10'. Here is my command. Could you kindly suggest a solution?

isoquant.py --data_type nanopore --yaml /home/dataset.yaml --complete_genedb --genedb /home/refdata-gex-GRCh38-2020-A/genes/genes.gtf --reference /home/refdata-gex-GRCh38-2020-A/fasta/genome.fa --output /home/isoquant_exp --read_group tag:CB --sqanti_output --threads 32
dataset.yaml 
[
    data format: "bam",
    {
        name: "exp",
        long read files: [
            "/home/tagged1.bam",
            "/home/tagged2.bam",
            "/home/tagged3.bam",
            "/home/tagged4.bam",
            "/home/tagged5.bam",
            "/home/tagged6.bam",
            "/home/tagged7.bam",
            "/home/tagged8.bam",
            "/home/tagged9.bam",
            "/home/tagged10.bam",
            "/home/tagged11.bam",
            "/home/tagged12.bam",
            "/home/tagged13.bam",
            "/home/tagged14.bam",
            "/home/tagged15.bam",
            "/home/tagged16.bam",
            "/home/tagged17.bam",
            "/home/tagged18.bam",
            "/home/tagged19.bam",
            "/home/tagged20.bam",
            "/home/tagged21.bam",
            "/home/tagged22.bam",
            "/home/tagged23.bam",
            "/home/tagged24.bam",
            "/home/tagged25.bam",
            "/home/tagged26.bam",
            "/home/tagged27.bam",
            "/home/tagged28.bam",
            "/home/tagged29.bam",
            "/home/tagged30.bam",
            "/home/tagged31.bam",
            "/home/tagged32.bam",
            "/home/tagged33.bam",
            "/home/tagged34.bam",
            "/home/tagged35.bam",
            "/home/tagged36.bam",
            "/home/tagged37.bam"
        ],
        labels: [
            "Replicate1",
            "Replicate2",
            "Replicate3",
            "Replicate4",
            "Replicate5",
            "Replicate6",
            "Replicate7",
            "Replicate8",
            "Replicate9",
            "Replicate10",
            "Replicate11",
            "Replicate12",
            "Replicate13",
            "Replicate14",
            "Replicate15",
            "Replicate16",
            "Replicate17",
            "Replicate18",
            "Replicate19",
            "Replicate20",
            "Replicate21",
            "Replicate22",
            "Replicate23",
            "Replicate24",
            "Replicate25",
            "Replicate26",
            "Replicate27",
            "Replicate28",
            "Replicate29",
            "Replicate30",
            "Replicate31",
            "Replicate32",
            "Replicate33",
            "Replicate34",
            "Replicate35",
            "Replicate36",
            "Replicate37"
        ]
    }
]

andrewprzh commented 3 weeks ago

@Mangosteen24

Could you send me the entire log file? Sometimes reading many BAM files in parallel results in slower performance; merging the BAMs into a single file and using tags or a TSV file to group counts may well be a faster solution (a rough sketch follows).
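A minimal sketch of that approach, assuming samtools is available, all BAMs were aligned to the same reference, and the CB tags already carry the grouping you need (the merged file name and output directory are placeholders):

samtools merge -@ 16 merged.bam /home/tagged*.bam
samtools index merged.bam
isoquant.py --data_type nanopore --bam merged.bam --complete_genedb --genedb /home/refdata-gex-GRCh38-2020-A/genes/genes.gtf --reference /home/refdata-gex-GRCh38-2020-A/fasta/genome.fa --output /home/isoquant_merged --read_group tag:CB --sqanti_output --threads 32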

Best, Andrey

Mangosteen24 commented 2 weeks ago

I tried again with 27 samples; however, the server shut down unexpectedly. At that point, most chromosomes seemed to have finished processing except for four: chr1, chr7, chr11, and chrM. Details are in the attached log1.

isoquant.py --data_type nanopore --yaml /home/Work/isoquant_27samples/dataset.yaml --complete_genedb --genedb /home/_shared/database/refdata-gex-GRCh38-2020-A/genes/genes.gtf --reference /home/_shared/database/refdata-gex-GRCh38-2020-A/fasta/genome.fa --output /home/Work/isoquant_27samples_try2 --read_group tag:CB --sqanti_output --threads 32

log1.log

I then resumed the run and monitored the memory usage closely. I noticed that memory consumption increased to more than 600GB while processing chromosome 1 (see log2), so I manually stopped the program to prevent further issues. At that time, chr1, chr7, chr11, and chrM all appeared as ‘Loading read assignments’ but didn’t finish.

isoquant.py --resume --output /home/Work/isoquant_27samples_try2 --threads 32 

log2.log

After that, I resumed again with only 2 threads, hoping to reduce memory usage. At the start, chr1 and chr7 were being processed. After 30 hours of this second resume, chr1 had finished and chr7 and chr11 were being processed. About 6 hours later, memory started to increase dramatically, so I had to stop the run again (see log3). I am unsure whether I should resume again or try a different approach. Could you please advise on the next step?

isoquant.py --resume --output /home/Work/isoquant_27samples_try2 --threads 2

log3.log

Furthermore, would it be helpful to:

  1. Remove chrM from the 27 BAM files and try again?
  2. Merge all samples' BAM files by chromosome, e.g. run chr1 from all samples together?

Thanks!

andrewprzh commented 6 days ago

@Mangosteen24

Sorry for the delay.

I think the main issue here is --read_group tag:CB. How many barcodes do you have in total? Although grouped counts were optimized, outputting them might still take a lot of time and possibly memory. I will continue optimizing this step.
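One rough way to count distinct CB barcodes in a single BAM, assuming CB is stored as a string tag (CB:Z:...) and taking one of your files as an example:

samtools view /home/tagged1.bam | grep -o 'CB:Z:[^[:space:]]*' | sort -u | wc -l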

Removing chrM might help as well; chrM is known to be very slow to process when too many reads map onto it.
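If you decide to try that, one possible way to drop chrM reads per BAM, assuming the files are indexed and the mitochondrial contig is named chrM (output names are placeholders):

samtools idxstats /home/tagged1.bam | cut -f1 | grep -v -e '^chrM$' -e '^\*$' | xargs samtools view -b -o /home/tagged1.nochrM.bam /home/tagged1.bam
samtools index /home/tagged1.nochrM.bam

Note that this keeps only the listed contigs, so unmapped reads are dropped as well.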

Best, Andrey