Transcript quantification results are inconsistent

biochristmas commented 4 weeks ago

Hi, I ran this command to perform Isoquant quantification with the aim of obtaining transcript quantification results for all samples: isoquant.py --reference genome.fa --genedb ./merge.combined.gtf --fastq CK.derRNA.fastq T1.derRNA.fastq T2.derRNA.fastq --data_type nanopore --model_construction_strategy all --threads 60 -o isoquant_quant. The GTF annotation file contains 320,000 transcripts. Based on Isoquant quantification, two quantification result files were generated: OUT.transcript_model_grouped_tpm.tsv and OUT.transcript_grouped_tpm.tsv. The number of transcripts (rows) in OUT.transcript_model_grouped_tpm.tsv is 260,000, while in OUT.transcript_grouped_tpm.tsv it is 320,000. Can I interpret the OUT.transcript_grouped_tpm.tsv file as containing the total transcript quantification results? Additionally, I noticed that the TPM values for the same transcript in OUT.transcript_model_grouped_tpm.tsv and OUT.transcript_grouped_tpm.tsv are inconsistent. Third question: If all TPM values for a transcript are zero in the OUT.transcript_grouped_tpm.tsv file, then that transcript will not appear in the OUT.transcript_model_grouped_tpm.tsv file. Therefore, I believe that the OUT.transcript_model_grouped_tpm.tsv file contains quantification results for transcripts that are expressed, while the OUT.transcript_grouped_tpm.tsv file contains quantification results for all transcripts, regardless of whether all sample TPM values are zero.

andrewprzh commented 3 weeks ago

Dear @biochristmas

OUT.transcript_grouped_tpm.tsv contain all the reference transcripts, even those with 0 counts. There are no novel transcripts in this file. OUT.transcript_model_grouped_tpm.tsv contains all the transcripts from the OUT.transcript_models.gtf (both novel and known). There should be no rows with 0 counts/TPMs since these transcripts must be supported by reads.

These files are produced by different algorithms, so some inconsistency is expected. Although I'd agree it's quite noticeable n your example. Which version are you using, by the way?

Best Andrey

biochristmas commented 3 weeks ago

@andrewprzh , I am using version 3.4.1 of IsoQuant。I have a new question: I noticed that when I run this command to expand transcript annotations on the reference genome: isoquant.py --reference ./reference.fa --genedb ./reference.gtf --fastq ./1_clustered.fasta ./2_clustered.fasta ./3_clustered.fasta --data_type pacbio_ccs --model_construction_strategy all --fl_data --threads 60 -o isoquant_result. Note: The FASTA sequences provided in the --fastq parameter are the results of clustering full-length transcript sequences using Isoseq3, and the number of sequences in clustered.fasta is much fewer compared to flnc.fasta. When I execute this command, the resulting file OUT.extended_annotation.gtf contains a total of 107,290 transcripts. However, when I use flnc.fasta (the full-length transcript sequences before clustering) as the input for the --fastq parameter, the resulting file OUT.extended_annotation.gtf contains 185,626 transcripts. So, I would like to ask if the number of transcripts obtained when expanding reference genome annotations using isoquant.py is influenced by the number of sequences in the input files."

andrewprzh commented 1 week ago

@biochristmas

Yes, the number of reported transcripts can be affected by the number of sequences (something that we also call "read support"). So such result is expected.

If you use --data_type pacbio_ccs with clustered PacBio transcripts, it may happen that some of the transcripts (especially novel) are not reported since they are supported only by a single sequence. You may try running IsoQuant with --data_type transcripts, which will require only a single supporting sequence for a transcript to be reported. For FLNC reads using --data_type pacbio_ccs should work fine.

Best Andrey

ablab / IsoQuant

Transcript quantification results are inconsistent #225