ablab / IsoQuant

Transcript discovery and quantification with long RNA reads (Nanopores and PacBio)
Other
128 stars 11 forks source link

transcript numbers don't match between OUT.transcript_model_grouped_tpm.tsv and OUT.transcript_models.gtf #195

Open dudududu12138 opened 1 month ago

dudududu12138 commented 1 month ago

Hi, I recently used isoquant(version 3.4.1) to detect isoforms from multiple samples. But the results are a little odd. Below are my quaetions:

  1. There are 167413 transcripts in OUT.transcript_models.gtf while 166857 in OUT.transcript_model_tpm.tsv.
  2. I didn't see the <combine.> files. There are only <grouped*> files. So what is the result file that report the expression levels of multisamples?
  3. Below is all my outputs:
    aux                          OUT.gene_grouped_counts.tsv  OUT.transcript_counts.tsv          OUT.transcript_model_grouped_counts.tsv  OUT.transcript_model_tpm.tsv
    OUT.corrected_reads.bed.gz   OUT.gene_grouped_tpm.tsv     OUT.transcript_grouped_counts.tsv  OUT.transcript_model_grouped_tpm.tsv     OUT.transcript_tpm.tsv
    OUT.extended_annotation.gtf  OUT.gene_tpm.tsv             OUT.transcript_grouped_tpm.tsv     OUT.transcript_model_reads.tsv.gz
    OUT.gene_counts.tsv          OUT.read_assignments.tsv.gz  OUT.transcript_model_counts.tsv    OUT.transcript_models.gtf
  4. Below is my codes:
    
    samples=`ls ../minimap2/*sorted.bam`

isoquant.py -t 30 \ --reference $ref \ --genedb $anno --complete_genedb \ --bam $samples \ --data_type pacbio_ccs \ -o $out/


Thank you very much!
andrewprzh commented 3 weeks ago

Dear @dudududu12138

  1. Could you send me the files? This looks unexpected.
  2. Currently, if you provide several BAM files, they will treated as distinct samples within the same experiment. Thus, a single output folder with a single transcript_models.gtf files will be created. "Per-file" count/TPM tables are stored in _grouped files. Combined tables are now created only if you provide 2 or more experiments via YAML dataset file.

Best Andrey

dudududu12138 commented 1 week ago

Dear @dudududu12138

  1. Could you send me the files? This looks unexpected.
  2. Currently, if you provide several BAM files, they will treated as distinct samples within the same experiment. Thus, a single output folder with a single transcript_models.gtf files will be created. "Per-file" count/TPM tables are stored in _grouped files. Combined tables are now created only if you provide 2 or more experiments via YAML dataset file.

Best Andrey

Sorry for the late reply. Below are the OUT.transcript_model_counts.tsv and OUT.transcript_models.gtf files. There are more transcripts in the gtf file. That is ,the transcripts in *counts.tsv are subset of transcripts in *models.gtf. It is odd. OUT.transcript_models.gtf.gz OUT.transcript_model_counts.tsv.gz

Moreover, I noticed that in my result, OUT.transcript_model_reads.tsv.gz and OUT.transcript_models.gtf have the same number of transcripts. While OUT.transcript_model_tpm.tsv and OUT.transcript_model_counts.tsv have the same number of transcripts and they are the subset of OUT.transcript_models.gtf

dudududu12138 commented 1 week ago

By the way, I checked myself. I extracted an example of a transcript that is reported in the OUT.transcript_models.gtf file but not in the OUT.transcript_model_counts.tsv file. Below is the transcript id and read id that support the transcript that extracted from OUT.transcript_model_reads.tsv.gz. 1719198912153 Then I extracted the these reads from OUT.read_assignments.tsv.gz and saved them into file(example.read.assignment.txt.) It is odd that some reads that support the novel transcript are full splice match and uniquely assigned to the reference transcript. So I am very confused. Is the OUT.transcript_model_counts.tsv just counts the reads that uniquely mapped to the transcript?

andrewprzh commented 5 days ago

@dudududu12138

Yes, this is really odd. Could you also send me the full log file?

andrewprzh commented 5 days ago

@dudududu12138

Here is what I found so far. Output GTF and supporting reads are likely correct. However, expression estimation algorithm is independent from transcript discovery and for some reason assigns 0 to some of the transcripts. Transcripts with 0 counts/TPMs are not printed to transcript_model_counts.tsv. Hence the difference.

I will now check why expression estimation algorithm assigns 0 to these transcripts despite the fact they are supported by reads.

Best Andrey