ConesaLab / SQANTI3

Tool for the Quality Control of Long-Read Defined Transcriptomes
GNU General Public License v3.0
189 stars 46 forks source link

Isoforms are misplaced and contain deletions (in-house reference/annotation) #251

Closed eprdz closed 4 months ago

eprdz commented 5 months ago

When running a specific dataset in SQANTI3 (both v.5.2 and v.4.3), the final isoforms.fasta file contains isoforms that have some mononucleotide deletions and are misplaced compared to their expected positions according to the gtf file. Do you know what might be happening? I would not mind sharing a subset of the data.

Thank you in advance.

carolinamonzo commented 5 months ago

Hi @eprdz,

Thanks for using SQANTI3! This only happened to you on the one dataset? If you have run SQANTI3 successfully before on other datasets, I would imagine that one of your input files (one of the GTFs or the reference genome fasta) are incorrectly formatted. Which software did you use to make your transcript models GTF? Could you share your SQANTI3 command? Am I understanding correctly that SQANTI3 run without giving any errors or warnings?

Best, Carolina.

eprdz commented 5 months ago

Thank you for your help! Indeed the problem is related to the reference annotation. I obtained a GTF file using AUGUSTUS with the same reference genome used in SQANTI3. Nevertheless, I could not use that GTF file with SQANTI3, so I used TSEBRA to fix de formatting as it is exposed in this issue. As you can see in this issue, the "gene" and "transcript" lines are wrongly formatted.

I know that this is not directly related to SQANTI3, but it might help other people with the same issue. Do you know any other way to reformat the GTF file to make it run on SQANTI3? Thank you for your help again!

carolinamonzo commented 5 months ago

Hi @eprdz , Sorry, I'm very confused, how is your GTF formatted? Would you mind sharing the first couple of lines so I can see why your GTF is not recognised as a GTF?

eprdz commented 5 months ago

Yes, this is the GTF file obtained from using AUGUSTUS:

Mind_KP_01      AUGUSTUS        gene    19239   31252   .       +       .       g1
Mind_KP_01      AUGUSTUS        transcript      19239   31252   0.67    +       .       g1.t1
Mind_KP_01      AUGUSTUS        start_codon     19239   19241   .       +       0       transcript_id "g1.t1"; gene_id "g1";
Mind_KP_01      AUGUSTUS        CDS     19239   19456   0.98    +       0       transcript_id "g1.t1"; gene_id "g1";
Mind_KP_01      AUGUSTUS        exon    19239   19456   .       +       .       transcript_id "g1.t1"; gene_id "g1";
Mind_KP_01      AUGUSTUS        intron  19457   19537   1       +       .       transcript_id "g1.t1"; gene_id "g1";

And this is the same lines after TSEBRA (no "transcript" nor "gene" lines):

Mind_KP_01      AUGUSTUS        start_codon     19239   19241   .       +       0       transcript_id "Mind_KP_01+_g1.t1"; gene_id "Mind_KP_01+_g1";
Mind_KP_01      AUGUSTUS        CDS     19239   19456   0.98    +       0       transcript_id "Mind_KP_01+_g1.t1"; gene_id "Mind_KP_01+_g1";
Mind_KP_01      AUGUSTUS        exon    19239   19456   .       +       .       transcript_id "Mind_KP_01+_g1.t1"; gene_id "Mind_KP_01+_g1";
Mind_KP_01      AUGUSTUS        intron  19457   19537   1       +       .       transcript_id "Mind_KP_01+_g1.t1"; gene_id "Mind_KP_01+_g1";
carolinamonzo commented 5 months ago

Hi Quique,

Your GTF looks fine, we sometimes also remove the CDS when using AUGUSTUS, since they represent the exons (as you can see, they have the same positions). Regarding your question, I still don't understand what exactly you are asking. It's possible the original fasta of your isoforms differs from the corrected.fasta since SQANTI makes that one using the reference (you only inputted the GTF), and this may be slightly different to the sample you used to extract the transcript models (for example point mutations or indels in your sample compared to the reference). But as I said, I'm not sure what the question is, so please let me know if you solved the problem with this info.

Best, Carolina.

eprdz commented 5 months ago

Hi Carol, thank you for your attention with this strange case.

What I really want to know is if you know what can be causing the issue that I described.

Sorry if I did not explain well, I will try to be clearer: I have a PacBio dataset, an in-house reference genome and an in-house annotation file (this annotation file was created using AUGUSTUS on the reference genome). The PacBio dataset is simply collapsed using minimap2 + cDNA cupcake (collapse_isoforms_by_sam.py) to make a transcriptome-gtf file. This file undergoes SQANTI3 with the mentioned reference genome and annotation file. When you map the corrected.fasta against this in-house reference genome and you see the alignment (let's say inhouse_isoforms.bam in the bottom image) in a genome browser you see a lot of indels and that are really misplaced.

To get more insights about what can be causing the problem, I used minimap2 + cDNA cupcake + SQANTI3 with an NCBI reference from the same species The corrected.fasta was mapped against the in-house reference genome using minimap2 in order to compare it with the other alignment (in the image, this alignment is called ncbi_isoforms.bam). As you can see in this case isoforms are correctly placed and there are no indels.

The CD_Hit_Pre_sqanti3_Standard_isoforms.bam are the reads used before SQANTI3.

Forgot to mention that the "transcriptome.gtf" in the top of the image was created with SQANTI3 with the in-house reference genome, so it seems that the only problem with this dataset in SQANTI3 is related to the transcriptome FASTA file, not with the transcriptome GTF file.

image

Maybe now it is clearer. Thank you again!

alexpan00 commented 5 months ago

Hi @eprdz,

I will do a recap to check if I understood everything:

When you provide SQANTI3 a GTF as input, the corrected FASTA is generated directly from the genome sequence. Could you obtain the FASTA sequence for the problematic transcript out of SQANTI3 and compare it to the one provided by SQANTI3? You can use:

gffread input.gtf -g in_house_genome.fasta -w output.fasta

eprdz commented 5 months ago

As you can see, with your proposal (gffread.bam) everything is aligned well to the in-house genome (which makes sense I think).

image

I don't know why but I think that the problem comes from the genome. I aligned some portion of the first chromosome of both genomes (the reference and the in-house) and there are lots of small indels and deletions that seems to match to what is happening to the long reads:

image

carolinamonzo commented 4 months ago

Hi @eprdz since this seems to be a problem with your in-house reference genome and not SQANTI, I will close this issue. I hope you figure out what is going on with your data! Best wishes.