ConesaLab / SQANTI3

Tool for the Quality Control of Long-Read Defined Transcriptomes
GNU General Public License v3.0

CDS not annotated using isoAnnotLite #230

Open Ajay-097 opened 11 months ago

Ajay-097 commented 11 months ago

I ran SQANTI3 with the isoAnnotLite option to get the GFF3 file. I took this approach not because I wanted a tappAS-compatible file, but because I needed the UTR and poly(A) annotations for my file. The run finished successfully and I got an output GFF3 file. However, some CDS features are annotated without start and stop positions and have dots instead. I would like to know whether this is actually a bug or whether it is because of my non-model organism (Strongyloides ratti). Here's the output GFF3 file:

image
almart7 commented 11 months ago

Dear @Ajay-097 , could you show me the command you used? I would like to know which functional annotation file you used with the IsoAnnotLite option.

Ajay-097 commented 11 months ago

Hi @almart7, please find below the command I used.

python /opt/SQANTI3-5.1.2/sqanti3_qc.py \

\ --force_id_ignore -t 30 -o Sratti_output --isoAnnotLite

I have attached the annotation file I used for this step as a txt file. Please let me know if you require any further info. I only got a GFF3 file from WormBase, so I had to convert it to GTF using 'gffread'. [strongyloides_ratti.annotations.txt](https://github.com/ConesaLab/SQANTI3/files/12729439/strongyloides_ratti.annotations.txt)
aarzalluz commented 11 months ago

Hi @Ajay-097, it seems like you used a GFF3 file from a standard database, but you need to use a pre-computed tappAS GFF3 file, which is different. Unfortunately, these are only available for some model organisms (have a look at the wiki site for more info).

You can still have your GTF formatted as a GFF3 using IsoAnnotLite, which will include transcript-level structural annotations, but there will be no protein features added, because these are transferred from the tappAS file. However, if you run SQANTI3 with ORF predictions activated, you may have the coding sequence info there in the classification.txt file.

Ángeles

Ajay-097 commented 11 months ago

Hi @aarzalluz, thanks for your response. I ran SQANTI3 with ORF prediction activated and then used IsoAnnotLite to format my GTF as a GFF3. The CDS is still not properly annotated: there are dots '.' instead of start and end positions, which causes errors when I try to visualize the file. I also noted that the transcripts with the CDS annotation issue are marked as non-coding in the classification.txt file.
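For anyone hitting the same visualization errors, here is a minimal Python sketch for pulling the CDS rows with '.' coordinates out of the GFF3. It assumes the usual tab-separated layout with the feature type in column 3 and start/end in columns 4 and 5. The function name `split_placeholder_cds` and the example rows are made up for illustration; they are not taken from SQANTI3 or from the real output file.

```python
def split_placeholder_cds(gff3_lines):
    """Separate CDS rows whose start or end is '.' from all other lines."""
    kept, flagged = [], []
    for line in gff3_lines:
        fields = line.rstrip("\n").split("\t")
        # Comment/directive lines and short lines are passed through untouched.
        is_feature = len(fields) >= 5 and not line.startswith("#")
        if is_feature and fields[2] == "CDS" and "." in (fields[3], fields[4]):
            flagged.append(line)
        else:
            kept.append(line)
    return kept, flagged


# Hypothetical rows, only to show the expected shape of the input.
example = [
    "##gff-version 3\n",
    "PB.1.1\ttappAS\ttranscript\t1\t1200\t.\t+\t.\tID=PB.1.1\n",
    "PB.1.1\ttappAS\tCDS\t.\t.\t.\t+\t.\tID=Protein_PB.1.1\n",
    "PB.2.1\ttappAS\tCDS\t50\t950\t.\t+\t.\tID=Protein_PB.2.1\n",
]
kept, flagged = split_placeholder_cds(example)
print(len(kept), "lines kept;", len(flagged), "CDS rows without coordinates")
```

Writing `kept` to a new file may be enough to get a viewer to load it, while `flagged` gives the list of affected transcripts to compare against classification.txt. This only works around the symptom; it does not add any missing CDS information.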

almart7 commented 11 months ago

Dear @Ajay-097 I would like to look deeper into this problem. Is it okay for you to share with me the data and/or download links of the files you used? Here is my email.

Ajay-097 commented 11 months ago

Hi @almart7, Thanks for your response. I have sent you an email with all the requested info.

Sparkle-27 commented 5 months ago

Hi @almart7 @aarzalluz, I also ran IsoAnnotLite to format a GTF as a GFF3 file and found CDS annotations with dots '.'. Did you solve the problem? And by the way, how can we get a pre-computed tappAS GFF3 file with protein features for other species? Best wishes.

Sparkle-27 commented 5 months ago

I noticed that most of the CDS entries with dots '.' belong to isoforms annotated as non_coding; most of them are single-exon isoforms without ORF_length, CDS_start and CDS_end values in the *_classification.txt file from SQANTI3 QC.
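A rough way to check this is to list the isoforms marked non_coding in the classification file and compare them with the transcripts whose CDS has '.' coordinates. The Python sketch below assumes the file is tab-separated with a header row, and that the ID and coding-status columns are named `isoform` and `coding`. Those two column names are an assumption here (adjust them to your header); the function name and the example rows are hypothetical.

```python
import csv
import io


def non_coding_isoforms(handle, id_col="isoform", coding_col="coding"):
    """Return the IDs of all rows whose coding status is 'non_coding'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col] for row in reader if row[coding_col] == "non_coding"}


# Hypothetical classification rows; a real run would open *_classification.txt.
example = io.StringIO(
    "isoform\tcoding\tORF_length\tCDS_start\tCDS_end\n"
    "PB.1.1\tnon_coding\tNA\tNA\tNA\n"
    "PB.2.1\tcoding\t300\t50\t950\n"
)
non_coding = non_coding_isoforms(example)
print(sorted(non_coding))
```

If every transcript with a '.' CDS row is in this set, the dots would simply reflect isoforms for which no ORF was predicted, which would be consistent with what is described above.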