Magdoll / SQANTI2

SQANTI2 is now replaced by SQANTI3. Please go to: https://github.com/ConesaLab/SQANTI3
Other
38 stars 15 forks source link

Questions about the results #4

Closed melissavantilborg closed 5 years ago

melissavantilborg commented 5 years ago

Hi @Magdoll,

I ran Sqanti2 and I have some questions about the results. First, I noticed that a total of 5475 sequences got added to the .renamed_corrected.fasta file. In the classification output file, they have the suffix _dup2 or _dup3. I tried aligning some of these sequences (for example transcript/22 and transcript/22_dup2), but there seems to be no significant similarity between those sequences. Could you give me some clarification about where these sequences came from?

Also, on your page, you stated that field 27 (FSM_class) from the classification output file should be ignored. Does this mean that the following fields (28: ORF_length, 29: CDS_length, 30: CDS_start and 31: CDS_end) should also be ignored?

Melissa

Magdoll commented 5 years ago

Hi @Mel66 ,

(1) Sorry, my mind is blanking on the _dup suffix and I cannot find in my code where I would've added that. Can you be more specific? Where did the _dup suffix show up? In which files where they not present and in which files did they show up?

(2) Only FSM_class should be ignored for now since I have not implemented them yet. The other fields 28-31 about ORF and CDS prediction are valid and can be used, unless you specifically used the --skipORF option in which case they will all be zero.

--Liz

melissavantilborg commented 5 years ago

The suffix _dup appears in classification.txt, junctions.txt, .renamed_corrected.genePred and .renamed_corrected.gtf. It doesn't appear in renamed_corrected.fasta, but this file contains additional sequences. These added sequences are named the same as some of the original sequences. For example, renamed_corrected.fasta contains 4 sequences named transcript/4922, while the original fasta and renamed.fasta only have 1 sequence with this name. The four previously mentioned files contain transcript/4922_dup2, transcript/4922_dup3, and transcript/4922_dup4. According to classifications.txt, these 3 transcripts are encoded by different genes.

Thank you for your help.

Melissa

Magdoll commented 5 years ago

Hi @Mel66 ,

I'm sorry for the late response as I was on vacation last week.

If renamed_corrected.fasta already had 4 duplicate sequences, I'm going to guess this is because when it was mapped back to the genome (for the genome-based error correction which is SQANTI's first step), it got mapped to multiple places. This is entirely possible when different aligners (or even different versions of the same aligner) was used. For example, GMAP might decide to map a sequence continuously while minimap2 might decide to break it up.

Are you willing to share the input data confidentially? Is this on human? To reproduce your results, I would need the input fasta, the ref genome, and the ref annotation. If you can share, please give me your email so I can request a file upload.

--Liz

melissavantilborg commented 5 years ago

Hi Magdoll,

sorry for the late response, I didn’t see you had answered already. You were right, some of the sequences mapped to multiple genes on the genome, which caused the dups. I worked around this problem by using a gtf file for classification instead of the fasta file.

Sqanti_qc2.py didn’t seem to work with gtf files because it has some bugs. I made some changed for myself, and my results seem fine now. I added the errors I got below, maybe they can be of use to you.

Melissa.


I got this error while running SQANTI with a gtf (-g). I "fixed" this by adding "if not args.gtf:" to line 1377.

R scripting front-end version 3.5.0 (2018-04-23) Cleaning up isoform IDs... Cleaned up isoform fasta file written to: /home/mvantilborg/hq.renamed.fasta Traceback (most recent call last): File "SQANTI2/sqanti_qc2.py", line 1395, in main() File "SQANTI2/sqanti_qc2.py", line 1384, in main if args.aligner_choice == "minimap2": AttributeError: 'Namespace' object has no attribute 'aligner_choice'


Then I got the error below. I worked around it by commenting out line 1369, because it looked like the function rename_isoform_seqids was trying to read my gtf like it was a fasta or fastq file, and the resulting renamed fasta file (hq.renamed.fasta) was empty.

R scripting front-end version 3.5.0 (2018-04-23) Cleaning up isoform IDs... Cleaned up isoform fasta file written to: /home/mvantilborg/hq.renamed.fasta Running SQANTI... Parsing provided files.... Reading genome fasta /usr/local/Genomes/species/H.sapiens/GRCh38_no_alt_analysis_set/reference.fa.... Skipping aligning of sequences because gtf file was provided.

ERROR: gtf has not annotation lines.


Then I got this error below. I worked around this by using the –skipORF prediction because I was doing an ORF prediction myself. But it was probably caused by skipping the renaming.

R scripting front-end version 3.5.0 (2018-04-23) Cleaning up isoform IDs... Cleaned up isoform fasta file written to: /home/mvantilborg/hq.gff Running SQANTI... Parsing provided files.... Reading genome fasta /usr/local/Genomes/species/H.sapiens/GRCh38_no_alt_analysis_set/reference.fa.... Skipping aligning of sequences because gtf file was provided.

Indels will be not calculated since you ran SQANTI without alignment step (SQANTI with gtf format as transcriptome input). **** Predicting ORF sequences... Expected GMST output IDs to be of format ' gene_4|GeneMark.hmm|_aa|||' but instead saw: PB.1.1 gene=PB.1 gene_1|GeneMark.hmm|138_aa|+|3|419! Abort!

Magdoll commented 5 years ago

Close unless further notice.