ConesaLab / SQANTI3

Tool for the Quality Control of Long-Read Defined Transcriptomes
GNU General Public License v3.0

[BUG] IndexError: list index out of range #329

Closed yerry77 closed 1 month ago

yerry77 commented 1 month ago

Is there an existing issue for this?

Have you loaded the SQANTI3.env conda environment?

Problem description

I have removed the '*' strand transcripts, but it still does not work.

Code sample

conda activate SQANTI3.env

export PYTHONPATH=$PYTHONPATH:/data/p/SQANTI3/cDNA_Cupcake/sequence/
export PYTHONPATH=$PYTHONPATH:/data/p/SQANTI3/cDNA_Cupcake/

python /data/p/SQANTI3/SQANTI3-5.2.1/sqanti3_qc_2.py \
    /data1/x/partners/LiFengXian/20231224PPG/20240626PPG2/20240912short_reads/20240913merge_gtf/filtered_gtf.gtf \
    /data1/pub/genome/Human/humanGENCODE/gencode.v46.annotation.gtf \
    /data1/pub/genome/Human/humanGENCODE/GRCh38.p14.genome.fa \
    --CAGE_peak /data1/x/partners/LiFengXian/20231224PPG/raw/20240328SQANTI3/refTSS_v3.3_mouse_coordinate.mm10.bed \
    --polyA_motif_list /data/p/SQANTI3/SQANTI3/data/polyA_motifs/mouse_and_human.polyA_motif.txt \
    -o PPG \
    -d /XCLabServer003_fastIO/20240918SQANTI3/ \
    --cpus 80 \
    --report both \
    --short_reads /XCLabServer003_fastIO/20240918SQANTI3/PPG_short_reads.fofn

Error

Error corrected FASTA /XCLabServer003_fastIO/20240918SQANTI3/PPG_corrected.fasta already exists. Using it...
Predicting ORF sequences...
ORF file /XCLabServer003_fastIO/20240918SQANTI3/PPG_corrected.faa already exists. Using it....
Parsing Reference Transcriptome....
/XCLabServer003_fastIO/20240918SQANTI3/refAnnotation_PPG.genePred already exists. Using it.
**** Parsing Isoforms....
Running calculation of TSS ratio
Traceback (most recent call last):
  File "/data/p/SQANTI3/SQANTI3-5.2.1/sqanti3_qc_2.py", line 2577, in <module>
    main()
  File "/data/p/SQANTI3/SQANTI3-5.2.1/sqanti3_qc_2.py", line 2560, in main
    run(args)
  File "/data/p/SQANTI3/SQANTI3-5.2.1/sqanti3_qc_2.py", line 1875, in run
    isoforms_info, ratio_TSS_dict = isoformClassification(args, isoforms_by_chr, refs_1exon_by_chr, refs_exons_by_chr, junctions_by_chr, junctions_by_gene, start_ends_by_gene, genome_dict, indelsJunc, orfDict, corrGTF, star_out, star_index, SJcovNames, SJcovInfo)
  File "/data/p/SQANTI3/SQANTI3-5.2.1/sqanti3_qc_2.py", line 1545, in isoformClassification
    inside_bed, outside_bed = get_TSS_bed(corrGTF, chr_order)
  File "/data/p/SQANTI3/SQANTI3-5.2.1/utilities/short_reads.py", line 127, in get_TSS_bed
    strand=str(loc[2])
IndexError: list index out of range

Anything else?

No response

Fabian-RY commented 1 month ago

Hi @yerry77, thanks for reporting this problem.

That error seems to occur when trying to read the *corrected.gtf generated by sqanti3_qc.py itself, while parsing the start position, the end position and, specifically, the strand. However, I'm unable to replicate it with my GTF files.

I see that you are using SQANTI3 v5.2.1. We have an updated version, v5.2.2, which I recommend you test to check whether this error still happens in the latest version.

Can you please share the corrected GTF, or check that the *corrected.gtf exists, is not empty, and is correctly formatted? The error seems to be related to reading and parsing that file.
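For the check requested above, here is a minimal sketch in Python. It is not part of SQANTI3; `check_gtf` is a hypothetical helper name, and it only assumes the standard GTF layout of 9 tab-separated columns per record. It reports whether the file is missing or empty and lists any record line that does not have 9 columns.

```python
import os


def check_gtf(path):
    """Return a list of human-readable problems found in a GTF file.

    Hypothetical helper, not part of SQANTI3. An empty list means the
    file exists, is not empty, and every record has 9 tab-separated columns.
    """
    if not os.path.isfile(path):
        return [f"{path} does not exist"]
    if os.path.getsize(path) == 0:
        return [f"{path} is empty"]

    problems = []
    with open(path) as handle:
        for line_no, line in enumerate(handle, start=1):
            # Skip comment/header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 9:
                problems.append(
                    f"line {line_no}: expected 9 columns, found {len(fields)}"
                )
    return problems
```

It could be called as `check_gtf("/path/to/PPG_corrected.gtf")`, where the file name is an assumption based on the `-o PPG` prefix used in the command above; adjust it to the actual location of the corrected GTF.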

Thanks

yerry77 commented 1 month ago

I have confirmed that the *corrected.gtf exists, but its format differs from other GTF files: it has no lines with the feature "gene", so it lacks gene information. Is this the reason? When I ran SQANTI3 before, though, the output GTF also had no "gene" lines. In this case, my input GTF does have "gene" lines; it was produced by merging data obtained from short-read sequencing.

Fabian-RY commented 1 month ago

Lacking the gene feature is expected behavior, so I don't think that is the problem here. The GTF file should have 9 columns, like the small example I attach. example_corrected.txt

Fabian-RY commented 1 month ago

Hi @yerry77

Thanks to #334, which used StringTie data, we noticed a bug in the parsing of this file: when transcripts do not have a strand assigned, SQANTI fails. Do all of your transcripts have a '+' or '-' in the strand column (the 7th column of the GTF), or do some have a dot '.'?

I'm working on a fix for that error, so I wanted to know whether this also applies to your data.
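One way to answer the strand question is to tally the strand values in the GTF. The sketch below is a hypothetical helper, not SQANTI3 code; it assumes the standard GTF layout, in which the strand is the seventh tab-separated column (index 6 when counting from zero) and the feature type is the third. It counts strand values for records of a given feature, "transcript" by default.

```python
from collections import Counter


def strand_counts(gtf_path, feature="transcript"):
    """Count strand values ('+', '-', '.') for records of one feature type.

    Hypothetical helper, not part of SQANTI3. Lines that are comments or
    that do not have 9 tab-separated columns are ignored.
    """
    counts = Counter()
    with open(gtf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9 or fields[2] != feature:
                continue
            # Column 7 of a GTF record (index 6) holds the strand.
            counts[fields[6]] += 1
    return counts
```

Any nonzero count for '.' in the returned `Counter` would mean that some transcripts have no strand assigned. If the file has no "transcript" records, passing `feature="exon"` gives the same information per exon.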

Regards