ConesaLab / SQANTI3

Tool for the Quality Control of Long-Read Defined Transcriptomes
GNU General Public License v3.0

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte #291

Closed TingQi2020 closed 6 months ago

TingQi2020 commented 7 months ago

Code sample

/storage/yangjianLab/qiting/software/SQANTI3/sqanti3_qc.py \
    ${GTF_NOVEL} ${GTF_REF} ${fasta_file} \
    --force_id_ignore \
    -o ONT_221samples_stringtie \
    -d ${BASE_DIRT}/QC_output \
    -c ${BASE_DIRT}/SR_Junction/JXBJ23BD-0199-chr21_SJ.out.tab \
    --SR_bam ${BASE_DIRT}/SR_bam/JXBJ23BD-0199-chr21_Aligned.sortedByCoord.out.md.bam \
    --skipORF \
    --CAGE_peak ${SQANTI_FOLDER}/data/ref_TSS_annotation/human.refTSS_v3.1.hg38.bed \
    --polyA_motif_list ${SQANTI_FOLDER}/data/polyA_motifs/mouse_and_human.polyA_motif.txt \
    --cpus 4 \
    --report both

Error

Input pattern: /storage/yangjianLab/qiting/bulkRNA_LR/01.QC_and_Quantification/1.4.quantification/SQANTI/SR_Junction/JXBJ23BD-0199-chr21_SJ.out.tab. The following files found and to be read as junctions:
/storage/yangjianLab/qiting/bulkRNA_LR/01.QC_and_Quantification/1.4.quantification/SQANTI/SR_Junction/JXBJ23BD-0199-chr21_SJ.out.tab
3547 junctions read. 3 junctions added to both strands because no strand information from STAR.
Using provided BAM files for calculating TSS ratio
Traceback (most recent call last):
  File "/storage/yangjianLab/qiting/software/SQANTI3/sqanti3_qc.py", line 2572, in <module>
    main()
  File "/storage/yangjianLab/qiting/software/SQANTI3/sqanti3_qc.py", line 2555, in main
    run(args)
  File "/storage/yangjianLab/qiting/software/SQANTI3/sqanti3_qc.py", line 1875, in run
    isoforms_info, ratio_TSS_dict = isoformClassification(args, isoforms_by_chr, refs_1exon_by_chr, refs_exons_by_chr, junctions_by_chr, junctions_by_gene, start_ends_by_gene, genome_dict, indelsJunc, orfDict, corrGTF)
  File "/storage/yangjianLab/qiting/software/SQANTI3/sqanti3_qc.py", line 1533, in isoformClassification
    for file in b:
  File "/home/yangjianLab/qiting/miniconda3/envs/SQANTI3.env/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

TianYuan-Liu commented 7 months ago

Hi Ting, it seems like there might be an encoding issue with your input files. Could you check the encoding of the .tab file mentioned in the error? If the issue persists, could you provide a sample of the file causing the error? This will help diagnose the problem more effectively.

TingQi2020 commented 7 months ago

Thanks for your prompt response, Tianyuan. I've checked the .tab file and it looks OK. From the log pasted above, it seems that the error occurred when SQANTI3 read the .bam file, which was generated by STAR. Please find the .tab file attached for your test. Thank you in advance.

JXBJ23BD-0199-chr21_SJ.out.tab.zip
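One way to check this reading of the log: byte 0x8b at position 1 matches the second byte of the gzip magic number (0x1f 0x8b), and BAM files are BGZF-compressed, which is gzip-compatible. The traceback line `for file in b:` suggests the file was being iterated as text. The sketch below is not SQANTI3 code; `looks_gzip_compressed` is a hypothetical helper, and the demo file is a stand-in gzip file rather than a real BAM.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip/BGZF stream


def looks_gzip_compressed(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


# Stand-in compressed file for the demo (assumption: not a real BAM).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.bin.gz")
with gzip.open(demo_path, "wb") as handle:
    handle.write(b"placeholder")

is_compressed = looks_gzip_compressed(demo_path)
print(is_compressed)

# Iterating over the compressed file as UTF-8 text fails on the magic bytes.
decode_message = None
try:
    with open(demo_path, encoding="utf-8") as handle:
        for _line in handle:
            pass
except UnicodeDecodeError as err:
    decode_message = str(err)
print(decode_message)
```

If the helper returns True for the path given to --SR_bam, the file is compressed binary and cannot be read line by line as text.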

alexpan00 commented 6 months ago

Hi @TingQi2020

The problem is in reading the BAM file, not the SJ.out.tab. If the BAM file you want to use is the only one in your ${BASE_DIRT}/SR_bam/ directory, you can just provide the directory path without the file name and it will work. Otherwise, you can create a fofn (file of file names) containing the full path to the BAM file you want to use, and pass this fofn to the --SR_bam option.

Sorry for the inconvenience, and I hope this fixes your problem.
Alejandro
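The fofn route suggested above could be sketched as follows. This assumes the fofn is a plain text file with one full BAM path per line, which is the usual meaning of "file of file names" but is not confirmed in this thread. The helper `write_bam_fofn`, the file name `SR_bam.fofn`, and the example BAM name are all hypothetical.

```python
import os
import tempfile


def write_bam_fofn(bam_paths, fofn_path):
    """Write one absolute BAM path per line (assumed fofn layout)."""
    with open(fofn_path, "w", encoding="utf-8") as handle:
        for bam_path in bam_paths:
            handle.write(os.path.abspath(bam_path) + "\n")
    return fofn_path


# Placeholder names for illustration only; substitute your real BAM path.
work_dir = tempfile.mkdtemp()
example_bam = os.path.join(work_dir, "example_Aligned.sortedByCoord.out.bam")
fofn_path = write_bam_fofn([example_bam], os.path.join(work_dir, "SR_bam.fofn"))

# Read the fofn back to confirm it is plain text with one path per line.
with open(fofn_path, encoding="utf-8") as handle:
    fofn_lines = handle.read().splitlines()
print(fofn_lines)
```

The resulting fofn path would then be passed to --SR_bam in place of the BAM path itself.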