ablab / IsoQuant

Transcript discovery and quantification with long RNA reads (Nanopores and PacBio)
https://ablab.github.io/IsoQuant/
Other

invalid continuation byte error #206

Closed cariocow closed 4 days ago

cariocow commented 3 months ago

Hi, thanks for making IsoQuant. When I run it on my ONT sequencing data, I get the following error. Could you please help me fix it? Many thanks! :)

Command line: /home/cario/bin/miniconda3/envs/isoquant/bin/isoquant.py --reference /Temporary-data/cario/reference/hg38_111/Homo_sapiens.GRCh38.dna.toplevel.fa --genedb /Temporary-data/cario/reference/hg38_111/Homo_sapiens.GRCh38.111.gtf --fastq /Temporary-data/cario/BT019327_sup/scnanogps_111/fastq/bt019327_c10.fastq --data_type nanopore -o /Temporary-data/cario/isoformswitchanalyser/isoquant -t 10
2024-06-25 17:50:48,444 - INFO - Running IsoQuant version 3.4.1
2024-06-25 17:50:48,444 - WARNING - Output folder already contains a previous run, some files may be overwritten. Use --resume to resume a failed run. Use --force to avoid this message.
2024-06-25 17:50:48,444 - WARNING - Press Ctrl+C to interrupt the run now.
2024-06-25 17:50:57,446 - INFO - Overwriting the previous run
2024-06-25 17:50:58,479 - WARNING - /Temporary-data/cario/isoformswitchanalyser/isoquant/OUT folder already exists, some files may be overwritten
2024-06-25 17:50:58,480 - WARNING - /Temporary-data/cario/isoformswitchanalyser/isoquant/OUT/aux folder already exists, some files may be overwritten
2024-06-25 17:50:58,480 - INFO - Novel unspliced transcripts will not be reported, set --report_novel_unspliced true to discover them
2024-06-25 17:50:58,481 - INFO - === IsoQuant pipeline started ===
2024-06-25 17:50:58,481 - INFO - gffutils version: 0.13
2024-06-25 17:50:58,481 - INFO - pysam version: 0.22.1
2024-06-25 17:50:58,481 - INFO - pyfaidx version: 0.8.1.1
2024-06-25 17:50:58,481 - INFO - Checking input gene annotation
2024-06-25 17:51:33,997 - INFO - Gene annotation seems to be correct
2024-06-25 17:51:34,187 - INFO - Converting gene annotation file to .db format (takes a while)...
2024-06-26 00:30:25,704 - INFO - Gene database written to /Temporary-data/cario/isoformswitchanalyser/isoquant/Homo_sapiens.GRCh38.111.db
2024-06-26 00:30:25,705 - INFO - Provide this database next time to avoid excessive conversion
2024-06-26 00:30:25,707 - INFO - Indexing reference
2024-06-26 00:30:28,972 - INFO - Converting gene annotation file /Temporary-data/cario/isoformswitchanalyser/isoquant/Homo_sapiens.GRCh38.111.db to .bed format
2024-06-26 00:31:51,928 - INFO - Gene database BED written to /Temporary-data/cario/isoformswitchanalyser/isoquant/Homo_sapiens.GRCh38.111.bed
2024-06-26 00:31:51,939 - INFO - Aligning /Temporary-data/cario/BT019327_sup/scnanogps_111/fastq/bt019327_c10.fastq to the reference, alignments will be saved to /Temporary-data/cario/isoformswitchanalyser/isoquant/OUT/aux/OUT_bt019327_c10_9142ca_ba9abe_2e640b.bam
2024-06-26 00:31:51,942 - INFO - Running minimap2 version 2.28-r1209 (takes a while)
2024-06-26 00:32:13,944 - INFO - Sorting alignments
2024-06-26 00:32:16,202 - INFO - Indexing alignments
2024-06-26 00:32:17,372 - INFO - Loading gene database from /Temporary-data/cario/isoformswitchanalyser/isoquant/Homo_sapiens.GRCh38.111.db
2024-06-26 00:32:17,696 - INFO - Loading reference genome from /Temporary-data/cario/reference/hg38_111/Homo_sapiens.GRCh38.dna.toplevel.fa
2024-06-26 00:32:17,703 - CRITICAL - IsoQuant failed with the following error, please, submit this issue to https://github.com/ablab/IsoQuant/issues
Traceback (most recent call last):
  File "/home/cario/bin/miniconda3/envs/isoquant/bin/isoquant.py", line 808, in <module>
    main(sys.argv[1:])
  File "/home/cario/bin/miniconda3/envs/isoquant/bin/isoquant.py", line 802, in main
    run_pipeline(args)
  File "/home/cario/bin/miniconda3/envs/isoquant/bin/isoquant.py", line 754, in run_pipeline
    dataset_processor = DatasetProcessor(args)
                        ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cario/bin/miniconda3/envs/isoquant/share/isoquant-3.4.1-0/src/dataset_processor.py", line 403, in __init__
    self.reference_record_dict = Fasta(self.args.reference, indexname=args.fai_file_name)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cario/bin/miniconda3/envs/isoquant/lib/python3.12/site-packages/pyfaidx/__init__.py", line 1090, in __init__
    self.faidx = Faidx(
                 ^^^^^^
  File "/home/cario/bin/miniconda3/envs/isoquant/lib/python3.12/site-packages/pyfaidx/__init__.py", line 505, in __init__
    self.build_index()
  File "/home/cario/bin/miniconda3/envs/isoquant/lib/python3.12/site-packages/pyfaidx/__init__.py", line 606, in build_index
    line = line.decode()
           ^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 16: invalid continuation byte

andrewprzh commented 3 months ago

Dear @cariocow

Something odd is happening while pyfaidx reads the reference genome. Could you send me the first few lines of the reference FASTA file?

Best,
Andrey
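For anyone hitting the same `UnicodeDecodeError`, the check requested above can be sketched in a few lines of Python. This is only a minimal illustration, not part of IsoQuant or pyfaidx: the helper name `first_undecodable_line` is hypothetical, and it assumes the file is small enough to scan line by line in binary mode. It reports the first line that fails to decode, together with the offset of the offending byte within that line.

```python
from typing import Optional, Tuple


def first_undecodable_line(path: str, encoding: str = "utf-8") -> Optional[Tuple[int, int, bytes]]:
    """Return (line_number, offset_in_line, raw_line) for the first line that
    cannot be decoded with `encoding`, or None if every line decodes cleanly.

    Hypothetical helper for illustration; line numbers are 1-based.
    """
    # Read in binary mode so that bad bytes do not raise while iterating.
    with open(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as err:
                # err.start is the index of the first byte that failed to decode.
                return line_number, err.start, raw
    return None
```

Calling `first_undecodable_line("Homo_sapiens.GRCh38.dna.toplevel.fa")` returns `None` for a clean file; any other result points at the line worth posting in the issue. A corrupted or truncated download is one plausible cause of such bytes in a FASTA file.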

cariocow commented 3 months ago

Sorry, I fixed it by re-downloading the reference.

Then I got another issue. Here's the log:

Command line: /home/cario/bin/miniconda3/envs/isoquant/bin/isoquant.py --reference /Temporary-data/cario/reference/hg38_111/Homo_sapiens.GRCh38.dna.toplevel.fa --genedb /Temporary-data/cario/reference/hg38_111/Homo_sapiens.GRCh38.111.gtf --fastq /Temporary-data/cario/BT019327_sup/scnanogps_111/fastq/bt019327_c10.fastq --data_type nanopore -o /Temporary-data/cario/isoformswitchanalyser/isoquant -t 10
2024-06-26 11:16:48,460 - INFO - Running IsoQuant version 3.4.1
2024-06-26 11:16:48,460 - WARNING - Output folder already contains a previous run, some files may be overwritten. Use --resume to resume a failed run. Use --force to avoid this message.
2024-06-26 11:16:48,460 - WARNING - Press Ctrl+C to interrupt the run now.
2024-06-26 11:16:57,462 - INFO - Overwriting the previous run
2024-06-26 11:16:58,464 - WARNING - /Temporary-data/cario/isoformswitchanalyser/isoquant/OUT folder already exists, some files may be overwritten
2024-06-26 11:16:58,464 - WARNING - /Temporary-data/cario/isoformswitchanalyser/isoquant/OUT/aux folder already exists, some files may be overwritten
2024-06-26 11:16:58,465 - INFO - Novel unspliced transcripts will not be reported, set --report_novel_unspliced true to discover them
2024-06-26 11:16:58,465 - INFO - === IsoQuant pipeline started ===
2024-06-26 11:16:58,465 - INFO - gffutils version: 0.13
2024-06-26 11:16:58,465 - INFO - pysam version: 0.22.1
2024-06-26 11:16:58,465 - INFO - pyfaidx version: 0.8.1.1
2024-06-26 11:16:58,465 - INFO - Checking input gene annotation
2024-06-26 11:17:31,049 - INFO - Gene annotation seems to be correct
2024-06-26 11:17:31,248 - INFO - Converting gene annotation file to .db format (takes a while)...
2024-06-26 17:59:45,246 - INFO - Gene database written to /Temporary-data/cario/isoformswitchanalyser/isoquant/Homo_sapiens.GRCh38.111.db
2024-06-26 17:59:45,247 - INFO - Provide this database next time to avoid excessive conversion
2024-06-26 17:59:45,248 - INFO - Indexing reference
2024-06-26 17:59:45,249 - INFO - Converting gene annotation file /Temporary-data/cario/isoformswitchanalyser/isoquant/Homo_sapiens.GRCh38.111.db to .bed format
2024-06-26 18:01:14,013 - INFO - Gene database BED written to /Temporary-data/cario/isoformswitchanalyser/isoquant/Homo_sapiens.GRCh38.111.bed
2024-06-26 18:01:14,024 - INFO - Aligning /Temporary-data/cario/BT019327_sup/scnanogps_111/fastq/bt019327_c10.fastq to the reference, alignments will be saved to /Temporary-data/cario/isoformswitchanalyser/isoquant/OUT/aux/OUT_bt019327_c10_b3d719_467c1c_f98cde.bam
2024-06-26 18:01:14,027 - INFO - Running minimap2 version 2.28-r1209 (takes a while)
2024-06-26 18:01:37,859 - INFO - Sorting alignments
2024-06-26 18:01:40,610 - INFO - Indexing alignments
2024-06-26 18:01:41,794 - INFO - Loading gene database from /Temporary-data/cario/isoformswitchanalyser/isoquant/Homo_sapiens.GRCh38.111.db
2024-06-26 18:01:42,146 - INFO - Loading reference genome from /Temporary-data/cario/reference/hg38_111/Homo_sapiens.GRCh38.dna.toplevel.fa
2024-06-26 18:01:42,184 - INFO - Processing 1 experiment
2024-06-26 18:01:42,184 - INFO - Processing experiment OUT
2024-06-26 18:01:42,184 - INFO - Experiment has 1 BAM file: /Temporary-data/cario/isoformswitchanalyser/isoquant/OUT/aux/OUT_bt019327_c10_b3d719_467c1c_f98cde.bam
2024-06-26 18:01:42,184 - INFO - Collecting read alignments
2024-06-26 18:01:43,004 - INFO - Processing chromosome 6
2024-06-26 18:01:43,004 - INFO - Processing chromosome 3
2024-06-26 18:01:43,035 - INFO - Processing chromosome 5
2024-06-26 18:01:43,049 - INFO - Processing chromosome 1
2024-06-26 18:01:43,087 - INFO - Processing chromosome 2
2024-06-26 18:01:43,110 - INFO - Processing chromosome 8
2024-06-26 18:01:43,122 - INFO - Processing chromosome 7
2024-06-26 18:01:43,122 - INFO - Processing chromosome X
2024-06-26 18:01:43,143 - INFO - Processing chromosome 9
2024-06-26 18:01:43,153 - INFO - Processing chromosome 4
2024-06-26 18:01:43,774 - INFO - Processing chromosome 11
2024-06-26 18:01:43,774 - INFO - Processing chromosome 10
2024-06-26 18:01:43,812 - INFO - Processing chromosome 12
2024-06-26 18:01:43,856 - INFO - Processing chromosome 13
2024-06-26 18:01:43,881 - INFO - Processing chromosome 14
2024-06-26 18:01:43,907 - INFO - Processing chromosome 15
2024-06-26 18:01:43,911 - INFO - Processing chromosome 16
2024-06-26 18:01:43,970 - INFO - Processing chromosome 18
2024-06-26 18:01:43,975 - INFO - Processing chromosome 17
2024-06-26 18:01:44,098 - INFO - Processing chromosome 20
2024-06-26 18:01:44,490 - INFO - Processing chromosome 19
2024-06-26 18:01:44,502 - INFO - Processing chromosome Y
2024-06-26 18:01:44,532 - INFO - Processing chromosome 22
2024-06-26 18:01:44,568 - INFO - Processing chromosome 21
2024-06-26 18:01:44,606 - INFO - Processing chromosome HG76_PATCH
2024-06-26 18:01:44,981 - CRITICAL - IsoQuant failed with the following error, please, submit this issue to https://github.com/ablab/IsoQuant/issues
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/cario/bin/miniconda3/envs/isoquant/lib/python3.12/concurrent/futures/process.py", line 263, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cario/bin/miniconda3/envs/isoquant/lib/python3.12/concurrent/futures/process.py", line 212, in _process_chunk
    return [fn(*args) for args in chunk]
            ^^^^^^^^^
  File "/home/cario/bin/miniconda3/envs/isoquant/share/isoquant-3.4.1-0/src/dataset_processor.py", line 129, in collect_reads_in_parallel
    AlignmentCollector(chr_id, bam_file_pairs, args, illumina_bam, gffutils_db, current_chr_record, read_grouper)
  File "/home/cario/bin/miniconda3/envs/isoquant/share/isoquant-3.4.1-0/src/alignment_processor.py", line 240, in __init__
    self.bam_pairs[0][0].get_reference_length(self.chr_id),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pysam/libcalignmentfile.pyx", line 1919, in pysam.libcalignmentfile.AlignmentFile.get_reference_length
  File "pysam/libcalignmentfile.pyx", line 511, in pysam.libcalignmentfile.AlignmentHeader.get_reference_length
KeyError: 'unknown reference 1'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/cario/bin/miniconda3/envs/isoquant/bin/isoquant.py", line 808, in <module>
    main(sys.argv[1:])
  File "/home/cario/bin/miniconda3/envs/isoquant/bin/isoquant.py", line 802, in main
    run_pipeline(args)
  File "/home/cario/bin/miniconda3/envs/isoquant/bin/isoquant.py", line 755, in run_pipeline
    dataset_processor.process_all_samples(args.input_data)
  File "/home/cario/bin/miniconda3/envs/isoquant/share/isoquant-3.4.1-0/src/dataset_processor.py", line 415, in process_all_samples
    self.process_sample(sample)
  File "/home/cario/bin/miniconda3/envs/isoquant/share/isoquant-3.4.1-0/src/dataset_processor.py", line 439, in process_sample
    self.collect_reads(sample)
  File "/home/cario/bin/miniconda3/envs/isoquant/share/isoquant-3.4.1-0/src/dataset_processor.py", line 511, in collect_reads
    for storage, read_groups, alignment_stats in results:
  File "/home/cario/bin/miniconda3/envs/isoquant/lib/python3.12/concurrent/futures/process.py", line 642, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/home/cario/bin/miniconda3/envs/isoquant/lib/python3.12/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cario/bin/miniconda3/envs/isoquant/lib/python3.12/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/cario/bin/miniconda3/envs/isoquant/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/cario/bin/miniconda3/envs/isoquant/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
KeyError: 'unknown reference 1'

andrewprzh commented 3 months ago

It sounds like your GTF file and your FASTA reference use different chromosome names (e.g. "chr1" vs. "1"). Could you check this?
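A rough way to run this check is sketched below in Python. It is an illustration only and not an IsoQuant feature: `gtf_names`, `fasta_names` and `name_mismatch` are hypothetical helpers. The sketch assumes a tab-separated GTF whose comment lines start with `#`, and takes the FASTA sequence name to be the text between `>` and the first whitespace.

```python
from typing import Dict, Iterable, Set


def gtf_names(lines: Iterable[str]) -> Set[str]:
    """Collect the first (seqname) column of every non-comment GTF line."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def fasta_names(lines: Iterable[str]) -> Set[str]:
    """Collect sequence names: the text after '>' up to the first whitespace."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # ignore a bare '>' line
                names.add(fields[0])
    return names


def name_mismatch(gtf_lines: Iterable[str], fasta_lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Return the names that appear in only one of the two files."""
    gtf, fasta = gtf_names(gtf_lines), fasta_names(fasta_lines)
    return {"only_in_gtf": gtf - fasta, "only_in_fasta": fasta - gtf}
```

Open file handles can be passed directly, for example `name_mismatch(open(gtf_path), open(fasta_path))`, since both helpers just iterate over lines. An empty `only_in_gtf` set means every annotated sequence exists in the reference.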

cariocow commented 3 months ago

Both of them seem to use the same chromosome names, e.g. "1" for chromosome 1.

The head of the GTF file is as follows:

#!genome-build GRCh38.p14
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession GCA_000001405.29
#!genebuild-last-updated 2023-07

1 havana gene 182696 184174 . + . gene_id "ENSG00000279928"; gene_version "2"; gene_name "DDX11L17"; gene_source "havana"; gene_biotype "unprocessed_pseudogene";
1 havana transcript 182696 184174 . + . gene_id "ENSG00000279928"; gene_version "2"; transcript_id "ENST00000624431"; transcript_version "2"; gene_name "DDX11L17"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; transcript_name "DDX11L17-201"; transcript_source "havana"; transcript_biotype "unprocessed_pseudogene"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";

And the headers of the reference FASTA are:

list_fasta_headers("/Temporary-data/cario/reference/hg38_111/Homo_sapiens.GRCh38.dna.toplevel.fa")
1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
2 dna:chromosome chromosome:GRCh38:2:1:242193529:1 REF
3 dna:chromosome chromosome:GRCh38:3:1:198295559:1 REF
4 dna:chromosome chromosome:GRCh38:4:1:190214555:1 REF
5 dna:chromosome chromosome:GRCh38:5:1:181538259:1 REF
6 dna:chromosome chromosome:GRCh38:6:1:170805979:1 REF
7 dna:chromosome chromosome:GRCh38:7:1:159345973:1 REF
8 dna:chromosome chromosome:GRCh38:8:1:145138636:1 REF
9 dna:chromosome chromosome:GRCh38:9:1:138394717:1 REF
10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF
11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF
12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF
13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF
14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF
15 dna:chromosome chromosome:GRCh38:15:1:101991189:1 REF
16 dna:chromosome chromosome:GRCh38:16:1:90338345:1 REF
17 dna:chromosome chromosome:GRCh38:17:1:83257441:1 REF
18 dna:chromosome chromosome:GRCh38:18:1:80373285:1 REF
19 dna:chromosome chromosome:GRCh38:19:1:58617616:1 REF
20 dna:chromosome chromosome:GRCh38:20:1:64444167:1 REF
21 dna:chromosome chromosome:GRCh38:21:1:46709983:1 REF
22 dna:chromosome chromosome:GRCh38:22:1:50818468:1 REF
X dna:chromosome chromosome:GRCh38:X:1:156040895:1 REF
Y dna:chromosome chromosome:GRCh38:Y:1:57227415:1 REF
MT dna:chromosome chromosome:GRCh38:MT:1:16569:1 REF
...

cariocow commented 3 months ago

For your information, I got both files from Ensembl.
gtf: https://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.GRCh38.111.gtf.gz
ref: https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz

Thanks for your help :)

andrewprzh commented 3 months ago

It seems that the problem is in the BAM file. Could you send me a few lines and a header from the BAM file?
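The header can be dumped with `samtools view -H <file>.bam`, and each `@SQ` line in it names one reference sequence that the BAM file knows about. The sketch below parses that header text and reports which expected names are absent. It is a hedged illustration: `sq_names` and `missing_from_bam` are hypothetical helpers, not pysam or IsoQuant functions. According to the traceback, `KeyError: 'unknown reference 1'` comes from pysam's `get_reference_length`, which is consistent with "1" missing from the header, although the thread does not confirm the cause.

```python
from typing import Iterable, List


def sq_names(header_text: str) -> List[str]:
    """Extract the SN (sequence name) value from every @SQ line of a SAM header."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue  # only @SQ lines describe reference sequences
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def missing_from_bam(header_text: str, expected: Iterable[str]) -> List[str]:
    """Return the expected names that have no @SQ entry, in their original order."""
    present = set(sq_names(header_text))
    return [name for name in expected if name not in present]
```

For example, saving the output of `samtools view -H` to a text file and passing its contents with `expected=["1", "2", "X"]` lists any of those chromosomes the BAM header lacks.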

andrewprzh commented 4 days ago

Please reopen if the issue is still there.