UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

KChen-lab / Monopogen

SNV calling from single cell sequencing

GNU General Public License v3.0


UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte #14

Open moOsmanMD opened 12 months ago

moOsmanMD commented 12 months ago

Hello, thank you so much for developing Monopogen.

I get this error when I run the preProcess module. I tried my own BAM files and also the BAM files you provided as examples, and unfortunately the issue persists. I also tried Python 3.8 and Python 3.11, but the issue is not resolved.

for line in f_in:
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I would be very grateful if you could help me with that. Thanks.

jinzhuangdou commented 12 months ago

This looks like a format issue. Could you show the value of f_in? It seems that no file was provided. Also, on which line of the code is "for line in f_in" located?

moOsmanMD commented 12 months ago

Thanks a lot for your response.

This is the output I get:

[2023-09-08 15:29:32,726] INFO Monopogen.py Performing data preprocess before variant calling...
[2023-09-08 15:29:32,726] INFO germline.py Parameters in effect:
[2023-09-08 15:29:32,726] INFO germline.py --subcommand = [preProcess]
[2023-09-08 15:29:32,726] INFO germline.py --bamFile = [example/A.bam]
[2023-09-08 15:29:32,726] INFO germline.py --out = [outDir]
[2023-09-08 15:29:32,726] INFO germline.py --app_path = [apps]
[2023-09-08 15:29:32,726] INFO germline.py --max_mismatch = [3]
[2023-09-08 15:29:32,726] INFO germline.py --nthreads = [1]
Traceback (most recent call last):
  File "/Users/moo4005/Desktop/monopo/Monopogen/src/Monopogen.py", line 435, in <module>
    main()
  File "/Users/moo4005/Desktop/monopo/Monopogen/src/Monopogen.py", line 428, in main
    args.func(args)
  File "/Users/moo4005/Desktop/monopo/Monopogen/src/Monopogen.py", line 291, in preProcess
    for line in f_in:
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
(sepMono) moo4005@Mohameds-Laptop Monopogen %

slinnarsson commented 11 months ago

The docs say

  -b BAMFILE, --bamFile BAMFILE
                        The bam file for the study sample, the bam file should be sorted. If there are multiple samples, each row with each sample (default: None)

which seems to say you should provide the path to the BAM file but this is wrong. You're getting the unicode error because the code is trying to read a text file, but BAM is binary.
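The exact error message can be reproduced without Monopogen. BAM files are BGZF-compressed, which is gzip-compatible, so they start with the gzip magic bytes 0x1f 0x8b; the second byte, 0x8b, is not a valid UTF-8 start byte. The sketch below uses ordinary gzip-compressed bytes as a stand-in for a BAM file; the helper name try_decode is hypothetical and is not part of Monopogen.

```python
import gzip

# Stand-in for the start of a BAM file: any gzip stream begins with 0x1f 0x8b.
payload = gzip.compress(b"not a real BAM, only gzip-compressed bytes")
assert payload[:2] == b"\x1f\x8b"


def try_decode(data):
    """Return the UnicodeDecodeError raised by decoding data as UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


err = try_decode(payload)
# 0x1f is a valid one-byte character, so decoding fails on the byte at index 1.
print(err)
```

Opening such a file in text mode and iterating over its lines triggers the same decoding step, which is why the traceback ends inside the codec.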

Instead, you should provide the path to a text file that lists the samples and their BAM files. For the example, this file worked for me (save as example/bam.lst):

A,example/A.bam
B,example/B.bam

And then run python src/Monopogen.py preProcess -b example/bam.lst -o out -a apps
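For more than a couple of samples, the list file can be generated rather than typed by hand. This is a minimal sketch that assumes the two-field "sample,path" layout shown in the example above and uses each file name without its extension as the sample name; both are assumptions taken from this thread, not from the Monopogen documentation. The function write_bam_list is hypothetical, and the demo creates empty placeholder files in a temporary directory instead of real BAMs.

```python
import tempfile
from pathlib import Path


def write_bam_list(bam_dir, list_path):
    """Write one 'sample,path' line per .bam file found in bam_dir."""
    bam_dir = Path(bam_dir)
    # Assumption: the sample name is the file name without the .bam suffix.
    lines = [f"{bam.stem},{bam.as_posix()}" for bam in sorted(bam_dir.glob("*.bam"))]
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Demo with empty placeholder files standing in for real BAM files.
with tempfile.TemporaryDirectory() as tmp:
    example = Path(tmp) / "example"
    example.mkdir()
    for name in ("A", "B"):
        (example / f"{name}.bam").touch()
    entries = write_bam_list(example, example / "bam.lst")
    written = (example / "bam.lst").read_text(encoding="utf-8")

print(written)
```

The resulting file is what would be passed to -b, in place of any individual BAM path.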

moOsmanMD commented 11 months ago

@slinnarsson thank you so much for your help.

I have created a bam.lst file in the example folder with the following contents:

A,example/A.bam
B,example/B.bam

I ran this: python src/Monopogen.py preProcess -b example/A.bam -o out -a apps

But I still get the same error. Could you please explain in a bit more detail how to solve it?

Thanks

slinnarsson commented 11 months ago

Sorry, you should provide the list file, not the BAM: "-b example/bam.lst"
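Since passing the BAM itself is an easy mistake to make, a small check before launching the run can catch it with a clearer message than the UnicodeDecodeError. The sketch below is not Monopogen code: read_sample_list is a hypothetical helper, and the "sample,path" layout it validates is assumed from the example earlier in this thread. It rejects any file that starts with the gzip magic bytes, which is how a BAM file begins.

```python
import gzip
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip/BGZF stream


def read_sample_list(path):
    """Return (sample, bam_path) pairs, or raise ValueError with a clear message."""
    with open(path, "rb") as fh:
        if fh.read(2) == GZIP_MAGIC:
            raise ValueError(f"{path} looks compressed (a BAM?); expected a text list file")
    pairs = []
    with open(path, encoding="utf-8") as fh:
        for number, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            fields = line.split(",")
            if len(fields) != 2:
                raise ValueError(f"{path}:{number}: expected 'sample,path', got {line!r}")
            pairs.append((fields[0], fields[1]))
    return pairs


# Demo: a gzip-compressed placeholder stands in for a real BAM file.
with tempfile.TemporaryDirectory() as tmp:
    fake_bam = Path(tmp) / "A.bam"
    fake_bam.write_bytes(gzip.compress(b"placeholder"))
    bam_list = Path(tmp) / "bam.lst"
    bam_list.write_text("A,example/A.bam\nB,example/B.bam\n", encoding="utf-8")

    pairs = read_sample_list(bam_list)
    try:
        read_sample_list(fake_bam)
        rejected = False
    except ValueError:
        rejected = True

print(pairs, rejected)
```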

moOsmanMD commented 11 months ago

@slinnarsson thanks a lot! Would you please explain how to write the names of the BAMs in the .lst file?