'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

IsabelFE commented 4 years ago

I am trying to run prepare on some assemblies I got directly from NCBI, and I got this error:

[2020-10-12 16:00:57] : INFO Cutting genomes at each time there are at least 5 'N' in a row, and then, calculating genome size, number of contigs and L90.
Analysis: 0/17 (N/A%) - Elapsed Time: 0:00:00 - ETA: --:--:--
Analysis: ██████ 3/17 ( 17%) - Elapsed Time: 0:00:00 - ETA: 0:00:00
Traceback (most recent call last):
  File "/usr/local/bin/PanACoTA", line 155, in <module>
    main()
  File "/usr/local/bin/PanACoTA", line 21, in main
    action(args)
  File "/Users/isabelfe/Library/Python/3.8/lib/python/site-packages/PanACoTA/subcommands/prepare.py", line 69, in main_from_parse
    main(cmd, arguments.NCBI_species, arguments.NCBI_species_taxid, arguments.outdir,
  File "/Users/isabelfe/Library/Python/3.8/lib/python/site-packages/PanACoTA/subcommands/prepare.py", line 246, in main
    genomes = fg.check_quality(species_linked, db_dir, tmp_dir, l90, nbcont, cutn)
  File "/Users/isabelfe/Library/Python/3.8/lib/python/site-packages/PanACoTA/prepare_module/filter_genomes.py", line 104, in check_quality
    gfunc.analyse_all_genomes(genomes, db_path, tmp_dir, cutn, "prepare", logger, quiet=False)
  File "/Users/isabelfe/Library/Python/3.8/lib/python/site-packages/PanACoTA/annotate_module/genome_seq_functions.py", line 114, in analyse_all_genomes
    res = analyse_genome(genome, dbpath, tmp_path, cut, pat, genomes, soft, logger=logger)
  File "/Users/isabelfe/Library/Python/3.8/lib/python/site-packages/PanACoTA/annotate_module/genome_seq_functions.py", line 185, in analyse_genome
    for line in genf:
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

The reason I downloaded them directly from NCBI instead of using prepare is that I am trying to get 17 strains that don't have a species name. They are Corynebacterium sp. strains like this one: https://www.ncbi.nlm.nih.gov/assembly/GCF_000477955.1/ I tried to get them using the following command, but it didn't work: PanACoTA prepare -s "Corynebacterium sp. KPL1989" -o MyGenomes

asetGem commented 4 years ago

Could you please send me the file that triggers this error (or add it to the repo you already shared with me)? Which command line did you run to get this error message?

IsabelFE commented 4 years ago

It is line 20 in the HELP file in my repo. The genomes I am analyzing are in Genomes_KPL_Broad/Broad_RawFiles. I got them from NCBI directly because they are Corynebacterium sp. strains like this one, https://www.ncbi.nlm.nih.gov/assembly/GCF_000477955.1/, which cannot be retrieved directly using PanACoTA prepare -s "Corynebacterium sp. KPL1989" -o Genomes_KPL_Broad/Broad_RawFiles

asetGem commented 4 years ago

I cannot reproduce this on my side. Using the same command line with your genomes runs correctly (and none of your 16 genomes is discarded). Can you try to update to the latest version (git pull and then ./make upgrade: you should now have PanACoTA -V giving 1.0.1.0) and tell me if you still have the same behavior? I may have changed something since then that solved your problem...

IsabelFE commented 4 years ago

I still got the same error:

[2020-10-19 13:45:41] : INFO PanACoTA version 1.0.1
[2020-10-19 13:45:41] : INFO Command used PanACoTA prepare --norefseq -d Genomes_KPL_Broad/Broad_RawFiles -o Genomes_KPL_Broad/Broad_PanACoTA --max 1 --min 0
[2020-10-19 13:45:41] : INFO 'PanACoTA prepare' will run on 1 core
[2020-10-19 13:45:41] : INFO Total number of genomes for NA: 17
[2020-10-19 13:45:41] : INFO Cutting genomes at each time there are at least 5 'N' in a row, and then, calculating genome size, number of contigs and L90.
Analysis: ██████ 3/17 ( 17%) - Elapsed Time: 0:00:00 - ETA: 0:00:00
Traceback (most recent call last):
  File "/Users/isabelfe/opt/anaconda3/envs/PanACoTa/bin/PanACoTA", line 155, in <module>
    main()
  File "/Users/isabelfe/opt/anaconda3/envs/PanACoTa/bin/PanACoTA", line 21, in main
    action(args)
  File "/Users/isabelfe/opt/anaconda3/envs/PanACoTa/lib/python3.6/site-packages/PanACoTA/subcommands/prepare.py", line 73, in main_from_parse
    arguments.max_dist, arguments.verbose, arguments.quiet)
  File "/Users/isabelfe/opt/anaconda3/envs/PanACoTa/lib/python3.6/site-packages/PanACoTA/subcommands/prepare.py", line 241, in main
    genomes = fg.check_quality(species_linked, db_dir, tmp_dir, l90, nbcont, cutn)
  File "/Users/isabelfe/opt/anaconda3/envs/PanACoTa/lib/python3.6/site-packages/PanACoTA/prepare_module/filter_genomes.py", line 104, in check_quality
    gfunc.analyse_all_genomes(genomes, db_path, tmp_dir, cutn, "prepare", logger, quiet=False)
  File "/Users/isabelfe/opt/anaconda3/envs/PanACoTa/lib/python3.6/site-packages/PanACoTA/annotate_module/genome_seq_functions.py", line 114, in analyse_all_genomes
    res = analyse_genome(genome, dbpath, tmp_path, cut, pat, genomes, soft, logger=logger)
  File "/Users/isabelfe/opt/anaconda3/envs/PanACoTa/lib/python3.6/site-packages/PanACoTA/annotate_module/genome_seq_functions.py", line 185, in analyse_genome
    for line in genf:
  File "/Users/isabelfe/opt/anaconda3/envs/PanACoTa/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
(PanACoTa) isabelfe@Isabels-MBP Coryne_Pangenomics %
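For readers landing here from a search: in UTF-8, bytes in the range 0x80 to 0xBF are continuation bytes and can never begin a character, which is why Python reports "invalid start byte" when it meets one at the start of a character. The snippet below is a minimal sketch that reproduces the same kind of failure on made-up data; the payload and the helper name `first_decode_error` are illustrative assumptions, not taken from PanACoTA or from the files in this issue.

```python
# Hypothetical payload: a FASTA-like text header and sequence followed by
# a stray 0x80 byte, as a binary file might contain.
payload = b">contig1\nACGT\n" + b"\x80\x03"


def first_decode_error(data):
    """Return (position, reason) of the first UTF-8 decoding error, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the offending byte, which is the
        # "position" shown in the error message.
        return err.start, err.reason
    return None


if __name__ == "__main__":
    print(first_decode_error(payload))
```

The position in the message is a byte offset into the data being decoded, so a file that fails this early and this consistently is more likely a non-text file than a damaged genome.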

IsabelFE commented 4 years ago

Should I try to bring these genomes to the pipeline at the annotate step instead of prepare?

asetGem commented 4 years ago

I see that you still have version 1.0.1 (first line of your log) instead of 1.0.1.0 (yes, that's a weird version name; it is temporary, I will change it ;) ). It seems you did not get the latest version!

Here is what I get in PanACoTA_prepare_NA.log:

[2020-10-19 19:01:59] :: INFO :: PanACoTA version 1.0.1.0
[2020-10-19 19:01:59] :: INFO :: Command used
     > PanACoTA prepare --norefseq -d Genomes_KPL_Broad/Broad_RawFiles -o HELP2 --max 1 --min 0
[2020-10-19 19:01:59] :: INFO :: 'PanACoTA prepare' will run on 1 core
[2020-10-19 19:01:59] :: WARNING :: You asked to skip refseq downloads.
[2020-10-19 19:01:59] :: INFO :: Total number of genomes for NA: 16
[2020-10-19 19:01:59] :: INFO :: Cutting genomes at each time there are at least 5 'N' in a row, and then, calculating genome size, number of contigs and L90.
[2020-10-19 19:02:00] :: INFO :: Sorting all 16 genomes by quality
[2020-10-19 19:02:00] :: INFO :: 16 genomes after quality control (0 discarded)
[2020-10-19 19:02:00] :: INFO :: Starting filtering steps according to distance between genomes.
[2020-10-19 19:02:00] :: INFO :: Sketching all genomes...
[2020-10-19 19:02:02] :: INFO :: Computing pairwise distances between all genomes
[2020-10-19 19:02:02] :: INFO :: Reading matrix from txt file generated by Mash.
[2020-10-19 19:02:02] :: INFO :: Saving matrix to npz file to be loaded quicker if needed later
[2020-10-19 19:02:02] :: INFO :: Starting iterative discarding steps
[2020-10-19 19:02:02] :: INFO :: Final number of genomes in dataset: 16
[2020-10-19 19:02:02] :: INFO :: Final list of genomes in the dataset: HELP2/LSTINFO-NA-filtered-0.0_1.0.txt
[2020-10-19 19:02:02] :: INFO :: List of genomes discarded by minhash steps: HELP2/discarded-by-minhash-NA-0.0_1.0.txt
[2020-10-19 19:02:02] :: INFO :: End

However, since you do not want to filter with Mash in your case, yes, you can use the annotate module directly. The prepare module is meant for filtering genomes when needed; as you do not want to discard genomes that are too close or too distant here, just use the annotate module.

asetGem commented 4 years ago

I think I found what happened! Don't you have a binary file in your raw genome folder (something like .DS_Store if you are using Mac OS?). If it is the case, then prepare module tries to read it, but fails as it is binary. This file is not in your directory Genomes_KPL_Broad/Broad_RawFiles version on the git repo. So I don't have it when I try...and it works. Do you have such a file, and if so, can you try to remove it and rerun to see if it works?
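To check a genome folder for such files before running the pipeline, something like the following sketch may help. It tries to read every regular file in the folder as UTF-8 text, including hidden ones such as .DS_Store, and reports those that fail. The function name `find_undecodable_files` is hypothetical and this is not part of PanACoTA.

```python
from pathlib import Path
from typing import List


def find_undecodable_files(folder: str) -> List[Path]:
    """List regular files in `folder` that cannot be read as UTF-8 text.

    Hypothetical helper for illustration; hidden files are included
    because Path.iterdir() does not skip them.
    """
    bad = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        try:
            with open(path, encoding="utf-8") as handle:
                for _ in handle:  # read line by line, as a text parser would
                    pass
        except UnicodeDecodeError:
            bad.append(path)
    return bad


if __name__ == "__main__":
    # Folder name taken from this thread; adjust to your own layout.
    for path in find_undecodable_files("Genomes_KPL_Broad/Broad_RawFiles"):
        print(f"Not valid UTF-8 text: {path}")
```

Any file it reports is a candidate for removal from the input folder (or for moving elsewhere) before rerunning prepare.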

IsabelFE commented 4 years ago

I updated to the most updated version:

(PanACoTa) isabelfe@Isabels-MBP PanACoTA % git pull
Already up to date.
(PanACoTa) isabelfe@Isabels-MBP PanACoTA % ./make upgrade
[2020-10-19 15:30:05] :: Upgrading PanACoTA...
Processing /Users/isabelfe/PanACoTA
Building wheels for collected packages: PanACoTA
  Building wheel for PanACoTA (setup.py) ... done
  Created wheel for PanACoTA: filename=PanACoTA-1.0.1-py3-none-any.whl size=152132 sha256=b33ce25492376922c29c422d29d95a761ef46efdbd9e9f0f2ea492d6aca4ecbc
  Stored in directory: /private/var/folders/qw/c6klkqqd1cd45vskjdpgxqn00000gn/T/pip-ephem-wheel-cache-514p03v6/wheels/6e/7d/a7/dcba676350eab83dcb4a548e50db6a0e12c6ae85f3763eb64f
Successfully built PanACoTA
Installing collected packages: PanACoTA
  Attempting uninstall: PanACoTA
    Found existing installation: PanACoTA 1.0.1
    Uninstalling PanACoTA-1.0.1:
      Successfully uninstalled PanACoTA-1.0.1
Successfully installed PanACoTA-1.0.1
[2020-10-19 15:30:07] :: DONE
(PanACoTa) isabelfe@Isabels-MBP PanACoTA %

IsabelFE commented 4 years ago

However, since you do not want to filter with Mash in your case, yes, you can use the annotate module directly. The prepare module is meant for filtering genomes when needed; as you do not want to discard genomes that are too close or too distant here, just use the annotate module.

For now I just want to include all the genomes in the initial analysis, but later I might need to use mash

IsabelFE commented 4 years ago

I think I found what happened! Don't you have a binary file in your raw genome folder (something like .DS_Store if you are using Mac OS?). If it is the case, then prepare module tries to read it, but fails as it is binary. This file is not in your directory Genomes_KPL_Broad/Broad_RawFiles version on the git repo. So I don't have it when I try...and it works. Do you have such a file, and if so, can you try to remove it and rerun to see if it works?

Thanks!! That fixed the issue!

asetGem commented 4 years ago

I updated to the most updated version:

(PanACoTa) isabelfe@Isabels-MBP PanACoTA % git pull
Already up to date.
(PanACoTa) isabelfe@Isabels-MBP PanACoTA % ./make upgrade
[2020-10-19 15:30:05] :: Upgrading PanACoTA...
Processing /Users/isabelfe/PanACoTA
Building wheels for collected packages: PanACoTA
  Building wheel for PanACoTA (setup.py) ... done
  Created wheel for PanACoTA: filename=PanACoTA-1.0.1-py3-none-any.whl size=152132 sha256=b33ce25492376922c29c422d29d95a761ef46efdbd9e9f0f2ea492d6aca4ecbc
  Stored in directory: /private/var/folders/qw/c6klkqqd1cd45vskjdpgxqn00000gn/T/pip-ephem-wheel-cache-514p03v6/wheels/6e/7d/a7/dcba676350eab83dcb4a548e50db6a0e12c6ae85f3763eb64f
Successfully built PanACoTA
Installing collected packages: PanACoTA
  Attempting uninstall: PanACoTA
    Found existing installation: PanACoTA 1.0.1
    Uninstalling PanACoTA-1.0.1:
      Successfully uninstalled PanACoTA-1.0.1
Successfully installed PanACoTA-1.0.1
[2020-10-19 15:30:07] :: DONE
(PanACoTa) isabelfe@Isabels-MBP PanACoTA %

My bad, I did not push the changes to GitHub, so you couldn't get this version. But anyway, this was not the problem here!

IsabelFE commented 4 years ago

.DS_Store files are in my .gitignore, that is the reason the file was not in the GitHub repo version of my folder. Good call!!

asetGem commented 4 years ago

Can I ask you to open another issue about PanACoTA prepare -s "Corynebacterium sp. KPL1989" -o MyGenomes not working? As it is an independent problem, I'll answer there, to help users find answers to their problems.

IsabelFE commented 4 years ago

OK, I will do so. More than a problem, I guess that issue is a request for an improvement: being able to get individual assemblies using their name or their assembly ID.

asetGem commented 3 years ago

Just to let you know, I now added a check to avoid this situation. When seeing a binary file, it returns an error message saying that this file will be ignored.
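As an illustration only, a guard of this kind could look like the sketch below. It is not PanACoTA's actual implementation: the names `looks_binary` and `list_text_files`, the 8192-byte sample size, and the wording of the message are all assumptions made for the example.

```python
import codecs
import os


def looks_binary(path, sample_size=8192):
    """Heuristic: treat a file as binary if its first bytes contain a NUL
    byte or are not valid UTF-8. Hypothetical helper, not PanACoTA code."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    if b"\x00" in sample:
        return True
    # An incremental decoder with final=False tolerates a multi-byte
    # character that happens to be cut off at the end of the sample.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return True
    return False


def list_text_files(folder):
    """Return the regular files in `folder` that look like text,
    printing a message for each file that is ignored."""
    kept = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        if looks_binary(path):
            print(f"Ignoring binary file: {path}")
            continue
        kept.append(path)
    return kept
```

Sampling only the first bytes keeps the check cheap on large genome files, at the cost of missing a file whose invalid bytes appear later; reading the whole file would be stricter but slower.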

gem-pasteur / PanACoTA

'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte #6