chklovski / CheckM2

Assessing the quality of metagenome-derived genome bins using machine learning
GNU General Public License v3.0
UnicodeEncodeError when running testrun and on real data. #112

Open patriciatran opened 3 months ago

patriciatran commented 3 months ago

Hi,

I installed CheckM2 using the yml file and downloaded the database without issue. I get the following Unicode encoding error in the testrun, and also when I run it on a small set (3 genomes) of my own real data. Has anyone seen this error, or does anyone have advice on how to fix it?

Thank you, Patricia

patricia@sulfur:~$ mamba env create -n checkm2 -f checkm2.yml 
patricia@sulfur:~$ conda activate checkm2
(checkm2) patricia@sulfur:~$ pip install CheckM2
(checkm2) patricia@sulfur:~$ checkm2 -h
          ____ _               _    __  __ ____  
         / ___| |__   ___  ___| | _|  \/  |___ \ 
        | |   | '_ \ / _ \/ __| |/ / |\/| | __) | 
        | |___| | | |  __/ (__|   <| |  | |/ __/  
         \____|_| |_|\___|\___|_|\_\_|  |_|_____| 

                ...::: CheckM2 v1.0.1 :::...

  General usage:
    predict         -> Predict the completeness and contamination of genome bins in a folder.
    testrun         -> Runs Checkm2 on internal test genomes to ensure it runs without errors.
    database        -> Download and set up required CheckM2 DIAMOND database for annotation

  Use checkm2 <command> -h for command-specific help.
(checkm2) patricia@sulfur:~$ checkm2 database --download --path /storage1/data10/databases/checkm2/
[08/13/2024 12:26:42 PM] INFO: Command: Download database. Checking internal path information.
[08/13/2024 12:26:44 PM] INFO: Downloading https://zenodo.org/api/records/5571251/files/checkm2_database.tar.gz/content to /storage1/data10/databases/checkm2/checkm2_database.tar.gz.
100%|###################################################################################| 1.74G/1.74G [01:30<00:00, 19.2MiB/s]
[08/13/2024 12:28:15 PM] INFO: Extracting files from archive...
[08/13/2024 12:28:40 PM] INFO: Verifying version and checksums...
[08/13/2024 12:28:40 PM] INFO: Verification success.
[08/13/2024 12:28:48 PM] INFO: Diamond DATABASE downloaded successfully! Consider running <checkm2 testrun> to verify everything works.
(checkm2) patricia@sulfur:~$ checkm2 testrun
[08/13/2024 12:30:27 PM] INFO: Test run: Running quality prediction workflow on test genomes with 1 threads.
[08/13/2024 12:30:27 PM] INFO: Running checksum on test genomes.
[08/13/2024 12:30:27 PM] INFO: Checksum successful.
[08/13/2024 12:30:29 PM] INFO: Calling genes in 3 bins with 1 threads:
    Finished processing 3 of 3 (100.00%) bins.
[08/13/2024 12:30:58 PM] INFO: Calculating metadata for 3 bins with 1 threads:
    Finished processing 3 of 3 (100.00%) bin metadata.
[08/13/2024 12:30:59 PM] INFO: Annotating input genomes with DIAMOND using 1 threads
Traceback (most recent call last):
  File "/home/patricia/miniconda3/envs/checkm2/bin/checkm2", line 265, in <module>
    predictor.prediction_wf(False, 'auto', False, False, False)
  File "/home/patricia/miniconda3/envs/checkm2/lib/python3.8/site-packages/checkm2/predictQuality.py", line 135, in prediction_wf
    diamond_out = diamond_search.run(prodigal_files)
  File "/home/patricia/miniconda3/envs/checkm2/lib/python3.8/site-packages/checkm2/diamond.py", line 119, in run
    self.__call_diamond(protein_chunks, diamond_out)
  File "/home/patricia/miniconda3/envs/checkm2/lib/python3.8/site-packages/checkm2/diamond.py", line 74, in __call_diamond
    sequenceClasses.SeqReader().write_fasta(seq_object, temp_diamond_input.name)
  File "/home/patricia/miniconda3/envs/checkm2/lib/python3.8/site-packages/checkm2/sequenceClasses.py", line 104, in write_fasta
    fout.write('>' + seqId + '\n')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u03a9' in position 6: ordinal not in range(256)

npbhavya commented 3 months ago

I am running into the same error with CheckM2 v1.0.2.

[08/15/2024 10:18:11 AM] INFO: Annotating input genomes with DIAMOND using 30 threads
Traceback (most recent call last):
  File "/home/nala0006/miniconda3/envs/checkm2/bin/checkm2", line 245, in <module>
    args.stdout, args.resume, args.remove_intermediates, args.ttable)
  File "/home/nala0006/miniconda3/envs/checkm2/lib/python3.6/site-packages/checkm2/predictQuality.py", line 135, in prediction_wf
    diamond_out = diamond_search.run(prodigal_files)
  File "/home/nala0006/miniconda3/envs/checkm2/lib/python3.6/site-packages/checkm2/diamond.py", line 119, in run
    self.__call_diamond(protein_chunks, diamond_out)
  File "/home/nala0006/miniconda3/envs/checkm2/lib/python3.6/site-packages/checkm2/diamond.py", line 74, in __call_diamond
    sequenceClasses.SeqReader().write_fasta(seq_object, temp_diamond_input.name)
  File "/home/nala0006/miniconda3/envs/checkm2/lib/python3.6/site-packages/checkm2/sequenceClasses.py", line 104, in write_fasta
    fout.write('>' + seqId + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character '\u03a9' in position 33: ordinal not in range(128)
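
Both tracebacks stop at the same `fout.write('>' + seqId + '\n')` call and differ only in the codec named (`'latin-1'` in the first, `'ascii'` in the second). That pattern is consistent with the output file being opened in text mode without an explicit encoding, in which case Python falls back to the locale's preferred encoding. The sketch below reproduces the failure outside CheckM2 and shows that passing `encoding="utf-8"` avoids it. The helper names and the sequence ID are hypothetical stand-ins, not CheckM2 code, and how CheckM2 actually opens the file is an assumption.

```python
import os
import tempfile


def write_fasta_header(path, seq_id, encoding):
    """Hypothetical stand-in for the failing write: one FASTA header line."""
    with open(path, "w", encoding=encoding) as fout:
        fout.write(">" + seq_id + "\n")


def try_write(seq_id, encoding):
    """Return 'ok' if the header can be written, else the exception name."""
    fd, path = tempfile.mkstemp(suffix=".faa")
    os.close(fd)
    try:
        write_fasta_header(path, seq_id, encoding)
        return "ok"
    except UnicodeEncodeError as exc:
        return type(exc).__name__
    finally:
        os.remove(path)


# Made-up ID containing U+03A9 (GREEK CAPITAL LETTER OMEGA), the character
# named in both tracebacks.
seq_id = "contig\u03a9_1"

results = {enc: try_write(seq_id, enc) for enc in ("ascii", "latin-1", "utf-8")}
for enc, outcome in results.items():
    print(enc, "->", outcome)
```

Only the `utf-8` write succeeds; `ascii` and `latin-1` both raise `UnicodeEncodeError`, matching the two codecs reported above.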

lyisrae1 commented 3 weeks ago

Could anyone from the CheckM2 team help us, please?
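
For anyone hitting this error, it may help to include the environment's encoding settings in a report, since the codec in the error message appears to follow the locale. The snippet below is a minimal diagnostic sketch to run inside the activated `checkm2` environment. It uses only the Python standard library, and the function name `encoding_report` is made up for illustration.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the settings that decide the default text-file encoding."""
    return {
        "python": sys.version.split()[0],
        # Encoding that open() uses in text mode when none is passed.
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        "PYTHONUTF8": os.environ.get("PYTHONUTF8"),
    }


report = encoding_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `preferred_encoding` is not UTF-8, a commonly used workaround is to select a UTF-8 locale before running `checkm2` (for example `export LC_ALL=en_US.UTF-8`, provided that locale is installed) or, on Python 3.7 and later, to set `PYTHONUTF8=1`. The second traceback comes from Python 3.6, where `PYTHONUTF8` is not available. Whether either setting resolves this particular issue is not confirmed in the thread.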