aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.

UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 21: ordinal not in range(128) #364

Closed. DmitriiSeverinov closed this issue 6 months ago.

DmitriiSeverinov commented 7 months ago

Hi all,

I am getting an encoding error and have no idea where it is coming from, as my data should not contain any unusual symbols.

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Job stats:
job                            count
---------------------------  -------
AUCell_direct                      1
AUCell_extended                    1
all                                1
download_genome_annotations        1
eGRN_direct                        1
eGRN_extended                      1
get_search_space                   1
motif_enrichment_cistarget         1
motif_enrichment_dem               1
prepare_GEX_ACC_multiome           1
prepare_menr                       1
region_to_gene                     1
scplus_mudata                      1
tf_to_gene                         1
total                             14

Select jobs to execute...
Execute 1 jobs...

[Wed Apr 24 16:36:57 2024]
localrule motif_enrichment_cistarget:
    input: /data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated/region_sets, /data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated/scATAC_1000_INs_annotated.regions_vs_motifs.rankings.feather, /projects/p_scads_spinal_cord/motifs_no_cb.tbl
    output: ctx_results.hdf5, ctx_results.html
    jobid: 9
    reason: Missing output files: ctx_results.hdf5
    threads: 64
    resources: tmpdir=/tmp

OMP: Info #277: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[the OMP message above is repeated 64 more times]
Traceback (most recent call last):
  File "/home/dmse952c/.local/bin/scenicplus", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/dmse952c/.local/lib/python3.11/site-packages/scenicplus/cli/scenicplus.py", line 1137, in main
    args.func(args)
  File "/home/dmse952c/.local/lib/python3.11/site-packages/scenicplus/cli/scenicplus.py", line 386, in motif_enrichment_cistarget
    run_motif_enrichment_cistarget(
  File "/home/dmse952c/.local/lib/python3.11/site-packages/scenicplus/cli/commands.py", line 193, in run_motif_enrichment_cistarget
    cistarget_result.write_hdf5(
  File "/home/dmse952c/.local/lib/python3.11/site-packages/pycistarget/motif_enrichment_result.py", line 197, in write_hdf5
    motif_enrichment = motif_enrichment.astype(
                       ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmse952c/.local/lib/python3.11/site-packages/pandas/core/generic.py", line 6231, in astype
    res_col = col.astype(dtype=cdt, copy=copy, errors=errors)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmse952c/.local/lib/python3.11/site-packages/pandas/core/generic.py", line 6245, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmse952c/.local/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 446, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmse952c/.local/lib/python3.11/site-packages/pandas/core/internals/managers.py", line 348, in apply
    applied = getattr(b, f)(**kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmse952c/.local/lib/python3.11/site-packages/pandas/core/internals/blocks.py", line 527, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmse952c/.local/lib/python3.11/site-packages/pandas/core/dtypes/astype.py", line 299, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmse952c/.local/lib/python3.11/site-packages/pandas/core/dtypes/astype.py", line 230, in astype_array
    values = astype_nansafe(values, dtype, copy=copy)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dmse952c/.local/lib/python3.11/site-packages/pandas/core/dtypes/astype.py", line 170, in astype_nansafe
    return arr.astype(dtype, copy=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 21: ordinal not in range(128)
[Wed Apr 24 16:38:28 2024]
Error in rule motif_enrichment_cistarget:
    jobid: 9
    input: /data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated/region_sets, /data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated/scATAC_1000_INs_annotated.regions_vs_motifs.rankings.feather, /projects/p_scads_spinal_cord/motifs_no_cb.tbl
    output: ctx_results.hdf5, ctx_results.html
    shell:

            scenicplus grn_inference motif_enrichment_cistarget                 --region_set_folder /data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated/region_sets                 --cistarget_db_fname /data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated/scATAC_1000_INs_annotated.regions_vs_motifs.rankings.feather                 --output_fname_cistarget_result ctx_results.hdf5                 --temp_dir /data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated_scplus/tmp/                 --species danio_rerio                 --fr_overlap_w_ctx_db 0.4                 --auc_threshold 0.005                 --nes_threshold 3.0                 --rank_threshold 0.05                 --path_to_motif_annotations /projects/p_scads_spinal_cord/motifs_no_cb.tbl                 --annotation_version v10nr_clust                 --motif_similarity_fdr 0.001                 --orthologous_identity_threshold 0.0                 --annotations_to_use Direct_annot Orthology_annot                 --write_html                 --output_fname_cistarget_html ctx_results.html                 --n_cpu 64

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job motif_enrichment_cistarget since they might be corrupted:
ctx_results.hdf5, ctx_results.html
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-04-24T163657.826425.snakemake.log
WorkflowError:
At least one job did not complete successfully.

And if I check my locale, it seems like everything uses UTF-8 encoding:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=en_GB.UTF-8
LC_TIME=en_GB.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=en_GB.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=en_GB.UTF-8
LC_NAME=en_GB.UTF-8
LC_ADDRESS=en_GB.UTF-8
LC_TELEPHONE=en_GB.UTF-8
LC_MEASUREMENT=en_GB.UTF-8
LC_IDENTIFICATION=en_GB.UTF-8
LC_ALL=

I have also tried to export the following variables to the environment:

export PYTHONIOENCODING=utf8
export LANG=C.UTF-8
export LC_ALL=C.UTF-8
export PYTHONUTF8=1

Didn't help :(
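One possible reason these variables have no effect is that an explicit ASCII encode never consults the locale. The sketch below is a plain-Python illustration under that assumption, not the code path from the traceback: the label is hypothetical, and `ascii_error` is a helper made up for this example. It produces a message of the same form as the one above and shows that the reported position is the index of the offending character in the string.

```python
# Hypothetical label: the character between "type" and "1" is U+2013 (en dash),
# which looks almost identical to the ASCII hyphen-minus "-".
label = "cell_type\u20131"


def ascii_error(text):
    """Return the UnicodeEncodeError raised by an explicit ASCII encode, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as err:
        return err
    return None


err = ascii_error(label)
print(err)        # same form as the message in the traceback
print(err.start)  # index of the first character that cannot be encoded
print(ascii_error("cell_type-1"))  # a plain hyphen-minus encodes without error
```

Read this way, "position 21" in the traceback would point at the 22nd character of whichever string could not be encoded.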

Best, Dmitrii

SeppeDeWinter commented 6 months ago

Hi @DmitriiSeverinov

I have never seen this error before... Would you be able to step manually through the code that the command runs?

This is the command that is causing your issue:


scenicplus grn_inference motif_enrichment_cistarget \
                 --region_set_folder /data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated/region_sets \
                 --cistarget_db_fname /data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated/scATAC_1000_INs_annotated.regions_vs_motifs.rankings.feather \
                 --output_fname_cistarget_result ctx_results.hdf5 \
                 --temp_dir /data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated_scplus/tmp/  \
                --species danio_rerio \
                 --fr_overlap_w_ctx_db 0.4 \
                 --auc_threshold 0.005 \
                 --nes_threshold 3.0 \
                 --rank_threshold 0.05 \
                 --path_to_motif_annotations /projects/p_scads_spinal_cord/motifs_no_cb.tbl \
                 --annotation_version v10nr_clust \
                 --motif_similarity_fdr 0.001 \
                 --orthologous_identity_threshold 0.0 \
                 --annotations_to_use Direct_annot Orthology_annot \
                 --write_html \
                 --output_fname_cistarget_html ctx_results.html \
                 --n_cpu 64

So, in a Python environment, run the following:


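import logging
import os
import pathlib
import pickle
import shutil
import sys
from typing import Callable, Dict, Iterator, List, Literal, Optional, Tuple, Union

import joblib
import mudata
import pandas as pd
import pyranges as pr
from importlib_resources import files
from pycistarget.motif_enrichment_cistarget import cisTarget
from pycistarget.motif_enrichment_dem import DEM

from scenicplus.grn_builder.modules import eRegulon
# Private helper that is run for each region set below. It is assumed here to be
# importable from scenicplus.cli.commands, the module shown in the traceback.
from scenicplus.cli.commands import _run_cistarget_single_region_set

# The code below calls log.info(), so a logger has to be defined first.
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
log = logging.getLogger("SCENIC+")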

# variables
region_set_folder = "/data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated/region_sets"
cistarget_db_fname = "/data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated/scATAC_1000_INs_annotated.regions_vs_motifs.rankings.feather"
output_fname_cistarget_result = "ctx_results.hdf5"
n_cpu = 64
fraction_overlap_w_cistarget_database = 0.4
auc_threshold = 0.005
nes_threshold = 3.0
rank_threshold = 0.05
path_to_motif_annotations = "/projects/p_scads_spinal_cord/motifs_no_cb.tbl"
annotation_version = "v10nr_clust"
motif_similarity_fdr = 0.001
orthologous_identity_threshold = 0.0
temp_dir = "/data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated_scplus/tmp/"
species = "danio_rerio"
annotations_to_use = ["Direct_annot", "Orthology_annot"]
write_html =  True
output_fname_cistarget_html = "ctx_results.html"

# Run motif enrichment

region_set_dict: Dict[str, pr.PyRanges] = {}
log.info(f"Reading region sets from: {region_set_folder}")
for region_set_subfolder in os.listdir(region_set_folder):
    if os.path.isdir(os.path.join(region_set_folder, region_set_subfolder)):
        log.info(f"Reading all .bed files in: {region_set_subfolder}")
        if any(
            f.endswith(".bed")
                for f in
                os.listdir(
                    os.path.join(
                        region_set_folder, region_set_subfolder
                    )
                )
            ):
            for f in os.listdir(os.path.join(region_set_folder, region_set_subfolder)):
                if f.endswith(".bed"):
                    key_name = region_set_subfolder + "_" + f.replace(".bed", "")
                    if key_name in region_set_dict:
                        raise ValueError(
                            f"non unique folder/file combination: {key_name}"
                        )
                    region_set_dict[key_name] = pr.read_bed(
                            os.path.join(
                                    region_set_folder, region_set_subfolder, f
                            ),
                            as_df=False
                    )

cistarget_results: List[cisTarget] = joblib.Parallel(
    n_jobs=n_cpu,
    temp_folder=temp_dir
)(
    joblib.delayed(
        _run_cistarget_single_region_set
    )(
        name = key,
        region_set=region_set_dict[key],
        cistarget_db_fname=cistarget_db_fname,
        fraction_overlap_w_cistarget_database=fraction_overlap_w_cistarget_database,
        species=species,
        auc_threshold=auc_threshold,
        nes_threshold=nes_threshold,
        rank_threshold=rank_threshold,
        path_to_motif_annotations=path_to_motif_annotations,
        annotation_version=annotation_version,
        annotations_to_use=annotations_to_use,
        motif_similarity_fdr=motif_similarity_fdr,
        orthologous_identity_threshold=orthologous_identity_threshold
    )
    for key in region_set_dict
)
# Write results to file
if write_html:
    log.info(f"Writing html to: {output_fname_cistarget_html}")
    all_motif_enrichment_df = pd.concat(
        ctx_result.motif_enrichment for ctx_result in cistarget_results
    )
    all_motif_enrichment_df.to_html(
        buf = output_fname_cistarget_html,
        escape = False,
        col_space = 80
    )

# This step produces your error! I modified the code so it will save the object as a pickle if an error occurs
log.info(f"Writing output to: {output_fname_cistarget_result}")
for i, cistarget_result in enumerate(cistarget_results):
    if len(cistarget_result.motif_enrichment) > 0:
        try:
            cistarget_result.write_hdf5(
                path = output_fname_cistarget_result,
                mode = "a"
            )
        except Exception as e:
            # Keep the failing result for inspection; pickle needs a binary file handle.
            log.info(f"Writing result {i} failed: {e}")
            with open(os.path.join(temp_dir, f"{i}.pickle"), "wb") as f:
                pickle.dump(cistarget_result, f)

Could you share one of those pickle files with me?

Best,

Seppe

DmitriiSeverinov commented 6 months ago

Hi @SeppeDeWinter ,

Thanks for providing the code to troubleshoot my error. I managed to find what caused it, and now scenicplus grn_inference motif_enrichment_cistarget works! I had a cell type with a dash in its name, and for some reason the dash was encoded as the character \u2013 (an en dash rather than a plain hyphen), which apparently matters a great deal. I replaced the dash with "_" and now it works... I do not know at which step a normal dash got encoded like this, but from now on I will use only underscores :)
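A minimal sketch of how such labels could be caught before running the pipeline, assuming the cell type names are available as a list of Python strings. The labels and both helper functions below are hypothetical and not part of SCENIC+; only the standard library is used.

```python
import unicodedata


def find_non_ascii(names):
    """Return (name, index, character, Unicode name) for every non-ASCII character."""
    hits = []
    for name in names:
        for index, char in enumerate(name):
            if ord(char) > 127:
                hits.append((name, index, char, unicodedata.name(char, "UNKNOWN")))
    return hits


def to_underscores(name):
    """Replace every non-ASCII character and the ASCII hyphen with an underscore."""
    return "".join("_" if (ord(c) > 127 or c == "-") else c for c in name)


# Hypothetical labels: the second one contains U+2013 (en dash), the third a plain hyphen.
cell_types = ["neurons", "V2a\u2013IN", "glia-like"]

for name, index, char, uni_name in find_non_ascii(cell_types):
    print(f"{name!r}: {uni_name} (U+{ord(char):04X}) at index {index}")

print([to_underscores(n) for n in cell_types])
```

Word processors and spreadsheet programs often replace typed hyphens with en dashes automatically, which may be one way such a character ends up in a label.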

Best, Dmitrii

SeppeDeWinter commented 6 months ago

Hi Dmitrii

That's great!

Good luck with the analysis.

Best,

S