Closed DmitriiSeverinov closed 6 months ago
Hi @DmitriiSeverinov
This error I have never seen before ... Would you be able to step through the code that the command is running manually?
This command is causing your issue
scenicplus grn_inference motif_enrichment_cistarget \
--region_set_folder /data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated/region_sets \
--cistarget_db_fname /data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated/scATAC_1000_INs_annotated.regions_vs_motifs.rankings.feather \
--output_fname_cistarget_result ctx_results.hdf5 \
--temp_dir /data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated_scplus/tmp/ \
--species danio_rerio \
--fr_overlap_w_ctx_db 0.4 \
--auc_threshold 0.005 \
--nes_threshold 3.0 \
--rank_threshold 0.05 \
--path_to_motif_annotations /projects/p_scads_spinal_cord/motifs_no_cb.tbl \
--annotation_version v10nr_clust \
--motif_similarity_fdr 0.001 \
--orthologous_identity_threshold 0.0 \
--annotations_to_use Direct_annot Orthology_annot \
--write_html \
--output_fname_cistarget_html ctx_results.html \
--n_cpu 64
So in a python environment run the following
import logging
import os
import pathlib
import pickle
import shutil
import sys
from typing import Callable, Dict, Iterator, List, Literal, Optional, Tuple, Union
import joblib
import mudata
import pandas as pd
import pyranges as pr
from importlib_resources import files
from pycistarget.motif_enrichment_cistarget import cisTarget
from pycistarget.motif_enrichment_dem import DEM
from scenicplus.grn_builder.modules import eRegulon
# variables
region_set_folder = "data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated/region_sets"
cistarget_db_fname = "/data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated/scATAC_1000_INs_annotated.regions_vs_motifs.rankings.feather"
output_fname_cistarget_result = "ctx_results.hdf5"
n_cpu = 64
fraction_overlap_w_cistarget_database = 0.4
auc_threshold = 0.005
nes_threshold = 3.0
rank_threshold = 0.05
path_to_motif_annotations = "/projects/p_scads_spinal_cord/motifs_no_cb.tbl"
annotation_version = "v10nr_clust"
motif_similarity_fdr = 0.001
orthologous_identity_threshold = 0.0
temp_dir = "/data/horse/ws/dmse952c-zebrafish_multiome/results/test_multiome/scATAC_1000_INs_annotated_scplus/tmp/"
species = "danio_rerio"
annotations_to_use = ["Direct_annot", "Orthology_annot"]
write_html = True
output_fname_cistarget_html = "ctx_results.html"
# Run motif enrichment
region_set_dict: Dict[str, pr.PyRanges] = {}
log.info(f"Reading region sets from: {region_set_folder}")
for region_set_subfolder in os.listdir(region_set_folder):
if os.path.isdir(os.path.join(region_set_folder, region_set_subfolder)):
log.info(f"Reading all .bed files in: {region_set_subfolder}")
if any(
f.endswith(".bed")
for f in
os.listdir(
os.path.join(
region_set_folder, region_set_subfolder
)
)
):
for f in os.listdir(os.path.join(region_set_folder, region_set_subfolder)):
if f.endswith(".bed"):
key_name = region_set_subfolder + "_" + f.replace(".bed", "")
if key_name in region_set_dict:
raise ValueError(
f"non unique folder/file combination: {key_name}"
)
region_set_dict[key_name] = pr.read_bed(
os.path.join(
region_set_folder, region_set_subfolder, f
),
as_df=False
)
cistarget_results: List[cisTarget] = joblib.Parallel(
n_jobs=n_cpu,
temp_folder=temp_dir
)(
joblib.delayed(
_run_cistarget_single_region_set
)(
name = key,
region_set=region_set_dict[key],
cistarget_db_fname=cistarget_db_fname,
fraction_overlap_w_cistarget_database=fraction_overlap_w_cistarget_database,
species=species,
auc_threshold=auc_threshold,
nes_threshold=nes_threshold,
rank_threshold=rank_threshold,
path_to_motif_annotations=path_to_motif_annotations,
annotation_version=annotation_version,
annotations_to_use=annotations_to_use,
motif_similarity_fdr=motif_similarity_fdr,
orthologous_identity_threshold=orthologous_identity_threshold
)
for key in region_set_dict
)
# Write results to file
if write_html:
log.info(f"Writing html to: {output_fname_cistarget_html}")
all_motif_enrichment_df = pd.concat(
ctx_result.motif_enrichment for ctx_result in cistarget_results
)
all_motif_enrichment_df.to_html(
buf = output_fname_cistarget_html,
escape = False,
col_space = 80
)
import pickle
import os
# This step produces your error! I modified the code so it will save the object as a pickle if an error occurs
log.info(f"Writing output to: {output_fname_cistarget_result}")
for i, cistarget_result in enumerate(cistarget_results):
if len(cistarget_result.motif_enrichment) > 0:
try:
cistarget_result.write_hdf5(
path = output_fname_cistarget_result,
mode = "a"
)
except Exception as e:
with open(os.path.join(temp_dir, f"{i}.pickle")) as f:
pickle.dump(cistarget_result, f)
Could you share one of those pickle files with me?
Best,
Seppe
Hi @SeppeDeWinter ,
Thanks for providing the code to troubleshoot my error. Now I managed to find what caused it and now scenicplus grn_inference motif_enrichment_cistarget
works!
So, I had a cell type that had a dash (and by some reasons it got encoded as a character with the code \u2013) in its name and, apparently, it plays a huge role. So, I replaced the dash with "_" it and now it works...
I do not know at which step a normal dash got encoded like this, but from now on I will use only underscores :)
Best, Dmitrii
Hi Dmitrii
That's great!
Good luck with the analysis.
Best,
S
Hi all,
I have the encoding error and I have no idea, where it is coming from, as I should not have any unusual symbols.
And if check my locale, seems like everything is in UTF encoding
I have also tried to export the following variables to the environment:
Didn't help :(
Best, Dmitrii
Version (please complete the following information):