aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

OverflowError: cannot serialize a bytes object larger than 4 GiB [Error] #502

Open liliyuan001 opened 1 year ago

liliyuan001 commented 1 year ago

Describe the bug

Hi all, I keep getting the error 'OverflowError: cannot serialize a bytes object larger than 4 GiB' when I run:

    arboreto_with_multiprocessing.py \
        sample.loom \
        $tfs \
        --method grnboost2 \
        --output adj.sample.tsv \
        --num_workers 40 \
        --seed 777

Do you have any suggestions for how to fix this error?

Note that most errors are due to the input from the user, and therefore should be treated as questions in the Discussions. Please only report them as bugs if you are quite certain that the software is not behaving as expected.

Steps to reproduce the behavior

  1. Command run when the error occurred:

    conda activate pyscenic
    cd /home/data/t220416/Melanoma/3_pyscenic/result_AM

    cat >change.py
    import os,sys
    os.getcwd()
    os.listdir(os.getcwd())
    import loompy as lp;
    import numpy as np;
    import scanpy as sc;
    x=sc.read_csv("for.scenic.data.csv");
    row_attrs = {"Gene": np.array(x.var_names),};
    col_attrs = {"CellID": np.array(x.obs_names)};
    lp.create("sample.loom",x.X.transpose(),row_attrs,col_attrs);

    python change.py

    cat >scenic.bash

    dir=/home/data/t220416/Melanoma/3_pyscenic/0_data/index_genome/cisTarget_databases/hg38

    tfs=$dir/hs_hgnc_tfs.txt
    feather=$dir/hg38_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather
    tbl=$dir/motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl

    input_loom=./sample.loom
    ls $tfs $feather $tbl

    arboreto_with_multiprocessing.py \
        sample.loom \
        $tfs \
        --method grnboost2 \
        --output adj.sample.tsv \
        --num_workers 40 \
        --seed 777

    pyscenic ctx \
        adj.sample.tsv $feather \
        --annotations_fname $tbl \
        --expression_mtx_fname $input_loom \
        --mode "dask_multiprocessing" \
        --output reg.csv \
        --num_workers 20 \
        --mask_dropouts

    pyscenic aucell \
        $input_loom \
        reg.csv \
        --output out_SCENIC.loom \
        --num_workers 16

    nohup bash scenic.bash 1>pySCENIC.log 2>&1 &

  2. Error encountered:

    nohup: ignoring input
    /home/data/t220416/Melanoma/3_pyscenic/0_data/index_genome/cisTarget_databases/hg38/hg38_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather
    /home/data/t220416/Melanoma/3_pyscenic/0_data/index_genome/cisTarget_databases/hg38/hs_hgnc_tfs.txt
    /home/data/t220416/Melanoma/3_pyscenic/0_data/index_genome/cisTarget_databases/hg38/motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl
    Loaded expression matrix of 54196 cells and 41361 genes in 48.110148906707764 seconds...
    Loaded 1839 TFs...
    starting grnboost2 using 40 processes...
    
    0%|          | 0/41361 [00:00<?, ?it/s]
    0%|          | 0/41361 [00:00<?, ?it/s]
    Traceback (most recent call last):
    File "/home/data/t220416/miniconda3/envs/pyscenic/bin/arboreto_with_multiprocessing.py", line 198, in <module>
    main()
    File "/home/data/t220416/miniconda3/envs/pyscenic/bin/arboreto_with_multiprocessing.py", line 184, in main
    total=len(gene_names),
    File "/home/data/t220416/miniconda3/envs/pyscenic/lib/python3.7/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
    File "/home/data/t220416/miniconda3/envs/pyscenic/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
    File "/home/data/t220416/miniconda3/envs/pyscenic/lib/python3.7/multiprocessing/pool.py", line 431, in _handle_tasks
    put(task)
    File "/home/data/t220416/miniconda3/envs/pyscenic/lib/python3.7/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
    File "/home/data/t220416/miniconda3/envs/pyscenic/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
    OverflowError: cannot serialize a bytes object larger than 4 GiB
    usage: pyscenic ctx [-h] [-o OUTPUT] [-n] [--chunk_size CHUNK_SIZE]
                    [--mode {custom_multiprocessing,dask_multiprocessing,dask_cluster}]
                    [-a] [-t] [--rank_threshold RANK_THRESHOLD]
                    [--auc_threshold AUC_THRESHOLD]
                    [--nes_threshold NES_THRESHOLD]
                    [--min_orthologous_identity MIN_ORTHOLOGOUS_IDENTITY]
                    [--max_similarity_fdr MAX_SIMILARITY_FDR]
                    --annotations_fname ANNOTATIONS_FNAME
                    [--num_workers NUM_WORKERS]
                    [--client_or_address CLIENT_OR_ADDRESS]
                    [--thresholds THRESHOLDS [THRESHOLDS ...]]
                    [--top_n_targets TOP_N_TARGETS [TOP_N_TARGETS ...]]
                    [--top_n_regulators TOP_N_REGULATORS [TOP_N_REGULATORS ...]]
                    [--min_genes MIN_GENES]
                    [--expression_mtx_fname EXPRESSION_MTX_FNAME]
                    [--mask_dropouts] [--cell_id_attribute CELL_ID_ATTRIBUTE]
                    [--gene_attribute GENE_ATTRIBUTE] [--sparse]
                    module_fname database_fname [database_fname ...]
    pyscenic ctx: error: argument module_fname: can't open 'adj.sample.tsv': [Errno 2] No such file or directory: 'adj.sample.tsv'
    usage: pyscenic aucell [-h] [-o OUTPUT] [-t] [-w] [--num_workers NUM_WORKERS]
                       [--seed SEED] [--rank_threshold RANK_THRESHOLD]
                       [--auc_threshold AUC_THRESHOLD]
                       [--nes_threshold NES_THRESHOLD]
                       [--cell_id_attribute CELL_ID_ATTRIBUTE]
                       [--gene_attribute GENE_ATTRIBUTE] [--sparse]
                       expression_mtx_fname signatures_fname
    pyscenic aucell: error: argument signatures_fname: can't open 'reg.csv': [Errno 2] No such file or directory: 'reg.csv'
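
For context on the size limit in the traceback, the following is a rough sketch, not a confirmed diagnosis. It estimates how large the loaded expression matrix (54196 cells by 41361 genes, per the log) would be as a dense array and compares that with the 4 GiB ceiling named in the error. It assumes the object being pickled for each worker task includes the full dense matrix and that the values are stored as 4-byte or 8-byte floats; neither is confirmed by the log. The helper `dense_nbytes` is a hypothetical name introduced only for this illustration.

```python
import pickle
import sys

# Dimensions taken from the log line
# "Loaded expression matrix of 54196 cells and 41361 genes".
N_CELLS, N_GENES = 54196, 41361

# Ceiling named in the error message: pickle protocols below 4
# cannot serialize a single bytes object larger than 4 GiB.
FOUR_GIB = 4 * 1024**3

# Assumed element sizes in bytes (the actual dtype is not shown in the log).
ITEMSIZE = {"float32": 4, "float64": 8}


def dense_nbytes(rows: int, cols: int, itemsize: int) -> int:
    """Return the number of bytes of a dense rows x cols array (hypothetical helper)."""
    return rows * cols * itemsize


for name, itemsize in ITEMSIZE.items():
    size = dense_nbytes(N_CELLS, N_GENES, itemsize)
    print(f"{name}: {size / 1024**3:.2f} GiB, exceeds 4 GiB: {size > FOUR_GIB}")

# The default pickle protocol is 3 on Python 3.7 (the version in the traceback)
# and 4 from Python 3.8 onward; protocol 4 added support for very large objects.
print("Python", sys.version_info[:2], "default pickle protocol:", pickle.DEFAULT_PROTOCOL)
```

Under these assumptions the dense matrix is well above 4 GiB for either element size, which would be consistent with the `OverflowError` raised in `multiprocessing/reduction.py` on Python 3.7. Reducing the number of genes or cells before creating the loom file, or running in an environment whose default pickle protocol is 4 or higher, are possible directions to test; whether either resolves this case is not established here.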

Expected behavior

A clear and concise description of what you expected to happen.

Please complete the following information: