aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.

multiple unhandled error at run_scenicplus() #198

Closed pchiang5 closed 1 year ago

pchiang5 commented 1 year ago

Describe the bug

Hello,

The errors below occurred with the following input. They consistently showed up whether I used n_cpu of 1, 6, or 10.

To Reproduce

import os
import dill
from scenicplus.wrappers.run_scenicplus import run_scenicplus
try:
    run_scenicplus(
        scplus_obj = scplus_obj,
        variable = ['GEX_celltype'],
        species = 'hsapiens',
        assembly = 'hg38',
        tf_file = '/mnt/c/Users/pc/Downloads/utoronto_human_tfs_v_1.01.txt',
        save_path = os.path.join(work_dir, 'scenicplus'),
        biomart_host = biomart_host,
        upstream = [1000, 150000],
        downstream = [1000, 150000],
        calculate_TF_eGRN_correlation = True,
        calculate_DEGs_DARs = True,
        export_to_loom_file = True,
        export_to_UCSC_file = True,
        path_bedToBigBed = '/mnt/c/Users/pc/Downloads',
        n_cpu = 1,  # 20 might be too high and overflow the RAM
        _temp_dir = None)
except Exception:
    # In case of failure, still save the object, then re-raise with the original traceback
    with open(os.path.join(work_dir, 'scenicplus/scplus_obj.pkl'), 'wb') as f:
        dill.dump(scplus_obj, f, protocol=-1)
    raise

Error output

2023-08-10 16:24:55,050 SCENIC+_wrapper INFO /mnt/c/Users/pc/Downloads/scenicplus folder already exists.
2023-08-10 16:24:55,050 SCENIC+_wrapper INFO Merging cistromes
2023-08-10 16:25:42,185 SCENIC+_wrapper INFO Getting search space
2023-08-10 16:25:44,365 R2G INFO Downloading gene annotation from biomart dataset: hsapiens_gene_ensembl
2023-08-10 16:26:08,179 R2G INFO Downloading chromosome sizes from: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
2023-08-10 16:26:09,771 R2G INFO Extending promoter annotation to 10 bp upstream and 10 downstream
Warning! Start and End columns now have different dtypes: int32 and int64
Warning! Start and End columns now have different dtypes: int32 and int64
2023-08-10 16:26:13,288 R2G INFO Extending search space to:
150000 bp downstream of the end of the gene.
150000 bp upstream of the start of the gene.
Warning! Start and End columns now have different dtypes: int32 and int64
Warning! Start and End columns now have different dtypes: int32 and int64
2023-08-10 16:26:35,081 R2G INFO Intersecting with regions.
join: Strand data from other will be added as strand data to self.
If this is undesired use the flag apply_strand_suffix=False.
To turn off the warning set apply_strand_suffix to True or False.
Warning! Start and End columns now have different dtypes: int32 and int64
2023-08-10 16:26:36,028 R2G INFO Calculating distances from region to gene
2023-08-10 16:27:02,965 R2G INFO Imploding multiple entries per region and gene
2023-08-10 16:30:15,883 R2G INFO Done!
2023-08-10 16:30:16,153 SCENIC+_wrapper INFO Inferring region to gene relationships
2023-08-10 16:30:16,265 R2G INFO Calculating region to gene importances, using GBM method
2023-08-10 16:30:20,135 INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
initializing: 100%|████████████████████████████████████████████████████████████████████████████| 16456/16456 [14:04<00:00, 19.49it/s]
Running using 1 cores: 7%|████▌ | 1100/16456 [02:56<41:01, 6.24it/s]
ray::_score_regions_to_single_gene_ray() (pid=3808, ip=172.31.110.212)
  File "/mnt/c/Users/pc/Downloads/scenicplus/src/scenicplus/enhancer_to_gene.py", line 452, in _score_regions_to_single_gene_ray
    return _score_regions_to_single_gene(X, y, gene_name, region_names, regressor_type, regressor_kwargs)
  File "/mnt/c/Users/pc/Downloads/scenicplus/src/scenicplus/enhancer_to_gene.py", line 469, in _score_regions_to_single_gene
    fitted_model = arboreto_core.fit_model(regressor_type=regressor_type,
  File "/home/pc/miniconda3/envs/scenicplus/lib/python3.8/site-packages/arboreto/core.py", line 143, in fit_model
    return do_sklearn_regression()
  File "/home/pc/miniconda3/envs/scenicplus/lib/python3.8/site-packages/arboreto/core.py", line 138, in do_sklearn_regression
    regressor.fit(tf_matrix, target_gene_expression)
  File "/home/pc/miniconda3/envs/scenicplus/lib/python3.8/site-packages/sklearn/base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/pc/miniconda3/envs/scenicplus/lib/python3.8/site-packages/sklearn/ensemble/_gb.py", line 424, in fit
    y = column_or_1d(y, warn=True)
  File "/home/pc/miniconda3/envs/scenicplus/lib/python3.8/site-packages/sklearn/utils/validation.py", line 1245, in column_or_1d
    raise ValueError(
ValueError: y should be a 1d array, got an array of shape (4164, 2) instead.
2023-08-10 16:47:26,014 R2G INFO Took 1029.7491767406464 seconds
2023-08-10 16:47:26,015 R2G INFO Calculating region to gene correlation, using SR method
2023-08-10 16:47:30,151 INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
initializing: 4%|███ | 645/16456 [00:34<14:57, 17.62it/s]
(_score_regions_to_single_gene_ray pid=8620) /home/pc/miniconda3/envs/scenicplus/lib/python3.8/site-packages/scipy/stats/_stats_py.py:4916: ConstantInputWarning: An input array is constant; the correlation coefficient is not defined.
(_score_regions_to_single_gene_ray pid=8620) warnings.warn(stats.ConstantInputWarning(warn_msg))
initializing: 7%|█████▍ | 1155/16456 [01:04<15:42, 16.23it/s]
(_score_regions_to_single_gene_ray pid=8620) /home/pc/miniconda3/envs/scenicplus/lib/python3.8/site-packages/scipy/stats/_stats_py.py:4916: ConstantInputWarning: An input array is constant; the correlation coefficient is not defined.
(_score_regions_to_single_gene_ray pid=8620) warnings.warn(stats.ConstantInputWarning(warn_msg))
initializing: 7%|█████▋ | 1228/16456 [01:09<15:10, 16.72it/s]
2023-08-10 16:48:40,267 ERROR worker.py:405 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::_score_regions_to_single_gene_ray() (pid=8620, ip=172.31.110.212)
  File "/mnt/c/Users/pc/Downloads/scenicplus/src/scenicplus/enhancer_to_gene.py", line 452, in _score_regions_to_single_gene_ray
    return _score_regions_to_single_gene(X, y, gene_name, region_names, regressor_type, regressor_kwargs)
  File "/mnt/c/Users/pc/Downloads/scenicplus/src/scenicplus/enhancer_to_gene.py", line 488, in _score_regions_to_single_gene
    return pd.Series(correlation_coef, index=region_names), gene_name
  File "/home/pc/miniconda3/envs/scenicplus/lib/python3.8/site-packages/pandas/core/series.py", line 509, in __init__
    data = sanitize_array(data, index, dtype, copy)
  File "/home/pc/miniconda3/envs/scenicplus/lib/python3.8/site-packages/pandas/core/construction.py", line 607, in sanitize_array
    subarr = _sanitize_ndim(subarr, data, dtype, index, allow_2d=allow_2d)
  File "/home/pc/miniconda3/envs/scenicplus/lib/python3.8/site-packages/pandas/core/construction.py", line 666, in _sanitize_ndim
    raise ValueError(
ValueError: Data must be 1-dimensional, got ndarray of shape (15, 3, 3) instead
initializing: 11%|████████▏ | 1753/16456 [01:36<12:03, 20.33it/s]
(_score_regions_to_single_gene_ray pid=8620) /home/pc/miniconda3/envs/scenicplus/lib/python3.8/site-packages/scipy/stats/_stats_py.py:4916: ConstantInputWarning: An input array is constant; the correlation coefficient is not defined.
(_score_regions_to_single_gene_ray pid=8620) warnings.warn(stats.ConstantInputWarning(warn_msg))
initializing: 15%|███████████▊ | 2521/16456 [02:16<11:45, 19.76it/s]
(_score_regions_to_single_gene_ray pid=8620) /home/pc/miniconda3/envs/scenicplus/lib/python3.8/site-packages/scipy/stats/_stats_py.py:4916: ConstantInputWarning: An input array is constant; the correlation coefficient is not defined.
(_score_regions_to_single_gene_ray pid=8620) warnings.warn(stats.ConstantInputWarning(warn_msg))
initializing: 22%|████████████████▌ | 3545/16456 [03:12<13:12, 16.28it/s]

Expected behavior

The run should finish without any errors or warning messages.


pchiang5 commented 1 year ago

It is likely due to duplicated var_names in the original adata. With adata.var_names_make_unique(), it now runs fine, at least up to the following stage.

2023-08-10 17:25:04,295 SCENIC+_wrapper INFO /mnt/c/Users/pc/Downloads/scenicplus folder already exists.
2023-08-10 17:25:04,295 SCENIC+_wrapper INFO Inferring region to gene relationships
2023-08-10 17:25:04,707 R2G INFO Calculating region to gene importances, using GBM method
2023-08-10 17:25:09,264 INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
initializing: 43%|████████████████████████████████▊ | 7010/16456 [04:55<06:04, 25.88it/s]
initializing: 100%|████████████████████████████████████████████████████████████████████████████| 16456/16456 [12:17<00:00, 22.30it/s]
Running using 1 cores: 11%|███████▎ | 1836/16456 [15:38<4:07:30, 1.02s/it]
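
For anyone hitting the same ValueError: the sketch below shows, with plain pandas, how a duplicated gene name can turn a per-gene expression vector into a 2-D array, which is consistent with the (4164, 2) shape in the traceback above. It is only an illustration under assumptions, not the SCENIC+ code path: the toy matrix, the gene names, and the make_unique helper are all hypothetical. In practice adata.var_names_make_unique() is what fixed it here.

```python
import numpy as np
import pandas as pd

# Toy expression matrix (hypothetical data): 4 cells x 3 genes,
# where the gene name "GENE_A" appears twice.
expr = pd.DataFrame(
    np.arange(12, dtype=float).reshape(4, 3),
    columns=["GENE_A", "GENE_B", "GENE_A"],
)

# Pre-flight check: list the names that occur more than once.
duplicated_names = expr.columns[expr.columns.duplicated()].tolist()

# Selecting a duplicated label returns a DataFrame (2-D), not a Series (1-D),
# so the "target vector" handed to a regressor would have two columns.
y = expr["GENE_A"].to_numpy()


def make_unique(names, sep="-"):
    """Hypothetical helper: suffix repeated names with -1, -2, ...

    Simplified on purpose; it does not guard against a suffixed name
    colliding with a name that already exists.
    """
    seen = {}
    unique = []
    for name in names:
        count = seen.get(name, 0)
        unique.append(name if count == 0 else f"{name}{sep}{count}")
        seen[name] = count + 1
    return unique


# After de-duplicating the names, the same selection is 1-D again.
expr.columns = make_unique(expr.columns)
y_fixed = expr["GENE_A"].to_numpy()

print(duplicated_names, y.shape, y_fixed.shape)
```

Running a check like duplicated_names on adata.var_names before building scplus_obj would catch the problem before the long GBM step starts.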