HCA-integration / sc-atlasing-toolbox

Pipelines for ensuring the quality of HCA integrated atlases
MIT License

Potential bug in run_example.sh #201

Closed · Feilijiang closed this issue 3 weeks ago

Feilijiang commented 1 month ago

Hi, thank you for the amazing tool.

I installed sc-atlasing-toolbox and tried to run the provided run_example.sh to test the environment, using the default example data. However, I ran into the error below. Could you please help me with it? Thank you so much.

(snakemake) lf15@node-11-10-1:/nfs/team205/lf15/Package/hca_integration_toolbox/hca_integration_toolbox$ bash run_example.sh preprocessing_all integration_all metrics_all -c 5 > out.log
++ realpath ../hca_integration_toolbox
+ pipeline=/nfs/team205/lf15/Package/hca_integration_toolbox/hca_integration_toolbox
+ snakemake --profile .profiles/local --configfile configs/example_config.yaml --snakefile /nfs/team205/lf15/Package/hca_integration_toolbox/hca_integration_toolbox/workflow/Snakefile --use-conda --rerun-incomplete --keep-going --printshellcmds preprocessing_all integration_all metrics_all -c 5
Using profile .profiles/local for setting default command line arguments.
Config file configs/outputs.yaml is extended by additional config specified via the command line.
Config file configs/load_data/config.yaml is extended by additional config specified via the command line.
Config file configs/exploration/config.yaml is extended by additional config specified via the command line.
WARNING:ModuleConfig: No default target specified for module "filter", using default:
"data/out/filter/dataset~{dataset}/file_id~{file_id}.zarr..."
WARNING:ModuleConfig: No default target specified for module "exploration", using default:
"data/out/exploration/dataset~{dataset}/file_id~{file_id}.zarr..."
WARNING:ModuleConfig: No default target specified for module "batch_analysis", using default:
"data/out/batch_analysis/dataset~{dataset}/file_id~{file_id}.zarr..."
WARNING:ModuleConfig: No default target specified for module "doublets", using default:
"data/out/doublets/dataset~{dataset}/file_id~{file_id}.zarr..."
WARNING:ModuleConfig: No default target specified for module "qc", using default:
"data/out/qc/dataset~{dataset}/file_id~{file_id}.zarr..."
WARNING:ModuleConfig: No default target specified for module "label_transfer", using default:
"data/out/label_transfer/dataset~{dataset}/file_id~{file_id}.zarr..."
WARNING:ModuleConfig: No default target specified for module "label_harmonization", using default:
"data/out/label_harmonization/dataset~{dataset}/file_id~{file_id}.zarr..."
WARNING:ModuleConfig: No default target specified for module "integration", using default:
"data/out/integration/dataset~{dataset}/file_id~{file_id}/method~{method}/batch~{batch}/label~{label}/var_mask~{var_mask}/output_type~{output_type}.zarr..."
WARNING:ModuleConfig: No default target specified for module "marker_genes", using default:
"data/out/marker_genes/dataset~{dataset}/file_id~{file_id}.zarr..."
Building DAG of jobs...
Using shell: /usr/local/bin/bash
Provided cores: 5
Rules claiming more threads will be scaled down.
Singularity containers: ignored
Job stats:
job                                  count
---------------------------------  -------
integration_all                          1
integration_barplot_per_dataset          3
integration_benchmark_per_dataset        1
integration_compute_umap                 6
integration_plot_umap                    6
integration_postprocess                  6
integration_prepare                      1
integration_run_method                   3
metrics_all                              1
metrics_barplot                          3
metrics_barplot_per_dataset              3
metrics_barplot_per_file                18
metrics_cluster                         60
metrics_cluster_collect                  6
metrics_funkyheatmap                     1
metrics_funkyheatmap_per_batch           1
metrics_funkyheatmap_per_dataset         1
metrics_funkyheatmap_per_label           1
metrics_merge_metrics                    1
metrics_merge_metrics_per_batch          1
metrics_merge_metrics_per_dataset        1
metrics_merge_metrics_per_file           6
metrics_merge_metrics_per_label          1
metrics_prepare                          6
metrics_run                             12
preprocessing_all                        1
preprocessing_assemble                   1
preprocessing_neighbors                  1
preprocessing_pca                        1
preprocessing_plot_pca                   1
preprocessing_plot_umap                  1
preprocessing_umap                       1
total                                  157

Select jobs to execute...

[Tue Nov  5 23:52:49 2024]
rule preprocessing_pca:
    input: data/out/preprocessing/dataset~my_dataset/file_id~file_1/highly_variable_genes.zarr
    output: data/out/preprocessing/dataset~my_dataset/file_id~file_1/pca.zarr
    jobid: 9
    reason: Missing output files: data/out/preprocessing/dataset~my_dataset/file_id~file_1/pca.zarr
    wildcards: dataset=my_dataset, file_id=file_1
    resources: tmpdir=/tmp, partition=, qos=, gpu=, mem_mb=

python -c "from __future__ import print_function; import sys, json; print(json.dumps([sys.version_info.major, sys.version_info.minor]))"
Activating conda environment: rapids_singlecell
python /nfs/team205/lf15/Package/hca_integration_toolbox/hca_integration_toolbox/.snakemake/scripts/tmpp7t9hpbt.pca.py
Activating conda environment: rapids_singlecell
INFO:root:No GPU found...
INFO:root:Importing rapids failed, using scanpy...
INFO:root:Read "data/out/preprocessing/dataset~my_dataset/file_id~file_1/highly_variable_genes.zarr"...
INFO:root:AnnData object with n_obs × n_vars = 700 × 765
    obs: 'G2M_score', 'S_score', 'batch', 'batch_2', 'bulk_labels', 'is_cd14_mono', 'louvain', 'n_counts', 'n_genes', 'na_column', 'na_str_column', 'percent_mito', 'phase'
    var: 'dispersions', 'dispersions_norm', 'highly_variable', 'highly_variable_2', 'means', 'n_counts'
    uns: 'bulk_labels_colors', 'hvg', 'log1p', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'preprocessing', 'rank_genes_groups', 'umap'
INFO:root:Subset to highly variable genes...
/nfs/team205/lf15/Package/hca_integration_toolbox/hca_integration_toolbox/workflow/preprocessing/rules/../scripts/utils/accessors.py:83: UserWarning: All genes are highly variable, not subsetting
  warnings.warn('All genes are highly variable, not subsetting')
Traceback (most recent call last):
  File "/nfs/team205/lf15/Package/hca_integration_toolbox/hca_integration_toolbox/.snakemake/scripts/tmpp7t9hpbt.pca.py", line 66, in <module>
    adata, _ = subset_hvg(adata, var_column=hvg_key)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nfs/team205/lf15/Package/hca_integration_toolbox/hca_integration_toolbox/workflow/preprocessing/rules/../scripts/utils/accessors.py", line 96, in subset_hvg
    adata = dask_compute(adata, layers=to_memory)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nfs/team205/lf15/Package/hca_integration_toolbox/hca_integration_toolbox/workflow/preprocessing/rules/../scripts/utils/misc.py", line 188, in dask_compute
    return apply_layers(
           ^^^^^^^^^^^^^
  File "/nfs/team205/lf15/Package/hca_integration_toolbox/hca_integration_toolbox/workflow/preprocessing/rules/../scripts/utils/misc.py", line 208, in apply_layers
    adata.X = func(adata.X, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/nfs/team205/lf15/Package/hca_integration_toolbox/hca_integration_toolbox/workflow/preprocessing/rules/../scripts/utils/misc.py", line 190, in <lambda>
    func=lambda x: x.compute() if isinstance(x, da.Array) else x,
                   ^^^^^^^^^^^
  File "/nfs/team205/lf15/miniforge3/envs/rapids_singlecell/lib/python3.11/site-packages/dask/base.py", line 372, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nfs/team205/lf15/miniforge3/envs/rapids_singlecell/lib/python3.11/site-packages/dask/base.py", line 660, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nfs/team205/lf15/Package/hca_integration_toolbox/hca_integration_toolbox/workflow/preprocessing/rules/../scripts/utils/sparse_dask.py", line 48, in take_slice
    sliced = x[idx]
             ~^^^^^
  File "/nfs/team205/lf15/miniforge3/envs/rapids_singlecell/lib/python3.11/site-packages/anndata/_core/sparse_dataset.py", line 346, in __getitem__
    sub = mtx[row, col]
          ~~~^^^^^^^^^^
  File "/nfs/team205/lf15/miniforge3/envs/rapids_singlecell/lib/python3.11/site-packages/scipy/sparse/_csr.py", line 24, in __getitem__
    return super().__getitem__(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nfs/team205/lf15/miniforge3/envs/rapids_singlecell/lib/python3.11/site-packages/scipy/sparse/_index.py", line 74, in __getitem__
    return self._get_sliceXslice(row, col)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nfs/team205/lf15/miniforge3/envs/rapids_singlecell/lib/python3.11/site-packages/anndata/_core/sparse_dataset.py", line 166, in _get_sliceXslice
    return super()._get_sliceXslice(row, col)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nfs/team205/lf15/miniforge3/envs/rapids_singlecell/lib/python3.11/site-packages/scipy/sparse/_compressed.py", line 724, in _get_sliceXslice
    return self._get_submatrix(major, minor, copy=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nfs/team205/lf15/miniforge3/envs/rapids_singlecell/lib/python3.11/site-packages/scipy/sparse/_compressed.py", line 885, in _get_submatrix
    return self.copy() if copy else self
           ^^^^^^^^^^^
  File "/nfs/team205/lf15/miniforge3/envs/rapids_singlecell/lib/python3.11/site-packages/anndata/_core/sparse_dataset.py", line 61, in copy
    return super().copy()
           ^^^^^^^^^^^^^^
  File "/nfs/team205/lf15/miniforge3/envs/rapids_singlecell/lib/python3.11/site-packages/scipy/sparse/_data.py", line 96, in copy
    return self._with_data(self.data.copy(), copy=True)
                           ^^^^^^^^^^^^^^
AttributeError: 'Array' object has no attribute 'copy'
[Tue Nov  5 23:53:49 2024]
Error in rule preprocessing_pca:
    jobid: 9
    input: data/out/preprocessing/dataset~my_dataset/file_id~file_1/highly_variable_genes.zarr
    output: data/out/preprocessing/dataset~my_dataset/file_id~file_1/pca.zarr
    conda-env: rapids_singlecell
Feilijiang commented 1 month ago

Hi Michaela, I tried to debug this. I found that the input file of rule preprocessing_pca differs between preprocessing/rules/assemble.smk and preprocessing/rules/rules.smk. The first one is rules.preprocessing_highly_variable_genes.output.zarr and the second one is the input h5ad.

The first one gives me the same error as when I test the example pipeline in this issue, but the second one works. The example command uses the first input file by default.

Could you please help fix it? Thank you very much.

mumichae commented 1 month ago

The input file definition should be correct, since the inputs and outputs get overwritten once the rule is imported. You should be able to ignore the file names of the rules in preprocessing/rules/rules.smk. If there were a conflict between input and output, the dry run would throw an error and the script wouldn't even start.
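For illustration, a sketch of that override mechanism in Snakemake; the module path and rule names here are illustrative rather than the toolbox's actual definitions, apart from the rules.preprocessing_highly_variable_genes.output.zarr reference quoted above:

module preprocessing:
    snakefile: "workflow/preprocessing/rules/rules.smk"

# The h5ad input declared in rules.smk is replaced at import time,
# so only the overridden zarr input below ever enters the DAG.
use rule pca from preprocessing as preprocessing_pca with:
    input:
        zarr=rules.preprocessing_highly_variable_genes.output.zarr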

I suspect either that the input file is corrupt or that there is a version conflict. Could you show me which versions of scipy, dask and anndata you have in the rapids_singlecell environment?
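For reference, a one-liner of my own for printing those versions from inside the activated environment (all three packages expose __version__):

python -c "import scipy, dask, anndata; print(scipy.__version__, dask.__version__, anndata.__version__)"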

Feilijiang commented 1 month ago

Hi Michaela, thank you so much for the quick reply.

Here are the packages in the rapids_singlecell environment.

Package                  Version
------------------------ ----------
anndata                  0.10.0
array_api_compat         1.9
asciitree                0.3.3
asttokens                2.4.1
bokeh                    3.6.0
Brotli                   1.1.0
cached-property          1.5.2
certifi                  2024.8.30
cffi                     1.17.1
click                    8.1.7
cloudpickle              3.1.0
colorama                 0.4.6
comm                     0.2.2
contourpy                1.3.0
cycler                   0.12.1
cytoolz                  1.0.0
dask                     2024.10.0
dask-expr                1.1.16
dask-glm                 0.3.2
dask-ml                  2024.4.4
debugpy                  1.8.7
decorator                5.1.1
distributed              2024.10.0
exceptiongroup           1.2.2
executing                2.1.0
fasteners                0.17.3
filelock                 3.16.1
fonttools                4.54.1
fsspec                   2024.10.0
get-annotations          0.1.2
h2                       4.1.0
h5py                     3.10.0
harmonypy                0.0.10
hpack                    4.0.0
hyperframe               6.0.1
igraph                   0.11.6
imageio                  2.36.0
importlib_metadata       8.5.0
ipykernel                6.29.5
ipython                  8.29.0
jedi                     0.19.1
Jinja2                   3.1.4
joblib                   1.4.2
jupyter_client           8.6.3
jupyter_core             5.7.2
kiwisolver               1.4.7
lazy_loader              0.4
legacy-api-wrap          1.4
leidenalg                0.10.2
llvmlite                 0.43.0
locket                   1.0.0
louvain                  0.8.2
lz4                      4.3.3
MarkupSafe               3.0.2
matplotlib               3.9.2
matplotlib-inline        0.1.7
mpmath                   1.3.0
msgpack                  1.1.0
multipledispatch         0.6.0
munkres                  1.1.4
natsort                  8.4.0
nest_asyncio             1.6.0
networkx                 3.4.2
numba                    0.60.0
numcodecs                0.13.1
numpy                    1.26.4
nvidia-cublas-cu12       12.4.5.8
nvidia-cuda-cupti-cu12   12.4.127
nvidia-cuda-nvrtc-cu12   12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12        9.1.0.70
nvidia-cufft-cu12        11.2.1.3
nvidia-curand-cu12       10.3.5.147
nvidia-cusolver-cu12     11.6.1.9
nvidia-cusparse-cu12     12.3.1.170
nvidia-nccl-cu12         2.21.5
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.4.127
packaging                24.1
pandas                   2.2.3
parso                    0.8.4
partd                    1.4.2
patsy                    0.5.6
pexpect                  4.9.0
pickleshare              0.7.5
pillow                   11.0.0
pip                      24.2
platformdirs             4.3.6
prompt_toolkit           3.0.48
psutil                   6.1.0
ptyprocess               0.7.0
pure_eval                0.2.3
pyarrow                  17.0.0
pycparser                2.22
Pygments                 2.18.0
pynndescent              0.5.13
pyparsing                3.2.0
PySocks                  1.7.1
python-dateutil          2.9.0
pytz                     2024.1
PyYAML                   6.0.2
pyzmq                    26.2.0
rapids_singlecell        0.10.10
scanpy                   1.10.0
scikit-image             0.24.0
scikit-learn             1.5.2
scikit-misc              0.5.1
scipy                    1.14.1
seaborn                  0.13.2
session-info             1.0.0
setuptools               75.1.0
six                      1.16.0
sortedcontainers         2.4.0
sparse                   0.15.4
stack-data               0.6.2
statsmodels              0.14.4
stdlib-list              0.11.0
sympy                    1.13.1
tblib                    3.0.0
texttable                1.7.0
threadpoolctl            3.5.0
tifffile                 2024.9.20
toolz                    1.0.0
torch                    2.5.0
tornado                  6.4.1
tqdm                     4.66.5
traitlets                5.14.3
triton                   3.1.0
typing_extensions        4.12.2
tzdata                   2024.2
umap-learn               0.5.6
unicodedata2             15.1.0
urllib3                  2.2.3
wcwidth                  0.2.13
wheel                    0.44.0
xyzservices              2024.9.0
zarr                     2.18.3
zict                     3.0.0
zipp                     3.20.2
zstandard                0.23.0
mumichae commented 1 month ago

Try downgrading dask by editing rapids_singlecell.yaml and pinning dask<2024.8, then update the environment with the conda env update command below.
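For illustration, the pin would look something like this in envs/rapids_singlecell.yaml; the surrounding file contents are assumed here, only the dask<2024.8 constraint comes from this thread:

dependencies:
  - dask<2024.8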

conda env update -f envs/rapids_singlecell.yaml
Feilijiang commented 1 month ago

Hi Michaela,

I removed the rapids_singlecell environment and reinstalled it. The error disappeared.

mamba env create -f envs/rapids_singlecell.yaml

Many thanks for your help.

mumichae commented 3 weeks ago

Follow-up needed: document that the conda channel priority needs to be set to flexible.
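For reference, that setting corresponds to the standard conda command below; the exact wording for the docs is still to be written:

conda config --set channel_priority flexible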