calico / scnym

Semi-supervised adversarial neural networks for classification of single cell transcriptomics data
https://scnym.research.calicolabs.com
Apache License 2.0

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm` #24

Open · cartal opened this issue 2 years ago

cartal commented 2 years ago

Hi,

Thank you for developing scNym! I have been using it a lot for label transfer tasks and it is great. So far my workflow has worked flawlessly, until I moved to a new workstation.

When I run the following:

import scnym.api

scnym.api.scnym_api(
    adata = combined_object,
    task = 'train',
    groupby = 'cell_states',
    domain_groupby='domain_label',
    out_path = '/scnym_models/healthy/',
    config = 'new_identity_discovery',
)

It fails with the following error:

CUDA compute device found.
32767 unlabeled observations found.
Using unlabeled data as a target set for semi-supervised, adversarial training.

training examples:  (307282, 15412)
target   examples:  (32767, 15412)
X:  (307282, 15412)
y:  (307282,)
Using user provided domain labels.
Found 164 source domains and 6 target domains.
Not weighting classes and not balancing classes.
Found 170 unique domains.
Using MixMatch for semi-supervised learning
Scaling ICL over 100 epochs, 0 epochs for burn in.
Scaling ICL over 20 epochs, 0 epochs for burn in.
Using a Domain Adaptation Loss.
Training...
Epoch 0/99|______________________________|
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
221123_train_scNym_reference-Healthy_model.ipynb Cell 18 in <cell line: 1>()
----> 1 scnym.api.scnym_api(
      2     adata = combined_object,
      3     task = 'train',
      4     groupby = 'cell_states',
      5     domain_groupby='domain_label',
      6     out_path = '/scnym_models/healthy_hlca/',
      7     config = 'new_identity_discovery',
      8 )

File ~/mambaforge/envs/scnym/lib/python3.8/site-packages/scnym/api.py:339, in scnym_api(adata, task, groupby, domain_groupby, out_path, trained_model, config, key_added, copy)
    336         msg = f'{groupby} is not a variable in `adata.obs`'
    337         raise ValueError(msg)
--> 339     scnym_train(
    340         adata=adata,
    341         config=config,
    342     )
    343 else:
    344     # check that a pre-trained model was specified or 
    345     # provided for prediction
    346     if trained_model is None:

File ~/mambaforge/envs/scnym/lib/python3.8/site-packages/scnym/api.py:514, in scnym_train(adata, config)
...
-> 1370     ret = torch.addmm(bias, input, weight.t())
   1371 else:
   1372     output = input.matmul(weight.t())

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
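The last frame of the traceback is `torch.addmm(bias, input, weight.t())`, which is how `torch.nn.functional.linear` computes a linear layer with a bias on 2-D input. One way to check whether the problem lies in the PyTorch/CUDA setup rather than in scNym is to run a single `nn.Linear` forward pass on the GPU. This is an illustrative sketch: the helper name `linear_forward_ok` and the layer sizes are assumptions (15412 only mirrors the gene count in the log above), not scNym's actual architecture.

```python
import torch


def linear_forward_ok(device: str = "cuda", n_obs: int = 256,
                      n_genes: int = 15412, n_hidden: int = 256) -> bool:
    """Run one nn.Linear forward pass on `device` and report whether it worked.

    With a bias and 2-D input, nn.Linear goes through the same addmm/GEMM
    call that fails in the traceback, so an error here points at the
    PyTorch/CUDA/GPU combination rather than at scNym.
    """
    try:
        # Allocation happens inside the try: on an unsupported GPU even
        # creating a tensor on the device can fail.
        layer = torch.nn.Linear(n_genes, n_hidden).to(device)
        x = torch.randn(n_obs, n_genes, device=device)
        out = layer(x)
        if device.startswith("cuda"):
            # CUDA kernels run asynchronously; synchronize so errors surface here.
            torch.cuda.synchronize()
    except RuntimeError as err:
        print(f"Linear forward failed on {device}: {err}")
        return False
    return tuple(out.shape) == (n_obs, n_hidden)


if __name__ == "__main__":
    print("CPU ok: ", linear_forward_ok("cpu"))
    if torch.cuda.is_available():
        print("CUDA ok:", linear_forward_ok("cuda"))
```

If this fails on the GPU but succeeds on the CPU, scNym itself is probably not at fault. Setting the environment variable `CUDA_LAUNCH_BLOCKING=1` before starting Python makes CUDA kernel launches synchronous, which usually attributes the error to the operation that actually failed.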

Since this started after I changed workstations, I assume it is a CUDA compatibility issue, but I can't figure out what exactly is wrong.
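To narrow down a compatibility problem like this, it helps to record which CUDA version the installed PyTorch binary was built against, the cuDNN version, and which GPU PyTorch sees. A minimal sketch, assuming only the standard `torch.version`, `torch.cuda` and `torch.backends.cudnn` helpers (`cuda_environment_report` is a hypothetical name):

```python
import platform


def cuda_environment_report() -> dict:
    """Collect the version information most relevant to CUDA/cuBLAS errors."""
    report = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        report["torch"] = None  # PyTorch not installed in this environment
        return report

    report["torch"] = torch.__version__
    # CUDA version the PyTorch binary was built against (None for CPU-only builds).
    report["torch_cuda"] = torch.version.cuda
    report["cudnn"] = (torch.backends.cudnn.version()
                       if torch.backends.cudnn.is_available() else None)
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        idx = torch.cuda.current_device()
        report["gpu"] = torch.cuda.get_device_name(idx)
        # (major, minor) compute capability, e.g. (8, 0) for an A100.
        report["compute_capability"] = torch.cuda.get_device_capability(idx)
        # get_arch_list() only exists in newer PyTorch releases.
        if hasattr(torch.cuda, "get_arch_list"):
            report["compiled_archs"] = torch.cuda.get_arch_list()
    return report


if __name__ == "__main__":
    for key, value in cuda_environment_report().items():
        print(f"{key:20s} {value}")
```

Comparing this output between the old and the new workstation, together with the driver version shown by `nvidia-smi`, is usually enough to tell whether the PyTorch build predates the new GPU.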

Do you think you could help me with this?

Thank you!

Session info here:

-----
anndata     0.8.0
scanpy      1.6.0
sinfo       0.3.4
-----
PIL                         9.3.0
absl                        NA
asttokens                   NA
backcall                    0.2.0
certifi                     2022.09.24
chardet                     3.0.4
cycler                      0.10.0
cython_runtime              NA
dateutil                    2.8.2
debugpy                     1.5.1
decorator                   5.1.1
dunamai                     1.14.1
entrypoints                 0.4
executing                   0.8.3
get_version                 3.5.4
google                      NA
h5py                        3.7.0
idna                        2.10
igraph                      0.10.2
importlib_metadata          NA
ipykernel                   6.9.1
jedi                        0.18.1
joblib                      1.2.0
kiwisolver                  1.4.4
legacy_api_wrap             1.2
leidenalg                   0.8.0
llvmlite                    0.32.1
louvain                     0.7.0
matplotlib                  3.5.3
mpl_toolkits                NA
natsort                     8.2.0
numba                       0.49.1
numexpr                     2.8.4
numpy                       1.23.5
packaging                   21.3
pandas                      1.5.1
parso                       0.8.3
pexpect                     4.8.0
pickleshare                 0.7.5
pkg_resources               NA
prompt_toolkit              3.0.20
ptyprocess                  0.7.0
pure_eval                   0.2.2
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.6.0
pydevd_concurrency_analyser NA
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pygments                    2.11.2
pyparsing                   3.0.9
pytz                        2022.6
requests                    2.23.0
scipy                       1.4.1
scnym                       0.3.2
setuptools                  65.5.1
setuptools_scm              NA
six                         1.16.0
sklearn                     0.22.2.post1
stack_data                  0.2.0
tables                      3.6.1
tensorboard                 2.2.1
texttable                   1.6.5
torch                       1.4.0
torchvision                 0.5.0
tornado                     6.1
tqdm                        4.44.1
traitlets                   5.1.1
typing_extensions           NA
urllib3                     1.25.8
wcwidth                     0.2.5
yaml                        5.3.1
zipp                        NA
zmq                         23.2.0
-----
IPython             8.4.0
jupyter_client      7.2.2
jupyter_core        4.10.0
-----
Python 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:49:35) [GCC 10.4.0]
Linux-6.0.8-200.fc36.x86_64-x86_64-with-glibc2.10
16 logical CPU cores, x86_64
-----
Session information updated at 2022-11-23 15:06

nagendraKU commented 1 year ago

I ran into the same issue when trying to run training on Google Colab (high-memory VM with an A100 GPU). I installed scNym from GitHub using pip.

Any help is appreciated!
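The A100 has compute capability 8.0, and CUDA support for that architecture (sm_80) arrived with CUDA 11.0. A PyTorch build compiled against CUDA 10.x, such as the torch 1.4.0 listed in the first report's session info, therefore cannot ship native kernels for it. That would be consistent with this error, although it is a likely explanation rather than a confirmed diagnosis. The sketch below encodes NVIDIA's general compatibility rules: native `sm_XY` code runs on GPUs with the same major version and an equal or higher minor version, and `compute_XY` PTX can be JIT-compiled for newer GPUs. It only reasons about the architecture strings passed in; bundled libraries such as cuBLAS have their own support matrix. The function name `arch_supported` is hypothetical.

```python
import re
from typing import Iterable, Tuple

# Matches strings like "sm_75", "compute_75" or "sm_90a".
_ARCH_RE = re.compile(r"(sm|compute)_(\d+)([a-z]?)")


def arch_supported(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Return True if a GPU with compute capability (major, minor) can run
    code compiled for at least one architecture in `arch_list`."""
    major, minor = capability
    target = major * 10 + minor
    for arch in arch_list:
        match = _ARCH_RE.fullmatch(arch)
        if match is None:
            continue  # ignore entries we do not understand
        kind, value, suffix = match.group(1), int(match.group(2)), match.group(3)
        if suffix:
            # Arch-specific variants (e.g. "sm_90a") only run on that exact GPU.
            if value == target:
                return True
            continue
        if kind == "sm" and value // 10 == major and value <= target:
            # Native code: same major architecture, equal or newer minor version.
            return True
        if kind == "compute" and value <= target:
            # PTX can be JIT-compiled by the driver for newer GPUs.
            return True
    return False


if __name__ == "__main__":
    # A build without sm_80 or any PTX cannot run on an A100 (8, 0).
    print(arch_supported((8, 0), ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"]))
```

On PyTorch releases that provide `torch.cuda.get_arch_list()` (it is missing from older releases), the check becomes `arch_supported(torch.cuda.get_device_capability(), torch.cuda.get_arch_list())`.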

cartal commented 1 year ago

@jacobkimmel any help/advice with this would be much appreciated! Thanks

sruthi-hub commented 1 year ago

@cartal @nagendraKU Were you able to fix this? I have the same error.

nagendraKU commented 1 year ago

@cartal @sruthi-hub Since the repo seems to be inactive, I am pasting below the package list from a working local conda installation of scNym. Maybe it helps you get scNym running locally.

You will also need CUDA toolkit 10.2.89 and cuDNN 8.2.4.15 (the build for CUDA 10.2). My conda env also has gcc 11.1.0, but I am not sure whether this is strictly needed (my HPC system loads a bunch of things by default).

absl-py==1.0.0
anndata==0.7.4
anyio==3.5.0
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
asttokens==2.0.5
attrs==21.4.0
Babel==2.10.1
backcall==0.2.0
beautifulsoup4==4.11.1
bleach==5.0.0
cachetools==4.2.4
certifi==2022.9.24
cffi==1.15.0
chardet==3.0.4
charset-normalizer==2.1.1
ConfigArgParse==1.1
cycler==0.11.0
debugpy==1.6.0
decorator==5.1.1
defusedxml==0.7.1
dunamai==1.11.1
entrypoints==0.4
executing==0.8.3
fastjsonschema==2.15.3
fonttools==4.32.0
get_version==3.5.4
google-auth==1.35.0
google-auth-oauthlib==0.4.6
grpcio==1.44.0
h5py==2.10.0
idna==3.4
igraph==0.9.10
importlib-metadata==4.11.3
importlib-resources==5.7.1
ipykernel==6.13.0
ipython==8.2.0
ipython-genutils==0.2.0
jedi==0.18.1
Jinja2==3.1.1
joblib==1.1.0
json5==0.9.6
jsonschema==4.4.0
jupyter-client==7.2.2
jupyter-core==4.10.0
jupyter-server==1.16.0
jupyterlab==3.3.4
jupyterlab-pygments==0.2.2
jupyterlab-server==2.13.0
kiwisolver==1.4.2
legacy-api-wrap==1.2
leidenalg==0.8.0
llvmlite==0.32.1
louvain==0.7.0
Markdown==3.3.6
MarkupSafe==2.1.1
matplotlib==3.5.1
matplotlib-inline==0.1.3
mistune==0.8.4
more-itertools==8.12.0
natsort==8.1.0
nbclassic==0.3.7
nbclient==0.6.0
nbconvert==6.5.0
nbformat==5.3.0
nest-asyncio==1.5.5
networkx==2.8
notebook==6.4.11
notebook-shim==0.1.0
numba==0.49.1
numexpr==2.8.1
numpy==1.23.3
numpy-groupies==0.9.13
oauthlib==3.2.0
packaging==21.3
pandas==1.4.2
pandocfilters==1.5.0
parso==0.8.3
patsy==0.5.2
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.2.0
pluggy==0.13.1
prometheus-client==0.14.1
prompt-toolkit==3.0.29
protobuf==3.20.0
psutil==5.9.0
ptyprocess==0.7.0
pure-eval==0.2.2
py==1.11.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
Pygments==2.11.2
pynndescent==0.5.6
pyparsing==3.0.8
pyrsistent==0.18.1
pytest==5.4.1
python-dateutil==2.8.1
python-igraph==0.9.10
pytz==2022.1
PyYAML==5.3.1
pyzmq==22.3.0
requests==2.28.1
requests-cache==0.5.2
requests-oauthlib==1.3.0
requests-toolbelt==0.9.1
rsa==4.8
scanpy==1.6.0
scikit-learn==0.22.2.post1
scikit-misc==0.1.3
scipy==1.8.0
scnym==0.3.2
seaborn==0.11.2
Send2Trash==1.8.0
setuptools-scm==6.4.2
sinfo==0.3.4
six==1.14.0
sniffio==1.2.0
soupsieve==2.3.2.post1
stack-data==0.2.0
statsmodels==0.13.2
stdlib-list==0.8.0
tables==3.6.1
tensorboard==2.2.1
tensorboard-plugin-wit==1.6.0.post2
tensorboardX==2.1
terminado==0.13.3
texttable==1.6.4
tinycss2==1.1.1
tomli==2.0.1
torch==1.11.0
torchvision==0.12.0
tornado==6.1
tqdm==4.44.1
traitlets==5.1.1
typing_extensions==4.3.0
umap-learn==0.3.10
urllib3==1.26.12
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.3.2
Werkzeug==2.1.1
zipp==3.8.0
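To check a local environment against the list above, the pins can be compared with what is actually installed. A minimal sketch using `importlib.metadata` from the standard library (Python 3.8+); `parse_pins` and `find_mismatches` are hypothetical helper names, and versions are compared as exact strings, so a local build tag such as `+cu102` counts as a mismatch.

```python
import re
from importlib import metadata
from typing import Dict, Optional, Tuple


def _normalise(name: str) -> str:
    # PEP 503 style: case-insensitive, runs of '-', '_' and '.' are equivalent.
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_pins(requirements_text: str) -> Dict[str, str]:
    """Parse 'name==version' lines (pip freeze style) into {name: version}."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[_normalise(name.strip())] = version.strip()
    return pins


def find_mismatches(requirements_text: str,
                    installed: Optional[Dict[str, str]] = None
                    ) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {package: (pinned, installed or None)} for every pin that differs.

    If `installed` is None, the versions in the current environment are used.
    """
    if installed is None:
        installed = {
            dist.metadata["Name"]: dist.version
            for dist in metadata.distributions()
            if dist.metadata["Name"]  # skip distributions with broken metadata
        }
    current = {_normalise(name): version for name, version in installed.items()}
    return {
        name: (wanted, current.get(name))
        for name, wanted in parse_pins(requirements_text).items()
        if current.get(name) != wanted
    }


if __name__ == "__main__":
    example = "torch==1.11.0\nscnym==0.3.2\nscanpy==1.6.0\n"
    print(find_mismatches(example))
```

For example, after saving the list above to a file (say `working_env.txt`, a hypothetical name), `print(find_mismatches(open("working_env.txt").read()))` run in the environment from the original report would flag at least `torch` (1.4.0 installed vs 1.11.0 pinned). The CUDA toolkit and cuDNN are not pip packages, so they have to be checked separately.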