explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

ROCm 5.7 + Spacy #13279

Closed: erasmus74 closed this 9 months ago

erasmus74 commented 9 months ago

Hello, I have a vested interest in taking this over and hopefully finishing the process.

ROCm is currently at version 6.0, while CuPy has experimental support for ROCm 5.7, so we will pin ROCm to 5.7 in tests until CuPy supports 6.0.

So here's my testing environment: [screenshot]

Here are my Python, ROCm and pip package versions (in a virtualenv): [screenshot]

Package                    Version
-------------------------- --------------------------
annotated-types            0.6.0
blis                       0.7.11
catalogue                  2.0.10
certifi                    2023.11.17
charset-normalizer         3.3.2
click                      8.1.7
cloudpathlib               0.16.0
confection                 0.1.4
cupy                       13.0.0
curated-tokenizers         0.0.9
curated-transformers       0.1.1
cymem                      2.0.8
en-core-web-lg             3.7.1
en-core-web-trf            3.7.3
fastrlock                  0.8.2
filelock                   3.13.1
fsspec                     2023.12.2
huggingface-hub            0.20.3
idna                       3.6
Jinja2                     3.1.3
langcodes                  3.3.0
MarkupSafe                 2.1.4
mpmath                     1.3.0
murmurhash                 1.0.10
networkx                   3.2.1
numpy                      1.26.3
nvidia-cublas-cu12         12.1.3.1
nvidia-cuda-cupti-cu12     12.1.105
nvidia-cuda-nvrtc-cu12     12.1.105
nvidia-cuda-runtime-cu12   12.1.105
nvidia-cudnn-cu12          8.9.2.26
nvidia-cufft-cu12          11.0.2.54
nvidia-curand-cu12         10.3.2.106
nvidia-cusolver-cu12       11.4.5.107
nvidia-cusparse-cu12       12.1.0.106
nvidia-nccl-cu12           2.18.1
nvidia-nvjitlink-cu12      12.3.101
nvidia-nvtx-cu12           12.1.105
packaging                  23.2
Pillow                     9.3.0
pip                        23.3.2
preshed                    3.0.9
pydantic                   2.5.3
pydantic_core              2.14.6
pytorch-triton-rocm        3.0.0+dafe145982
PyYAML                     6.0.1
regex                      2023.12.25
requests                   2.31.0
safetensors                0.4.2
setuptools                 69.0.3
smart-open                 6.4.0
spacy                      3.7.2
spacy-alignments           0.9.1
spacy-curated-transformers 0.2.2
spacy-legacy               3.0.12
spacy-loggers              1.0.5
spacy-lookups-data         1.0.5
spacy-transformers         1.3.4
srsly                      2.4.8
sympy                      1.12
thinc                      8.2.2
tokenizers                 0.15.1
torch                      2.3.0.dev20240126+rocm5.7
torchaudio                 2.2.0.dev20240126+rocm5.7
torchvision                0.18.0.dev20240126+rocm5.7
tqdm                       4.66.1
transformers               4.36.2
triton                     2.1.0
typer                      0.9.0
typing_extensions          4.9.0
urllib3                    2.1.0
wasabi                     1.1.2
weasel                     0.3.4
wheel                      0.42.0

Finally, here are the environment variables relevant to the spaCy deployment:

CUPY_INSTALL_USE_HIP=1
HCC_AMDGPU_TARGET=gfx1100
HIP_VISIBLE_DEVICES=0
__HIP_PLATFORM_HCC__
ROCM_PATH=/opt/rocm
HSA_OVERRIDE_GFX_VERSION=11.0.0
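Because several of these variables only take effect if they are set exactly right, a quick pre-flight check can save a failed training run. This is a minimal sketch: the expected values are simply the ones listed above for a gfx1100 card, `check_rocm_env` is a hypothetical helper, and the X.Y.Z format check for `HSA_OVERRIDE_GFX_VERSION` is an assumption based on the `11.0.0` value used here. `__HIP_PLATFORM_HCC__` is left out because no value is shown for it.

```python
import os
import re

# Expected values copied from the list above (gfx1100 card); adjust for other GPUs.
EXPECTED = {
    "HCC_AMDGPU_TARGET": "gfx1100",
    "HSA_OVERRIDE_GFX_VERSION": "11.0.0",
    "HIP_VISIBLE_DEVICES": "0",
    "ROCM_PATH": "/opt/rocm",
}


def check_rocm_env(env=None, expected=EXPECTED):
    """Return a list of problems; an empty list means everything matched."""
    env = os.environ if env is None else env
    problems = []
    for name, want in expected.items():
        got = env.get(name)
        if got is None:
            problems.append(f"{name} is not set (expected {want!r})")
        elif got != want:
            problems.append(f"{name}={got!r} (expected {want!r})")
    # Assumed format: MAJOR.MINOR.STEPPING, e.g. 11.0.0
    override = env.get("HSA_OVERRIDE_GFX_VERSION")
    if override is not None and not re.fullmatch(r"\d+\.\d+\.\d+", override):
        problems.append(f"HSA_OVERRIDE_GFX_VERSION={override!r} is not in X.Y.Z form")
    return problems


if __name__ == "__main__":
    for problem in check_rocm_env() or ["all expected variables look fine"]:
        print(problem)
```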

When I run the GPU availability test on my AMD system:

❯ python -c 'import torch; print(torch.cuda.is_available())'
True
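`torch.cuda.is_available()` returning `True` only shows that the ROCm build of PyTorch sees the GPU; the tracebacks further down go through Thinc and CuPy instead. A minimal sketch for checking both libraries side by side, assuming only the calls named here (`torch.cuda.is_available`, `torch.version.hip`, `cupy.cuda.runtime.is_hip`, `cupy.cuda.runtime.getDeviceCount`); `probe_backends` is a hypothetical helper.

```python
def probe_backends():
    """Report what PyTorch and CuPy each see; either library may be missing."""
    report = {"torch_gpu": None, "torch_hip": None, "cupy_devices": None, "cupy_is_hip": None}
    try:
        import torch

        report["torch_gpu"] = torch.cuda.is_available()
        # A version string on ROCm builds of PyTorch, None on CUDA builds
        report["torch_hip"] = torch.version.hip
    except ImportError:
        pass
    try:
        import cupy

        # True when CuPy was built against HIP rather than CUDA
        report["cupy_is_hip"] = cupy.cuda.runtime.is_hip
        report["cupy_devices"] = cupy.cuda.runtime.getDeviceCount()
    except ImportError:
        pass
    except Exception as exc:  # e.g. a runtime error when no device is usable
        report["cupy_devices"] = f"error: {type(exc).__name__}"
    return report


if __name__ == "__main__":
    for key, value in probe_backends().items():
        print(f"{key:>13}: {value}")
```

If `torch_gpu` is `True` but `cupy_devices` is an error or `cupy_is_hip` is `False`, the problem is on the CuPy side rather than in the ROCm driver stack.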

I installed CuPy using the steps described at: https://docs.cupy.dev/en/latest/install.html#using-cupy-on-amd-gpu-experimental

export HCC_AMDGPU_TARGET=gfx1100
export ROCM_HOME=/opt/rocm
export CUPY_INSTALL_USE_HIP=1
pip install cupy
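The same install can be scripted from Python so the three variables are guaranteed to reach the build. A sketch under stated assumptions: `hip_install_env` and `install_cupy_for_hip` are hypothetical helpers, the values are the ones used above, and `--no-cache-dir` is added because pip caches wheels it builds from source, so a reinstall could otherwise silently reuse an earlier (possibly non-HIP) build.

```python
import os
import subprocess
import sys


def hip_install_env(gfx_target="gfx1100", rocm_home="/opt/rocm", base=None):
    """Copy the environment and add the variables used for a HIP build of CuPy."""
    env = dict(os.environ if base is None else base)
    env["HCC_AMDGPU_TARGET"] = gfx_target
    env["ROCM_HOME"] = rocm_home
    env["CUPY_INSTALL_USE_HIP"] = "1"
    return env


def install_cupy_for_hip(dry_run=True, **kwargs):
    """Build (and, if dry_run is False, run) the pip command with the HIP env."""
    # --no-cache-dir: avoid reusing a previously built CuPy wheel from pip's cache
    cmd = [sys.executable, "-m", "pip", "install", "--no-cache-dir", "cupy"]
    env = hip_install_env(**kwargs)
    if not dry_run:
        subprocess.run(cmd, env=env, check=True)
    return cmd, env


if __name__ == "__main__":
    command, _ = install_cupy_for_hip()
    print(" ".join(command))  # dry run: only shows the command
```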

Test script for CUDA/ROCm:

❯ python test_cuda.py

Checking ROCM support...
GOOD: ROCM devices found:  3
Checking PyTorch...
GOOD: PyTorch is working fine.
Checking user groups...
GOOD: The user master is in RENDER and VIDEO groups.
GOOD: PyTorch ROCM support found.
Testing PyTorch ROCM support...
Everything fine! You can run PyTorch code inside of: 
--->  AMD Ryzen 9 7900X 12-Core Processor  
--->  gfx1100            
--->  gfx1036    
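The test_cuda.py script itself isn't included here. As a hedged sketch of just the user-group step (a hypothetical helper, not the author's script), this checks the current process against the render and video groups using only the Linux standard library.

```python
import grp
import os


def missing_gpu_groups(required=("render", "video")):
    """Return the groups in `required` that the current process is not in."""
    names = set()
    for gid in set(os.getgroups()) | {os.getgid()}:
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:  # gid without an /etc/group entry (common in containers)
            continue
    return [name for name in required if name not in names]


if __name__ == "__main__":
    missing = missing_gpu_groups()
    print("missing groups:", ", ".join(missing) if missing else "none")
```

Because this reads the running process's groups, a user added with `usermod -aG` only shows up after logging out and back in.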

Installing spaCy:

pip install -U pip setuptools wheel
git clone https://github.com/explosion/spaCy
cd spaCy
pip install -r requirements.txt
pip install --no-build-isolation --editable '.[transformers,lookups]'
python -m spacy download en_core_web_lg
python -m spacy download en_core_web_trf
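To confirm both pipelines actually landed in the virtualenv before training, `spacy.util.is_package` can be queried per name. A small sketch; `installed_pipelines` is a hypothetical helper and the names are the two pipelines from this post.

```python
def installed_pipelines(names=("en_core_web_lg", "en_core_web_trf")):
    """Map each pipeline package name to whether it is installed."""
    try:
        from spacy.util import is_package
    except ImportError:  # spaCy itself is not installed
        return {name: False for name in names}
    return {name: is_package(name) for name in names}


if __name__ == "__main__":
    for name, ok in installed_pipelines().items():
        print(f"{name}: {'installed' if ok else 'missing'}")
```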

Testing with a basic spacy train run:

❯ HSA_OVERRIDE_GFX_VERSION=11.0.0 spacy train config.cfg --gpu-id 0
/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
ℹ No output directory provided
ℹ Using GPU: 0

=========================== Initializing pipeline ===========================
Traceback (most recent call last):
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/bin/spacy", line 8, in <module>
    sys.exit(setup_cli())
             ^^^^^^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/spacy/cli/_util.py", line 87, in setup_cli
    command(prog_name=COMMAND)
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/typer/core.py", line 778, in main
    return _main(
           ^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/spacy/cli/train.py", line 54, in train_cli
    train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/spacy/cli/train.py", line 81, in train
    nlp = init_nlp(config, use_gpu=use_gpu)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/spacy/training/initialize.py", line 44, in init_nlp
    fix_random_seed(config["training"]["seed"])
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/thinc/util.py", line 105, in fix_random_seed
    cupy.random.seed(seed)
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/cupy/random/_generator.py", line 1274, in seed
    get_random_state().seed(seed)
    ^^^^^^^^^^^^^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/cupy/random/_generator.py", line 1306, in get_random_state
    rs = RandomState(seed)
         ^^^^^^^^^^^^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/cupy/random/_generator.py", line 60, in __init__
    self._generator = curand.createGenerator(method)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "cupy_backends/cuda/libs/curand.pyx", line 95, in cupy_backends.cuda.libs.curand.createGenerator
  File "cupy_backends/cuda/libs/curand.pyx", line 99, in cupy_backends.cuda.libs.curand.createGenerator
  File "cupy_backends/cuda/libs/curand.pyx", line 88, in cupy_backends.cuda.libs.curand.check_status
cupy_backends.cuda.libs.curand.CURANDError: CURAND_STATUS_ALLOCATION_FAILED
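The traceback shows the failure happens inside `cupy.random.seed`, called from Thinc's `fix_random_seed`, before spaCy does any real work. A minimal sketch that calls the same CuPy function directly, so the error can be reproduced (run with `HSA_OVERRIDE_GFX_VERSION=11.0.0`) without spaCy or Thinc involved; `try_cupy_seed` is a hypothetical helper.

```python
def try_cupy_seed(seed=0):
    """Call cupy.random.seed outside spaCy; return (ok, message)."""
    try:
        import cupy
    except ImportError:
        return False, "cupy is not installed"
    try:
        # Per the traceback, this reaches curand.createGenerator inside CuPy
        cupy.random.seed(seed)
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, "cupy.random.seed succeeded"


if __name__ == "__main__":
    ok, message = try_cupy_seed()
    print("OK" if ok else "FAILED", "-", message)
```

If this alone raises `CURAND_STATUS_ALLOCATION_FAILED`, the issue sits between CuPy and the ROCm random-number library rather than in spaCy.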

Testing without the HSA variable, to show that the GPU is actually being targeted:

❯ HSA_OVERRIDE_GFX_VERSION= spacy train config.cfg --gpu-id 0
/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
ℹ No output directory provided
ℹ Using GPU: 0
Traceback (most recent call last):
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/bin/spacy", line 8, in <module>
    sys.exit(setup_cli())
             ^^^^^^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/spacy/cli/_util.py", line 87, in setup_cli
    command(prog_name=COMMAND)
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/typer/core.py", line 778, in main
    return _main(
           ^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/spacy/cli/train.py", line 54, in train_cli
    train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/spacy/cli/train.py", line 76, in train
    setup_gpu(use_gpu)
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/spacy/cli/_util.py", line 272, in setup_gpu
    require_gpu(use_gpu)
  File "/home/master/workspace/spaCy-e360/spaCy/spacy/.env/lib/python3.11/site-packages/thinc/util.py", line 232, in require_gpu
    raise ValueError("No GPU devices detected")
ValueError: No GPU devices detected
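This second run fails earlier, in Thinc's `require_gpu`. Note that `HSA_OVERRIDE_GFX_VERSION= spacy ...` sets the variable to an empty string rather than unsetting it, which may not behave the same as leaving it out. A hedged sketch comparing CuPy's device count under each setting; each probe runs in a fresh interpreter on the assumption that the ROCm runtime reads the variable when it initializes. `cupy_devices_with_override` and `PROBE` are hypothetical names.

```python
import os
import subprocess
import sys

# Child-process probe: print CuPy's device count, or the exception type.
PROBE = """
try:
    import cupy
    print(cupy.cuda.runtime.getDeviceCount())
except Exception as exc:
    print("error:", type(exc).__name__)
"""


def cupy_devices_with_override(value):
    """Run PROBE with HSA_OVERRIDE_GFX_VERSION set to `value` (None unsets it)."""
    env = dict(os.environ)
    if value is None:
        env.pop("HSA_OVERRIDE_GFX_VERSION", None)
    else:
        env["HSA_OVERRIDE_GFX_VERSION"] = value
    result = subprocess.run(
        [sys.executable, "-c", PROBE], env=env, capture_output=True, text=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    for value in ("11.0.0", "", None):
        print(repr(value), "->", cupy_devices_with_override(value))
```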

So it seems we have ROCm 5.7 and CuPy with ROCm support, and the issue surfaces as:

cupy_backends.cuda.libs.curand.CURANDError: CURAND_STATUS_ALLOCATION_FAILED

However, looking at the ROCm 6 hipRAND release notes: https://rocm.docs.amd.com/en/latest/about/release-notes.html#hiprand

I think it's possible that everything is correct, but we simply can't use an equivalent ROCm function. I have reached the limit of my understanding of this so far, though.
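One way to test that theory is to check which hipRAND and rocRAND shared libraries are actually present in the ROCm install. A minimal sketch, assuming the libraries live under `$ROCM_PATH/lib` (default `/opt/rocm`) with the usual `lib<name>.so*` naming; exact file names vary by release, and `find_rand_libs` is a hypothetical helper.

```python
import glob
import os


def find_rand_libs(rocm_path=None):
    """List hipRAND/rocRAND shared libraries under ROCM_PATH/lib."""
    rocm_path = rocm_path or os.environ.get("ROCM_PATH", "/opt/rocm")
    found = {}
    for name in ("hiprand", "rocrand"):
        # e.g. libhiprand.so, libhiprand.so.1, ...
        pattern = os.path.join(rocm_path, "lib", f"lib{name}.so*")
        found[name] = sorted(glob.glob(pattern))
    return found


if __name__ == "__main__":
    for name, paths in find_rand_libs().items():
        print(name, "->", paths or "not found")
```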

Another possibility is that Thinc doesn't support ROCm and so can't load the GPU into a buffer. I did test with different sets of packages (for example, removing the HIP libraries or using alternate ROCm libraries), which resulted in various combinations of "GPU not found" and "HIP GPU not found". This is the farthest I've gotten: it's actually trying to initialize the GPU for processing.
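To see what Thinc itself concludes, its module-level availability flags and the active `Ops` backend can be printed. A sketch with assumptions: the flag names `has_cupy`, `has_cupy_gpu` and `has_gpu` in `thinc.compat` are taken from recent Thinc 8.x releases and are read with `getattr` in case they differ; `thinc_gpu_flags` is a hypothetical helper.

```python
def thinc_gpu_flags():
    """Report Thinc's own view of GPU availability and the current Ops backend."""
    try:
        import thinc.compat as compat
        from thinc.api import get_current_ops
    except ImportError:
        return {"thinc_installed": False}
    return {
        "thinc_installed": True,
        # getattr with a default in case a flag name differs between versions
        "has_cupy": getattr(compat, "has_cupy", None),
        "has_cupy_gpu": getattr(compat, "has_cupy_gpu", None),
        "has_gpu": getattr(compat, "has_gpu", None),
        "current_ops": get_current_ops().name,  # "numpy" until a GPU is activated
    }


if __name__ == "__main__":
    print(thinc_gpu_flags())
```

If `has_cupy` is `True` but `has_cupy_gpu` is `False`, Thinc found CuPy but CuPy could not see a device, which would match the "No GPU devices detected" error above.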

Happy to test other scripts or debug as needed. I am dedicated to this issue, as it's the only thing holding me back from using the GPU for much of my spaCy work. Thank you!

Originally posted by @erasmus74 in https://github.com/explosion/spaCy/discussions/12229#discussioncomment-8262332

svlandeg commented 9 months ago

To avoid duplicate conversations, I will close this one and refer to the original thread: https://github.com/explosion/spaCy/discussions/12229

erasmus74 commented 9 months ago

This is not a duplicate: this ticket is specifically about enabling ROCm 5.7/HIP with spaCy. In the discussion, the user is installing CuPy with ROCm 5.0 using a method that is long outdated for CuPy, on a long-outdated version of ROCm. ROCm 6 is in fact publicly released by now.

In the referenced thread, the user is trying to compile ROCm 5.0 and failing regardless of spaCy. I have installed ROCm 5.6, 5.7 and 6.0 in an effort to get spaCy to actually use it.

ekazakos commented 8 months ago

@erasmus74 Did you have any progress / findings on this?

erasmus74 commented 8 months ago

Yes actually. I'll edit this message tomorrow with some findings. I haven't had success yet, but a little progress since then.

github-actions[bot] commented 7 months ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.