GoogleCloudDataproc / initialization-actions

Run on all nodes of your cluster before the cluster starts - lets you customize your cluster
https://cloud.google.com/dataproc/init-actions
Apache License 2.0

Could not use cudf or cuml when rapids-runtime = DASK #1039

Open blis-teng opened 1 year ago

blis-teng commented 1 year ago

I am trying to set up a Dataproc cluster with GPUs attached in order to use cuml and cudf. I followed the instructions at https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/rapids/README.md and was able to set up the cluster with the NVIDIA driver installed successfully. But when I try

import cudf

It throws out the error

TypeError: C function cuda.ccudart.cudaStreamSynchronize has wrong signature (expected __pyx_t_4cuda_7ccudart_cudaError_t (__pyx_t_4cuda_7ccudart_cudaStream_t), got cudaError_t (cudaStream_t))

I followed the instructions here: https://docs.rapids.ai/notices/rsn0020/ But after the downgrade, another error shows up when importing cudf:

No module named 'pandas.core.arrays._arrow_utils'

The dask-rapids version installed by rapids.sh is 22.04.
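For reference, the signature TypeError above is the kind of breakage RSN 20 covers: a cuda-python release changed the ABI that older cudf builds were compiled against. A minimal sketch of the implied version gate follows; the 11.7.1 boundary is an assumption taken from my reading of the notice, not something stated in this thread, so verify it against the notice itself.

```python
# Hedged sketch: decide whether an installed cuda-python version predates the
# ABI break described in RAPIDS Support Notice 20. The 11.7.1 boundary is an
# assumption from the notice; verify before relying on it.

def predates_abi_break(installed: str, breaking: str = "11.7.1") -> bool:
    """True if `installed` sorts strictly below the first breaking release."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) < as_tuple(breaking)

# A cuda-python such as 11.8.1 (seen later in this thread) would be flagged:
print(predates_abi_break("11.8.1"))  # → False: expect the TypeError
print(predates_abi_break("11.6.0"))  # → True: safe with older cudf builds
```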

cjac commented 1 year ago

Thank you for this report!

@nvliyuan do you want to take a look at this?

cjac commented 1 year ago

Actually, nvliyuan has been contributing to the spark runtime. Not certain who to tap about the dask runtime. I'll check the commit history shortly and get back to you.

nvliyuan commented 1 year ago

Hi @cjac, the dask script has been failing since version 22.06 (2022.06), see these comments, so I believe this issue has existed for a long time. Maybe @mengdong @sameerz could bring in some dask-rapids folks?

jacobtomlinson commented 1 year ago

Hey folks! I work on RAPIDS and Dask, happy to help. We are currently in the process of documenting and testing deploying RAPIDS on cloud platforms but I expect we will not get to Dataproc until after the holidays. But we will definitely dig into this as part of that work.

Pinging @mroeschke who may have some quick thoughts about the Pandas error. I expect pandas needs upgrading/downgrading.

mroeschke commented 1 year ago

I suspect your environment has pandas>=1.5 installed, and cudf was not compatible with that version of pandas until 22.10.

Therefore if you downgrade pandas<1.5 or upgrade cudf>22.10 the error No module named 'pandas.core.arrays._arrow_utils' should go away
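The rule above can be sketched as a simple check; the version boundaries (pandas 1.5, cudf 22.10) are taken from this comment and should be treated as assumptions.

```python
# Sketch of the pandas/cudf compatibility rule described above. Boundaries
# (pandas 1.5, cudf 22.10) come from this thread; treat them as assumptions.

def combo_is_compatible(pandas_version: str, cudf_version: str) -> bool:
    parse = lambda v: tuple(int(p) for p in v.split("."))
    if parse(cudf_version) >= (22, 10):
        return True  # newer cudf supports pandas 1.5+
    return parse(pandas_version) < (1, 5)  # older cudf needs pandas < 1.5

print(combo_is_compatible("1.5.2", "22.04.00"))  # → False: the reported setup
print(combo_is_compatible("1.3.5", "22.04.00"))  # → True: downgrade pandas
print(combo_is_compatible("1.5.2", "22.10.00"))  # → True: upgrade cudf
```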

cjac commented 1 year ago

Thank you Jacob and Matt!

@blis-teng - please let us know if this solves this issue for you so we can mark the issue resolved or otherwise offer an appropriate solution.

cjac commented 1 year ago

@blis-teng - are you able to share the gcloud dataproc clusters create command you're using to spin up your cluster? I can try to reproduce it and see if I run into the same problems.

If you've got a support contract with GCP, I'd appreciate it if you could open a support case and provide me the case #. That way we can track our work and share case details privately rather than on the permanent record of the initialization-actions repository. Please do not open development cases as P2 or P1; those are reserved for production outage situations, and development is by definition not a production environment.

C.J. in Cloud Support, Seattle

blis-teng commented 1 year ago

> I suspect your environment has pandas>=1.5 installed, and cudf was not compatible with that version of pandas until 22.10.
>
> Therefore if you downgrade pandas<1.5 or upgrade cudf>22.10 the error No module named 'pandas.core.arrays._arrow_utils' should go away

I have tried both, but neither works.

  1. If I install pandas 1.3 in rapids.sh, running "conda list" on the Dataproc cluster still shows version 1.5, and the import error changes but is still related to pandas.
  2. If I try to install pandas 1.3 in a Jupyter notebook after the Dataproc cluster is ready, the "mamba install" blocks the installation because some dependencies cannot be resolved from any of the given channels.

blis-teng commented 1 year ago

I used the command line from the documentation at https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/rapids/README.md Minor details may differ, but the key parameters (gpu driver, rapids-runtime) are the same.

export CLUSTER_NAME=<cluster_name>
export GCS_BUCKET=<your bucket for the logs and notebooks>
export REGION=<region>
export NUM_GPUS=1
export NUM_WORKERS=2

gcloud dataproc clusters create $CLUSTER_NAME  \
    --region $REGION \
    --image-version=dp20 \
    --master-machine-type n1-custom-63500 \
    --num-workers $NUM_WORKERS \
    --worker-accelerator type=nvidia-tesla-t4,count=$NUM_GPUS \
    --worker-machine-type n1-standard-8 \
    --num-worker-local-ssds 1 \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh,gs://goog-dataproc-initialization-actions-${REGION}/rapids/rapids.sh \
    --optional-components=JUPYTER,ZEPPELIN \
    --metadata gpu-driver-provider="NVIDIA",rapids-runtime="DASK" \
    --bucket $GCS_BUCKET

cjac commented 1 year ago

okay, I'll try to reproduce it now.

cjac commented 1 year ago

With these arguments, it is installing pandas-1.2.5 and libcudf-22.04.00-cuda11. I think I found a bug in the rapids.sh script. I'll see if patching it improves the situation.

cjac commented 1 year ago

In order to use 22.10 with pandas>=1.5, I need to upgrade these python packages:

"cuspatial=${CUSPATIAL_VERSION}" "rope>=0.9.4" "gdal>3.5.0"

And gdal>3.5.0 is not available in bullseye. Backports only go up to 3.2, so I'm going to try ubuntu20.

cjac commented 1 year ago

cjac@cluster-1668020639-w-0:~$ apt-cache show libgdal-dev | grep ^Version
Version: 3.0.4+dfsg-1build3
cjac@cluster-1668020639-w-0:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.5 LTS
Release:        20.04
Codename:       focal

cjac commented 1 year ago

So no, it looks like the pandas >= 1.5 path is not viable here. I'll try the lower versions.

cjac commented 1 year ago

+ mamba install -y --no-channel-priority -c conda-forge -c nvidia -c rapidsai cudatoolkit=11.5 'pandas<1.5' rapids=22.04

Looking for: ['cudatoolkit=11.5', "pandas[version='<1.5']", 'rapids=22.04']

Pinned packages:
  - python 3.10.*
  - conda 22.9.*
  - python 3.10.*
  - r-base 4.1.*
  - r-recommended 4.1.*

Encountered problems while solving:
  - package rapids-22.04.00-cuda11_py39_ge08d166_149 requires python >=3.9,<3.10.0a0, but none of the providers can be installed

cjac@cluster-1668020639-w-0:~$ which conda
/opt/conda/default/bin/conda
cjac@cluster-1668020639-w-0:~$ /opt/conda/default/bin/python --version
Python 3.10.8

Now it looks like the python interpreter we install with dataproc is too new for the rapids release. I'll try 22.06 and 22.08 to see if either of those versions work.
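The failed solve above reduces to a simple interval check; the bounds are copied from the mamba error output (python >=3.9,<3.10.0a0 for the rapids 22.04 build).

```python
# Sketch of the Python-version constraint from the mamba output above:
# rapids=22.04 builds require python >=3.9,<3.10, but the image ships 3.10.8.

def satisfies_rapids_2204(python_version: str) -> bool:
    major_minor = tuple(int(p) for p in python_version.split(".")[:2])
    return (3, 9) <= major_minor < (3, 10)

print(satisfies_rapids_2204("3.10.8"))  # → False: why the solve fails
print(satisfies_rapids_2204("3.9.13"))  # → True: pinning python=3.9 works
```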

cjac commented 1 year ago

Okay, I was able to get this working on 2.0-debian10 with dask-rapids 22.06

I had to specify this mamba command:

mamba install -n 'dask-rapids' -y --no-channel-priority -c 'conda-forge' -c 'nvidia' -c 'rapidsai' \
    "cudatoolkit=${CUDA_VERSION}" "pandas<1.5" "rapids=${RAPIDS_VERSION}" "python=3.9"

I'm testing the change with dask-rapids 22.08 ; if that works as well, I will submit a PR.

cjac commented 1 year ago

@blis-teng - please try replacing the rapids.sh you link to from your project's initialization-actions checkout with this one.

https://github.com/cjac/initialization-actions/raw/dask-rapids-202212/rapids/rapids.sh

I am working with the product team to review this change. I should be able to close up PR #1041 pretty quick here.

cjac commented 1 year ago

It sounds like you may not yet have read the README.md[1] from the initialization-actions repository. Can you please review it and confirm that you understand where to copy rapids.sh[2] from my pre-release branch for testing?

[1] https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/README.md#how-initialization-actions-are-used
[2] https://github.com/cjac/initialization-actions/raw/dask-rapids-202212/rapids/rapids.sh

cjac commented 1 year ago

@blis-teng can you re-try using the latest rapids/rapids.sh from github?

blis-teng commented 1 year ago

hi, @cjac sorry for the late reply, I will re-try the new rapids.sh and get back to you next week, thanks!

cjac commented 1 year ago

Thank you. Standing by for confirmation! 20230106T084758 + 7d will be 20230113T084757.

I am presently not able to reproduce your problem. If there is still a change to be made, I'd like to know that information early in the week, please.

cjac commented 1 year ago

Please remember to read the README I referenced. You are violating the guidance by using

--initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh,gs://goog-dataproc-initialization-actions-${REGION}/rapids/rapids.sh \

skirui-source commented 1 year ago

Hi @cjac , could you please update to work with latest dask-rapids v22.12?

cjac commented 1 year ago

Not as of my last check. What versions are you pinning to?

skirui-source commented 1 year ago

@cjac

root@test-dataproc-rapids-dask-m:/# conda list ^cu
# packages in environment at /opt/conda/miniconda3:
#
# Name                    Version                   Build  Channel
cucim                     22.04.00        cuda_11_py38_g8dfed80_0    rapidsai
cuda-python               11.8.1           py38h241159d_2    conda-forge
cudatoolkit               11.2.72              h2bc3f7f_0    nvidia
cudf                      22.04.00        cuda_11_py38_g8bf0520170_0    rapidsai
cudf_kafka                22.04.00        py38_g8bf0520170_0    rapidsai
cugraph                   22.04.00        cuda11_py38_g58be5b53_0    rapidsai
cuml                      22.04.00        cuda11_py38_g95abbc746_0    rapidsai
cupy                      9.6.0            py38h177b0fd_0    conda-forge
cupy-cuda115              10.6.0                   pypi_0    pypi
curl                      7.86.0               h7bff187_1    conda-forge
cusignal                  22.04.00        py39_g06f58b4_0    rapidsai
cuspatial                 22.04.00        py38_ge8f9f84_0    rapidsai
custreamz                 22.04.00        py38_g8bf0520170_0    rapidsai
cuxfilter                 22.04.00        py38_gf251a67_0    rapidsai

root@test-dataproc-rapids-dask-m:/# conda list ^das
# packages in environment at /opt/conda/miniconda3:
#
# Name                    Version                   Build  Channel
dask                      2022.3.0           pyhd8ed1ab_1    conda-forge
dask-bigquery             2022.5.0           pyhd8ed1ab_0    conda-forge
dask-core                 2022.3.0           pyhd8ed1ab_0    conda-forge
dask-cuda                 22.04.00                 py38_0    rapidsai
dask-cudf                 22.04.00        cuda_11_py38_g8bf0520170_0    rapidsai
dask-glm                  0.2.0                      py_1    conda-forge
dask-ml                   2022.5.27          pyhd8ed1ab_0    conda-forge
dask-sql                  2022.8.0           pyhd8ed1ab_0    conda-forge
dask-yarn                 0.9              py38h578d9bd_2    conda-forge

I was wondering if you can upgrade rapids.sh to install the latest rapids v22.12? Or is there a reason not to? (PS: I am aware you recently upgraded to 22.10, which I have yet to test.)

cjac commented 1 year ago

> Hi @cjac , could you please update to work with latest dask-rapids v22.12?

I'm about to go on vacation, and I'm trying to put projects down. Can you open a new issue or, better yet, a GCP support case so I don't lose track of the work item, please?

This issue is about the action not working. I think it's working now, but it is not yet updated to the latest release. A separate issue would be appropriate.