NVIDIA / spark-rapids-examples

A repo for all spark examples using Rapids Accelerator including ETL, ML/DL, etc.
Apache License 2.0

RAPIDS accelerated UDF examples build environment does not match spark-rapids-jni environment #362

Open jlowe opened 9 months ago

jlowe commented 9 months ago

The Dockerfile used for the RAPIDS accelerated native UDF example build environment uses Ubuntu 18.04, but the build environment that spark-rapids-jni uses for the libcudf.so placed in the RAPIDS Accelerator jar is CentOS 7 + devtoolset. That means code could be crossing the GCC CXX11 ABI streams, leading to failures to find symbols at runtime when trying to load the native UDF shared library, e.g.:

/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin/java: symbol lookup error: /tmp/nativeudfjni8442648179436293266.so: undefined symbol: _ZN4cudf13string_scalarC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEbN3rmm16cuda_stream_viewEPNS9_2mr22device_memory_resourceE

which when run through cu++filt shows this is a failure to find:

cudf::string_scalar::string_scalar(const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> &, bool, rmm::cuda_stream_view, rmm::mr::device_memory_resource *)

The Dockerfile used by the examples should be using the same setup as spark-rapids-jni to avoid this. We should also add a RAPIDS Accelerated native UDF that uses a string_scalar with a std::string argument to help catch this ABI mismatch in the future.
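As a rough illustration of how such a mismatch could be spotted from symbol names alone, here is a minimal Python sketch. It assumes that the Itanium C++ name-mangling markers `7__cxx11` (the `std::__cxx11` inline namespace) and `B5cxx11` (the `[abi:cxx11]` tag) are a good enough signal that a symbol was compiled against the newer GCC string ABI. The helper names `uses_cxx11_abi` and `find_cxx11_symbols` are hypothetical and not part of spark-rapids-examples or spark-rapids-jni.

```python
# Substrings that GCC's mangling emits for the newer (CXX11) std::string ABI.
# Assumption: their presence in a mangled name means the symbol was built
# with _GLIBCXX_USE_CXX11_ABI=1.
_CXX11_MARKERS = ("7__cxx11", "B5cxx11")


def uses_cxx11_abi(mangled_symbol: str) -> bool:
    """Return True if the mangled symbol name references the CXX11 ABI."""
    return any(marker in mangled_symbol for marker in _CXX11_MARKERS)


def find_cxx11_symbols(symbols):
    """Filter an iterable of mangled names down to the CXX11-ABI ones."""
    return [s for s in symbols if uses_cxx11_abi(s)]


if __name__ == "__main__":
    # The undefined symbol from the error message quoted in this issue.
    missing = (
        "_ZN4cudf13string_scalarC1ERKNSt7__cxx1112basic_stringIcSt11char_traits"
        "IcESaIcEEEbN3rmm16cuda_stream_viewEPNS9_2mr22device_memory_resourceE"
    )
    print(uses_cxx11_abi(missing))
```

One possible use is to feed it the undefined symbols of the built native UDF library (for example, names collected from `nm -D --undefined-only` output) and fail the build if any are returned while the target libcudf.so was built with the older ABI.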

sameerz commented 9 months ago

RAPIDS may drop support for CentOS7 in the upcoming release, and has Ubuntu 20.04 as a minimum required version ( https://docs.rapids.ai/install#system-req ). Does that change what we need to do here?

Or do we still need to ensure the Dockerfile used by the examples is using the same setup as spark-rapids-jni, and we need to update the spark-rapids-jni setup to account for the new minimum OS versions?

Ref: https://endoflife.software/operating-systems/linux/red-hat-enterprise-linux-rhel

jlowe commented 9 months ago

> Or do we still need to ensure the Dockerfile used by the examples is using the same setup as spark-rapids-jni, and we need to update the spark-rapids-jni setup to account for the new minimum OS versions?

This. Bottom line is the examples need to build in the same environment as spark-rapids-jni does, regardless of what that environment actually is. Note that we still want to build spark-rapids-jni in a way that allows a single binary to run on all supported OS's, and I'm doubtful we can simply build on Ubuntu 20.04's default toolchain to try to satisfy that requirement.

GaryShen2008 commented 9 months ago

If we plan to update the spark-rapids-jni build setup in 24.04, we can do this issue after changing spark-rapids-jni. Have we already decided to drop CentOS 7 in 24.04? If so, let's file another issue in spark-rapids-jni to decide which environment to use for compiling to meet the requirement of a single binary that runs on all supported OSes.

sameerz commented 9 months ago

> Have we already decided to drop CentOS 7 in 24.04? If so, let's file another issue in spark-rapids-jni to decide which environment to use for compiling to meet the requirement of a single binary that runs on all supported OSes.

It looks like RAPIDS will deprecate CentOS7 in 24.04 and stop support in 24.06, per https://github.com/rapidsai/docs/pull/475

For 24.04 we should make sure the Dockerfile used for the examples matches the one used for spark-rapids-jni (CentOS 7 + devtoolset).

In parallel we should figure out what our minimum toolchain will be so we are ready in 24.06.

GaryShen2008 commented 3 months ago

Hi @YanxuanLiu, is it possible to use the same Dockerfile to build the UDF example as the JNI?

YanxuanLiu commented 3 months ago

> Hi @YanxuanLiu, is it possible to use the same Dockerfile to build the UDF example as the JNI?

Sorry, but I think @NvTimLiu could help on this issue. I haven't dealt with it.

GaryShen2008 commented 3 months ago

Hi @NvTimLiu, can you check if it's possible to use the same Docker image to build the UDF example as the JNI?

NvTimLiu commented 3 months ago

It would be good for CI to build the UDF examples with the same Docker image as the RAPIDS JNI.

We have a dedicated Dockerfile for building the UDF examples: https://github.com/NVIDIA/spark-rapids-examples/blob/main/examples/UDF-Examples/RAPIDS-accelerated-UDFs/Dockerfile

Shall we remove it and document that we build the UDF examples with the RAPIDS JNI Docker image?

NvTimLiu commented 3 months ago

Discussed with Gary: we'll use the same Docker image in the CI job and document the link to the Dockerfile in JNI.

I'll handle it.