[Proposal] Deprecating C codegen facility of Treelite + pivoting the project scope

hcho3 commented 1 year ago

Since its first inception in November 2017, Treelite project has undergone many changes. So has the machine learning ecosystem. Over the last few years, many prediction runtimes have emerged that can run prediction efficiently with decision tree ensembles, such as the FIL backend for Triton and the ONNX runtime. In many use cases, the runtimes exhibit superior performance compared to the C code generated by Treelite. The runtimes also offer better user experience, as users no longer need to invoke the C compiler to convert the generated C code into binaries, an often time-consuming step.

In addition, I no longer have much bandwidth to maintain or improve the C codegen facility.

Consequently, I propose paring down the scope of the Treelite project to what it currently does its very best: a universal, lightweight exchange format for all tree models.

Proposal:

The C codegen facility will be deprecated and eventually be removed. Removing the C codegen will reduce the size of Treelite code base significantly and make it easier for future contributors to maintain it.
The Treelite runtime will be phased out, since Treelite would no longer offer a dedicated runtime for compiled binaries.
Treelite will offer the model parsers (loaders), as well as GTIL (General Tree Inference Library). GTIL will continue to serve as a reference implementation for tree prediction.
Downstream libraries will continue to use the Treelite exchange format to store and transmit tree ensemble models.

Alternative Solution. Move the C codegen facility to a separate project. For this scenario, I'm keen to pass the ownership of the new project to someone else, as I don't have time to work on the codegen.

Comments will be highly appreciated.

cc @wphicks

hcho3 commented 1 year ago

@edwardwliu Your example notebook currently uses export_lib function, which this proposal proposes to deprecate. The notebook can be updated straightforwardly to use the treelite.gtil module instead. Besides, the features you requested (predict_leaf and predict_bytree) will be added to treelite.gtil, not export_lib. (I'm working on these features this month; stay tuned) I am more than happy to assist you further in migrating your use case.

hcho3 commented 1 year ago

@getumen @dovahcrow @Earthson Pinging you here, since you create Go and Rust bindings for Treelite. If you'd like to take ownership of the C codegen part of Treelite, let me know. (See Alternative Solution.)

hcho3 commented 1 year ago

Given the lack of explicit objection, I will proceed with the proposal. See #465

dovahcrow commented 1 year ago

Given the lack of explicit objection, I will proceed with the proposal. See #465

Sure, go ahead. Unfortunately, I don't have the bandwidth right now to review changes and upgrade the Rust lib. However, I would happily migrate to the new lib / API later once I have time.

getumen commented 1 year ago

@hcho3 Thank you for your notification. I don't have enough C language experiments. However, I would happily migrate to the new API later.

getumen commented 1 year ago

I try to use FIL in Go. FIL on GPU is two times slower than treelite in my workload. https://github.com/getumen/cuml

hcho3 commented 1 year ago

@getumen My colleague alerted me of your Go binding for RAPIDS cuML. Nicely done!

Some remarks:

There have been performance improvements in the latest version of FIL. The latest version is 23.08. Is there anything we can help you to adopt the latest cuML?
RAPIDS team is working on a next generation of FIL which is faster than the current FIL. It is provided under the experimental namespace. The next-generation FIL includes an CPU algorithm. For smaller workloads, it is often faster to run prediction on CPU to avoid the latency overhead of data copy to and from GPU memory.

getumen commented 1 year ago

@hcho3 Thank you for your information! I try to update cuml version. I also try to increase batch size in my workload.

getumen commented 1 year ago

@hcho3 I have a trouble in building cuML 23.08.00. When I build cuML 23.08.00, segmentation fault occurs in building RAFT. I'm building cuML in Windows Docker Desktop. I use nvidia/cuda:11.8.0-devel-ubuntu22.04 as base image and the Dockerfile is the following. Do I need any additional setup to build cuML?

FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive

RUN sed -i -r 's@http://(jp\.)?archive\.ubuntu\.com/ubuntu/?@http://ftp.jaist.ac.jp/pub/Linux/ubuntu/@g' /etc/apt/sources.list

ARG TREELITE_VERSION=3.2.0
ARG CUML_VERSION=v23.08.00

RUN apt-get update \
    && apt-get install -y \
    sudo \
    vim \
    less \
    git \
    wget \
    libssl-dev \
    autoconf \
    libtool \
    devscripts \
    debhelper \
    libblas-dev \
    liblapack-dev \
    zlib1g \
    cython3 \
    cuda-toolkit-11-8 \
    clang \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

RUN wget https://github.com/Kitware/CMake/releases/download/v3.27.3/cmake-3.27.3.tar.gz \
    && tar -zxf cmake-3.27.3.tar.gz \
    && cd cmake-3.27.3 \
    && ./bootstrap \
    && make -j$(nproc) \
    && make install \
    && cd .. \
    && rm -rf cmake-3.27.3 cmake-3.27.3.tar.gz

# install ucx
RUN git clone --recursive https://github.com/openucx/ucx.git \
    && cd ucx \
    && ./autogen.sh \
    && ./contrib/configure-release \
    && make -j$(nproc) install \
    && cd .. \
    && rm -rf ucx

# install nccl for gtx 1080
RUN git clone https://github.com/NVIDIA/nccl.git \
    && cd nccl \
    && make -j$(nproc) pkg.debian.build NVCC_GENCODE="-gencode=arch=compute_61,code=sm_61" \
    && dpkg -i build/pkg/deb/* \
    && cd .. \
    && rm -rf nccl

# ref. https://github.com/rapidsai/cuml/issues/2528#issuecomment-656847070
RUN git clone https://github.com/rapidsai/cuml.git -b ${CUML_VERSION} \
    && cd cuml/cpp \
    && mkdir build \
    && cd build \
    && cmake .. \
    -DBUILD_CUML_TESTS=OFF \
    -DBUILD_PRIMS_TESTS=OFF \
    -DENABLE_CUMLPRIMS_MG=OFF \
    -DBUILD_CUML_EXAMPLES=OFF \
    -DBUILD_CUML_BENCH=OFF \
    -DDETECT_CONDA_ENV=OFF \
    && make install -j$(nproc) \
    && cd ../../.. \
    && rm -r cuml

Segmentation fault
make[2]: *** [CMakeFiles/raft_lib.dir/build.make:781: CMakeFiles/raft_lib.dir/src/neighbors/detail/cagra/search_multi_cta_float_uint32_dim1024_t32.cu.o] Error 139
make[2]: *** Waiting for unfinished jobs....

Segmentation fault
make[2]: *** [_deps/raft-build/CMakeFiles/raft_lib.dir/build.make:1156: _deps/raft-build/CMakeFiles/raft_lib.dir/src/neighbors/detail/ivf_pq_compute_similarity_float_float.cu.o] Error 139
make[2]: *** Waiting for unfinished jobs....

hcho3 commented 1 year ago

I was able to run the docker build with the Dockerfile you posted. It seems like you might be running out of memory. Can you reduce parallelism -j to a reasonable number?

If the issue persists, let's post a new issue in the cuML repository. Our team will investigate.

getumen commented 1 year ago

@hcho3 Thank you for your advice! Reducing parallelism solves segmentation fault. However, another error was detected in the compilation of src/arima/batched_arima.cu. I posted a issue about that. https://github.com/rapidsai/cuml/issues/5566

dmlc / treelite

[Proposal] Deprecating C codegen facility of Treelite + pivoting the project scope #438