elixir-nx / xla

Pre-compiled XLA extension
Apache License 2.0

Vector OOB index error with libstdc++-14 #106

Open feynmanliang opened 2 hours ago

feynmanliang commented 2 hours ago

When I compile with the gcc 14 toolchain and link against libstdc++-14, I get this error when using EXLA as the Nx backend:

I0000 00:00:1731860475.739978      39 tfrt_cpu_pjrt_client.cc:349] TfrtCpuClient created.
/usr/include/c++/14/bits/stl_vector.h:1130: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = long unsigned int; _Alloc = std::allocator<long unsigned int>; reference = long unsigned int&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.

The error disappears when I compile with the gcc 13 toolchain and link against libstdc++-13.
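The assertion in the log is libstdc++'s bounds check in std::vector::operator[]. That check is enabled when _GLIBCXX_ASSERTIONS is defined, for example with -D_GLIBCXX_ASSERTIONS, which GCC 14's -fhardened option also turns on. One possible reading, which this thread does not confirm, is that the gcc 14 build enables these assertions and exposes an out-of-range index that older builds silently tolerated. The sketch below illustrates the check itself. The helper names index_in_bounds, checked_get and glibcxx_assertions_enabled are hypothetical and are not part of EXLA or XLA.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: mirrors the precondition libstdc++ checks in
// std::vector::operator[] when _GLIBCXX_ASSERTIONS is defined ('__n < this->size()').
bool index_in_bounds(const std::vector<unsigned long>& v, std::size_t n) {
    return n < v.size();
}

// Hypothetical helper: bounds-checked read. std::vector::at() always checks
// the index and throws std::out_of_range instead of invoking undefined behavior.
// On failure, 'out' is left unchanged.
bool checked_get(const std::vector<unsigned long>& v, std::size_t n,
                 unsigned long& out) {
    try {
        out = v.at(n);
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}

// Hypothetical helper: reports whether this translation unit was compiled
// with libstdc++ assertions enabled (e.g. -D_GLIBCXX_ASSERTIONS).
constexpr bool glibcxx_assertions_enabled() {
#ifdef _GLIBCXX_ASSERTIONS
    return true;
#else
    return false;
#endif
}
```

Because operator[] is an inline template, the check is compiled into the calling code, not into the shared libstdc++ library. A caller that reads dims[2] from a two-element vector and is built with g++ -D_GLIBCXX_ASSERTIONS aborts with an "Assertion '__n < this->size()' failed" message of the same form as the log above. Without the macro, the same read is undefined behavior and may go unnoticed.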

feynmanliang commented 2 hours ago

Example repro (Dockerfile):

ARG BUILDER_IMAGE="cgr.dev/chainguard/wolfi-base:latest"
ARG RUNNER_IMAGE="cgr.dev/chainguard/wolfi-base:latest"

ARG COMMIT=""

FROM ${BUILDER_IMAGE} AS builder

# install build dependencies
RUN apk add --no-cache \
    bash \
    build-base \
    curl \
    gcc-10 \
    git \
    libstdc++-10 \
    make \
    ncurses-dev \
    nodejs \
    npm \
    openssl-dev \
    perl \
    rust

# BEGIN: remove this section to reproduce the error
# manually compile erlang/elixir against gcc-10
# see https://github.com/elixir-nx/xla/issues/106
ENV CC=gcc-10 \
    CXX=g++-10 \
    LD_LIBRARY_PATH=/usr/lib/gcc/x86_64-pc-linux-gnu/10.5.0
# END: remove this section to reproduce the error

# Install and configure ASDF
ENV ASDF_DIR=/root/.asdf
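# pin asdf: releases from 0.16 onward are a Go rewrite that no longer ships asdf.sh
RUN git clone --depth 1 --branch v0.14.0 https://github.com/asdf-vm/asdf.git $ASDF_DIR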

# Install ASDF plugins and versions
RUN bash -c '. $ASDF_DIR/asdf.sh && \
    asdf plugin add erlang && \
    asdf plugin add elixir'

# Install Erlang and Elixir using ASDF
RUN bash -c '. $ASDF_DIR/asdf.sh && \
    asdf install erlang 25.3.2.15 && \
    asdf install elixir 1.16.3-otp-25 && \
    asdf global erlang 25.3.2.15 && \
    asdf global elixir 1.16.3-otp-25'

ENV PATH="/root/.asdf/installs/erlang/25.3.2.15/bin:${PATH}"
ENV PATH="/root/.asdf/installs/elixir/1.16.3-otp-25/bin:${PATH}"

# prepare build dir
WORKDIR /app

# install hex + rebar
RUN mix local.hex --force && \
  mix local.rebar --force

COPY . .

RUN mix deps.get && \
  mix deps.compile

To reproduce, build this Dockerfile for a Bumblebee/Nx app that uses the EXLA backend, then run iex -S mix -e 'Bumblebee.load_model({:hf, "model-name"})' inside the container.