elixir-nx / xla

Pre-compiled XLA extension
Apache License 2.0

NOT_FOUND: could not find registered transfer manager for platform Host -- check target linkage #77

Closed joelberkeley closed 4 months ago

joelberkeley commented 4 months ago

I'm unable to get a local client on recent versions of XLA (I believe any version after 0.3.0, and definitely 0.6.0).

2024-02-22 14:42:37.696111: F xla/client/client_library.cc:129] Non-OK-status: client_status.status() status: NOT_FOUND: could not find registered transfer manager for platform Host -- check target linkage
Aborted

I've created the following MWE, which should reproduce it on an M1 Mac. I'm seeing the same problem in GitHub Actions, so I'm confident you would also see it on an Ubuntu machine with the x86_64-linux-gnu-cpu binary.

FROM ubuntu

RUN apt update && apt install -y curl build-essential

WORKDIR mwe

RUN curl -s -L https://github.com/elixir-nx/xla/releases/download/v0.6.0/xla_extension-aarch64-linux-gnu-cpu.tar.gz | tar xz

COPY mwe.cpp .

RUN g++ mwe.cpp -o mwe -Ixla_extension/include/ -Lxla_extension/lib/ -lxla_extension

CMD LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(pwd)/xla_extension/lib/ ./mwe

where mwe.cpp is

#include "xla/client/client_library.h"

int main(void) {
  xla::ClientLibrary::LocalClientOrDie();
}

jonatanklosko commented 4 months ago

Hey @joelberkeley, in this repo we precompile the subset of the XLA library that is necessary specifically for our use case in Elixir (exla). The error you are getting may be because a certain part of XLA is missing. We are happy to support other use cases, as long as they don't increase the size of the precompiled binary too much. However, we can't really invest time in digging into such issues ourselves.

joelberkeley commented 4 months ago

OK, that's fair, though I might emphasise that this used to work on 0.3.0.

BTW how sure are you that this is the problem? I'm finding it difficult to debug this, so I want to be as clear as possible.

jonatanklosko commented 4 months ago

> OK, that's fair, though I might emphasise that this used to work on 0.3.0.

The things we include have changed over time, and XLA itself keeps evolving, so the contents of various Bazel packages have shifted. At one point XLA was extracted from TensorFlow into a separate project.

> BTW how sure are you that this is the problem? I'm finding it difficult to debug this so want to be as clear as possible

I am definitely not sure; it could be something with the environment as well. I mostly wanted to emphasise that we don't include the whole of XLA and only test the functionality we use in the other project, so missing pieces are to be expected.

jonatanklosko commented 4 months ago

I think there are four options:

(a) wrong API usage; I don't know the internals, perhaps ClientLibrary expects something else to have been created/registered
(b) something specific to the environment itself
(c) a part of XLA missing in our binaries (expected, not really a bug)
(d) a bug in the XLA source
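
To illustrate how options (a) and (c) could produce a "could not find registered ..." error, here is a minimal, self-contained sketch of the general static-registration pattern that the error message hints at. This is not XLA's actual code: `Manager`, `Registry`, `Register`, `Lookup` and `Describe` are hypothetical names made up for the example. The assumption is only that a per-platform component registers itself in a global table at start-up, and that a lookup fails if the registering code never made it into the binary.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a per-platform service (e.g. a transfer manager).
struct Manager {
  std::string platform;
};

using Factory = std::function<std::unique_ptr<Manager>()>;

// A function-local static sidesteps static-initialization-order problems:
// the map is constructed the first time any registration runs.
std::map<std::string, Factory>& Registry() {
  static std::map<std::string, Factory> registry;
  return registry;
}

// Stores a factory under the platform name. Returns true so it can be used
// to initialize a static variable.
bool Register(const std::string& platform, Factory factory) {
  assert(factory != nullptr);
  Registry()[platform] = std::move(factory);
  return true;
}

// Returns nullptr when nothing was registered for the platform.
std::unique_ptr<Manager> Lookup(const std::string& platform) {
  auto it = Registry().find(platform);
  if (it == Registry().end()) return nullptr;
  return it->second();
}

// In a real project this line would typically live in its own source file.
// If that file is left out of the build, or dropped by the linker, the
// registration never runs and lookups for "Host" fail at runtime.
[[maybe_unused]] static const bool host_registered =
    Register("Host", [] { return std::make_unique<Manager>(Manager{"Host"}); });

// Produces a status string for a lookup (message format is made up).
std::string Describe(const std::string& platform) {
  std::unique_ptr<Manager> manager = Lookup(platform);
  if (!manager) {
    return "NOT_FOUND: no manager registered for platform " + platform;
  }
  return "OK: found manager for platform " + manager->platform;
}
```

With the registration line present, `Describe("Host")` reports success; for any platform that was never registered, such as `Describe("CUDA")` here, it reports NOT_FOUND. Whether the XLA error in this issue arises the same way is an assumption, but it matches the "check target linkage" hint in the message.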

joelberkeley commented 1 month ago

btw I resolved this by building XLA myself and moving to PJRT