Closed jnnks closed 2 years ago
This is weird because it even says at the beginning that the app was compiled and defined. What happens if you do `mix deps.compile xla`? What is in "_build/prod/lib/xla"?
What happens if you do mix deps.compile xla?
nothing, no output
What is in "_build/prod/lib/xla"?
see below
What is in "_build/prod/lib/xla"?
Nothing after the first compilation. Only after the second time do contents appear, including `_build/prod/lib/xla/ebin/xla.app`.
Similar situation with `mix deps.compile xla`: no error, and `_build/prod/lib/xla/ebin/xla.app` exists afterwards.
@jnnks can you please try this:
rm -rf _build
rm -rf deps
mix deps.get
XLA_BUILD=true MIX_ENV=prod mix deps.compile xla
tree _build/prod/lib/xla
XLA_BUILD=true MIX_ENV=prod mix deps.compile exla
tree _build/prod/lib/xla
I suspect the exla compilation is the one erasing it somehow.
For some reason the first `mix deps.compile xla` does not complete, but the second does. (Mix 1.13.4)
Ok, I missed some deps, sorry! It should have been this instead:
rm -rf _build
rm -rf deps
mix deps.get
XLA_BUILD=true MIX_ENV=prod mix deps.compile elixir_make xla
tree _build/prod/lib/xla
XLA_BUILD=true MIX_ENV=prod mix deps.compile complex nx exla
tree _build/prod/lib/xla
maybe complex is not required… but I think XLA will be there on both runs.
`_build/prod/lib/xla/ebin/xla.app` is present both times.
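To pin down which step removes the file, each command from the instructions above can be followed by an existence check. A minimal sketch, assuming a POSIX shell; `run_and_check` and the `APP` variable are hypothetical names, not part of Mix or EXLA:

```shell
# Path reported in this thread; override APP to check a different file.
APP=${APP:-_build/prod/lib/xla/ebin/xla.app}

# Run the given command, then report whether $APP exists afterwards.
# Returns the exit status of the command itself.
run_and_check() {
  "$@"
  status=$?
  if [ -e "$APP" ]; then
    echo "present after: $*"
  else
    echo "missing after: $*"
  fi
  return "$status"
}

# Usage (commands taken from the steps above):
#   run_and_check env XLA_BUILD=true MIX_ENV=prod mix deps.compile elixir_make xla
#   run_and_check env XLA_BUILD=true MIX_ENV=prod mix deps.compile complex nx exla
```

The first step that prints "missing after" following a "present after" would be the one deleting the file.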
So when does it disappear?!?! Only on “mix compile”?
Seems like the problem only appears when building XLA from scratch. All the other times a cached archive has been used. Could that play a role?
Sounds like it, but I was hoping the instructions above could reproduce it. If you finally run `mix compile` at the end of the last instructions, is that when `xla.app` finally disappears?
Nope, still there :) I'll let the full build run later with a directory watcher to see if the file ever existed
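A directory watcher for this purpose can be as simple as a polling loop. A minimal sketch, assuming a POSIX shell; `watch_presence` is a hypothetical helper, not the tool used in this thread. Polling can miss a file that is created and deleted between two polls, so an inotify-based tool is the more reliable choice where available:

```shell
# Poll PATH and print a line whenever it appears or disappears.
# Usage: watch_presence PATH [INTERVAL_SECONDS] [MAX_POLLS]
# MAX_POLLS=0 (the default) means poll until interrupted.
# The body runs in a subshell, so its variables stay local.
watch_presence() (
  path=$1
  interval=${2:-1}
  max_polls=${3:-0}
  prev=unknown
  polls=0
  while [ "$max_polls" -eq 0 ] || [ "$polls" -lt "$max_polls" ]; do
    if [ -e "$path" ]; then cur=present; else cur=missing; fi
    if [ "$cur" != "$prev" ]; then
      echo "$(date +%T) $cur $path"
      prev=$cur
    fi
    polls=$((polls + 1))
    sleep "$interval"
  done
)

# Usage: start the watcher in the background, then run the full build.
#   watch_presence _build/prod/lib/xla/ebin/xla.app 1 &
#   XLA_BUILD=true MIX_ENV=prod mix compile
```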
Schrodinger's xla.app. 😄
Thank you for digging deeper!
Looks like it was in fact deleted during the build process.
~The inotify logs are very long, so I am not posting it in here, but can attach it somewhere if necessary.~ See below
Awesome @jnnks! Can you please post the 100 entries before and after the DELETE?
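Extracting that window from a long log can be done with grep's context options. A minimal sketch, assuming the watcher log is a plain-text file with one event per line and the literal word `DELETE` on deletion events; `delete_context` and the file name `inotify.log` are hypothetical:

```shell
# Print every DELETE event with N lines of context on each side (default 100).
# -n prefixes each line with its line number in the log.
delete_context() {
  grep -n -B "${2:-100}" -A "${2:-100}" 'DELETE' "$1"
}

# Usage:
#   delete_context inotify.log        # 100 entries before and after
#   delete_context inotify.log 20     # narrower window
```

In the output, matching lines are marked with `:` after the line number and context lines with `-`.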
Here are the entire logs :D
1st Run: https://gist.github.com/jnnks/88f2cda21064d0bb109a42ec4b701cb2 (`DELETE` is at line 797)
2nd Run: https://gist.github.com/jnnks/ad8a25419b3d84a6cef83b9892a926e3
@jonatanklosko so this is caused by the explicit `deps.compile xla` alias inside EXLA. Do you remember why it is needed?
https://github.com/elixir-nx/nx/blob/2769f4a91ca9737b2d2ecbafb94671ad08ba1499/exla/mix.exs#L26-L29
Without that, `xla` is compiled once and changing `XLA_TARGET` has no effect, because the Makefile doesn't run again.
I think we will have to remove the `xla_build?` check and tell users that setting it to true requires an explicit call to `mix deps.compile xla`. Another option is to move to using `config :xla, :force_build, true | false`, because then we can at least encode it via `compile_env`, which can warn/raise if you change it and don't recompile. But for now I would go with docs only. WDYT?
The config would only handle `XLA_BUILD` changing, but what if `XLA_TARGET` changes?
Updating the docs sounds good, though this change may cause some confusion for people already relying on `XLA_BUILD`.
The issue is only with `mix deps.compile xla`, and we only call it when `XLA_BUILD` is set. I will send a PR to make sure we are on the same page. :)
The EXLA build fails with:
Full Log
```
$ XLA_BUILD=true MIX_ENV=prod mix compile
==> xla
Compiling 2 files (.ex)
Generated xla app
rm -f /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/tensorflow/compiler/xla/extension && \
ln -s "/workspaces/exla_compile_test/deps/xla/extension" /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/tensorflow/compiler/xla/extension && \
cd /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e && \
bazel build --define "framework_shared_object=false" -c opt //tensorflow/compiler/xla/extension:xla_extension && \
mkdir -p /root/.cache/xla/0.3.0/cache/build/ && \
cp -f /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/bazel-bin/tensorflow/compiler/xla/extension/xla_extension.tar.gz /root/.cache/xla/0.3.0/cache/build/xla_extension-x86_64-linux-cpu.tar.gz
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'build' from /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/.bazelrc:
  Inherited 'common' options: --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/.bazelrc:
  'build' options: --define framework_shared_object=true --java_toolchain=@tf_toolchains//toolchains/java:tf_java_toolchain --host_java_toolchain=@tf_toolchains//toolchains/java:tf_java_toolchain --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --enable_platform_specific_config --define=with_xla_support=true --config=short_logs --config=v2 --define=no_aws_support=true --define=no_hdfs_support=true --experimental_cc_shared_library --deleted_packages=tensorflow/compiler/mlir/tfrt,tensorflow/compiler/mlir/tfrt/benchmarks,tensorflow/compiler/mlir/tfrt/jit/python_binding,tensorflow/compiler/mlir/tfrt/jit/transforms,tensorflow/compiler/mlir/tfrt/python_tests,tensorflow/compiler/mlir/tfrt/tests,tensorflow/compiler/mlir/tfrt/tests/analysis,tensorflow/compiler/mlir/tfrt/tests/jit,tensorflow/compiler/mlir/tfrt/tests/lhlo_to_tfrt,tensorflow/compiler/mlir/tfrt/tests/tf_to_corert,tensorflow/compiler/mlir/tfrt/tests/tf_to_tfrt_data,tensorflow/compiler/mlir/tfrt/tests/saved_model,tensorflow/compiler/mlir/tfrt/transforms/lhlo_gpu_to_tfrt_gpu,tensorflow/core/runtime_fallback,tensorflow/core/runtime_fallback/conversion,tensorflow/core/runtime_fallback/kernel,tensorflow/core/runtime_fallback/opdefs,tensorflow/core/runtime_fallback/runtime,tensorflow/core/runtime_fallback/util,tensorflow/core/tfrt/common,tensorflow/core/tfrt/eager,tensorflow/core/tfrt/eager/backends/cpu,tensorflow/core/tfrt/eager/backends/gpu,tensorflow/core/tfrt/eager/core_runtime,tensorflow/core/tfrt/eager/cpp_tests/core_runtime,tensorflow/core/tfrt/fallback,tensorflow/core/tfrt/gpu,tensorflow/core/tfrt/run_handler_thread_pool,tensorflow/core/tfrt/runtime,tensorflow/core/tfrt/saved_model,tensorflow/core/tfrt/saved_model/tests,tensorflow/core/tfrt/tpu,tensorflow/core/tfrt/utils
INFO: Found applicable config definition build:short_logs in file /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:linux in file /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/.bazelrc: --copt=-w --host_copt=-w --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++14 --host_cxxopt=-std=c++14 --config=dynamic_kernels --distinct_host_configuration=false --experimental_guard_against_concurrent_changes
INFO: Found applicable config definition build:dynamic_kernels in file /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
Loading:
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/tensorflow/runtime/archive/c3e082762b7664bbc7ffd2c39e86464928e27c0c.tar.gz failed: class com.google.devtools.build.lib.bazel.repository.downloader.UnrecoverableHttpException GET returned 404 Not Found
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Analyzing: target //tensorflow/compiler/xla/extension:xla_extension (1 packages loaded, 0 targets configured)
DEBUG: Rule 'io_bazel_rules_docker' indicated that a canonical reproducible form can be obtained by modifying arguments shallow_since = "1596824487 -0400"
DEBUG: Repository io_bazel_rules_docker instantiated at:
  /root/.cache/xla_extension/tf-3f878cff5b698b82eea85db2b60d65a2e320850e/WORKSPACE:23:14: in
```
Happens with
`{:exla, "~> 0.2"}` on a new project. The compilation seems to work fine though. XLA service is initialized and StreamExecutor can find a device. No error is raised for subsequent compiles.