Please describe the bug
Hi all,
Failed to run a code segment from alpa.test_install; it says: executable is built for device CUDA:0 of type "NVIDIA A100 80GB PCIe"; cannot run it on device CUDA:1 of type "NVIDIA A800 80GB PCIe".
I displayed the log using ALPA_DEBUG_PRINT_AS_STRATEGY=1, but the log is too complex for me to understand.
Here is the detailed log: test.log
Where might the problem come from? Are there any tricks for debugging this?
Thank you all!
Please describe the expected behavior
System information and environment
OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker): docker nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
Python version: 3.8.10
CUDA version: 11.7
NCCL version: 2.9.9
cupy version: 10.6.0
GPU model and memory: 2 * A100 (80GB) GPU on a 2-way Intel machine, main memory is 768GB.
Alpa version: v0.2.3
TensorFlow version: Last commit is c079cc2 from Yonghao Zhuang, Sun Feb 5 18:29:37 2023.
JAX version: 0.3.22
To Reproduce
Run the code snippet below.
See the error quoted above.
Screenshots
Code snippet to reproduce the problem
#!/bin/python
from alpa import parallelize, ShardParallel
from alpa.testing import get_mlp_train_state_and_step

state, batch, train_step = get_mlp_train_state_and_step(batch_size=128,
                                                        hidden_size=128,
                                                        num_layers=4)
p_train_step = parallelize(train_step,
                           method=ShardParallel(num_micro_batches=2))
actual_output = p_train_step(state, batch)
Additional information
Backtrace from GDB: In the function devices_equivalent, device_ordinal_a is not equal to device_ordinal_b.
In frame 1, LocalExecutable::ValidateExecutionOptions, run_options.device_ordinal() is initially -1, indicating it has not been set. It is then assigned run_options.stream()->parent()->device_ordinal(), which is 1 and does not equal build_options_.device_ordinal(), which is 0.
So an error is thrown.
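To make the failing check concrete, here is a minimal Python sketch of the logic described above. This is my own reconstruction for illustration, not the real XLA code: the function names mirror the frames in the backtrace, but the bodies only model the device-ordinal comparison, with device types represented as plain strings.

```python
# Sketch (not the real XLA implementation) of the check that fails here:
# an executable compiled for one device ordinal may only run on a different
# ordinal if the two devices are "equivalent", i.e. report the same type.

def devices_equivalent(device_types, ordinal_a, ordinal_b):
    """Mimics xla::Backend::devices_equivalent by comparing type strings."""
    return device_types[ordinal_a] == device_types[ordinal_b]

def validate_execution_options(device_types, build_ordinal, run_ordinal):
    """Mimics the failing branch in LocalExecutable::ValidateExecutionOptions."""
    if run_ordinal == -1:  # "not set": XLA then takes the stream's device ordinal
        run_ordinal = build_ordinal
    if run_ordinal != build_ordinal and not devices_equivalent(
            device_types, run_ordinal, build_ordinal):
        raise RuntimeError(
            f'executable is built for device CUDA:{build_ordinal} of type '
            f'"{device_types[build_ordinal]}"; cannot run it on device '
            f'CUDA:{run_ordinal} of type "{device_types[run_ordinal]}"')

# With the mismatched type strings from this report, the check fails:
types = {0: "NVIDIA A100 80GB PCIe", 1: "NVIDIA A800 80GB PCIe"}
try:
    validate_execution_options(types, build_ordinal=0, run_ordinal=1)
except RuntimeError as e:
    print(e)
```

Note that if the two ordinals reported identical type strings, the check would pass even across different ordinals; the error in this report only arises because CUDA:0 and CUDA:1 report different device types.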
#0 xla::Backend::devices_equivalent (this=0x3073c40, device_ordinal_a=1, device_ordinal_b=0)
at external/org_tensorflow/tensorflow/compiler/xla/service/backend.cc:193
#1 0x00007ff7a3ed7a20 in xla::LocalExecutable::ValidateExecutionOptions (this=0x32df7b00,
run_options=..., backend=...)
at external/org_tensorflow/tensorflow/compiler/xla/client/local_client.cc:88
#2 0x00007ff7a3ed8177 in xla::LocalExecutable::RunHelper (this=0x32df7b00,
argument_shapes=..., run_options=...)
at external/org_tensorflow/tensorflow/compiler/xla/client/local_client.cc:149
#3 0x00007ff7a3ed99cd in xla::LocalExecutable::RunAsync (this=0x32df7b00,
argument_host_shapes=..., arguments=std::vector of length 0, capacity 0, run_options=...)
at external/org_tensorflow/tensorflow/compiler/xla/client/local_client.cc:292
#4 0x00007ff7a3eda0c0 in xla::LocalExecutable::RunAsync (this=0x32df7b00,
arguments=std::vector of length 0, capacity 0, run_options=...)
at external/org_tensorflow/tensorflow/compiler/xla/client/local_client.cc:332
#5 0x00007ff7a338d871 in xla::PjRtStreamExecutorExecutable::EnqueueExecution(absl::lts_20220623::Span<xla::PjRtBuffer* const>, int, int, int, xla::RunId const&, xla::ExecuteOptions const&, xla::PjRtDevice*, std::vector<xla::PjRtStreamExecutorBuffer::ScopedHold, std::allocator<xla::PjRtStreamExecutorBuffer::ScopedHold> >*, std::shared_ptr<xla::DeviceAssignment>, std::vector<std
::function<void ()>, std::allocator<std::function<void ()> > >&) const (this=0x3270fb10,
argument_handles=..., replica=0, partition=1, executable_idx=0, run_id=..., options=...,
device=0x3199220, device_buffers=0x7ff2bf7fc8e0,
device_assignment=std::shared_ptr<xla::DeviceAssignment> (use count 3, weak count 0) = {...}, compute_callbacks=std::vector of length 0, capacity 0)
at external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2030
#6 0x00007ff7a338f07a in xla::PjRtStreamExecutorExecutable::ExecuteHelper (this=0x3270fb10,
argument_handles=..., replica=0, partition=1, run_id=..., options=...,
fill_future=false, device=0x3199220)
at external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153
#7 0x00007ff7a338fe7f in operator() (__closure=0x347cbe60)
at external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2270
#8 0x00007ff7a3399ee2 in std::__invoke_impl<void, xla::PjRtStreamExecutorExecutable::Execute(absl::lts_20220623::Span<const std::vector<xla::PjRtBuffer*> >, const xla::ExecuteOptions&, std:
:optional<std::vector<xla::PjRtFuture<tsl::Status> > >&)::<lambda()>&>(std::__invoke_other, struct {...} &) (__f=...) at /usr/include/c++/10/bits/invoke.h:60
#9 0x00007ff7a33983ea in std::__invoke_r<void, xla::PjRtStreamExecutorExecutable::Execute(absl::lts_20220623::Span<const std::vector<xla::PjRtBuffer*> >, const xla::ExecuteOptions&, std::
optional<std::vector<xla::PjRtFuture<tsl::Status> > >&)::<lambda()>&>(struct {...} &) ( __fn=...) at /usr/include/c++/10/bits/invoke.h:110
#10 0x00007ff7a33961d3 in std::_Function_handler<void(), xla::PjRtStreamExecutorExecutable::Execute(absl::lts_20220623::Span<const std::vector<xla::PjRtBuffer*> >, const xla::ExecuteOptions&
, std::optional<std::vector<xla::PjRtFuture<tsl::Status> > >&)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/10/bits/std_function.h:291
#11 0x00007ff7a0719526 in std::function<void ()>::operator()() const (this=0x7ff2bf7fddb0)
at /usr/include/c++/10/bits/std_function.h:622
#12 0x00007ff7a3e87436 in xla::WorkerThread::WorkLoop (this=0x3196590)
at external/org_tensorflow/tensorflow/compiler/xla/pjrt/worker_thread.cc:50
#13 0x00007ff7a3e86fcb in operator() (__closure=0x3196880)
at external/org_tensorflow/tensorflow/compiler/xla/pjrt/worker_thread.cc:22
#14 0x00007ff7a3e877d2 in std::__invoke_impl<void, xla::WorkerThread::WorkerThread(tsl::Env*, const string&)::<lambda()>&>(std::__invoke_other, struct {...} &) (__f=...)
at /usr/include/c++/10/bits/invoke.h:60
#15 0x00007ff7a3e876c2 in std::__invoke_r<void, xla::WorkerThread::WorkerThread(tsl::Env*, const string&)::<lambda()>&>(struct {...} &) (__fn=...) at /usr/include/c++/10/bits/invoke.h:110
#16 0x00007ff7a3e875b0 in std::_Function_handler<void(), xla::WorkerThread::WorkerThread(tsl::Env*, const string&)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
at /usr/include/c++/10/bits/std_function.h:291
#17 0x00007ff7a0719526 in std::function<void ()>::operator()() const (this=0x3196880)
at /usr/include/c++/10/bits/std_function.h:622
#18 0x00007ff7ac1ea332 in tsl::(anonymous namespace)::PThread::ThreadFn (
params_arg=0x3196860)
at external/org_tensorflow/tensorflow/tsl/platform/default/env.cc:93
#19 0x00007ff8ff258609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#20 0x00007ff8ff392133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95