alpa-projects / alpa

Training and serving large-scale neural networks with auto parallelization.
https://alpa.ai
Apache License 2.0

Fail to run alpa.test_install #946

Closed weinanliu closed 1 year ago

weinanliu commented 1 year ago

Please describe the bug

Hi all,

I failed to run a code segment from alpa.test_install; it errors out with: executable is built for device CUDA:0 of type "NVIDIA A100 80GB PCIe"; cannot run it on device CUDA:1 of type "NVIDIA A800 80GB PCIe".

I printed the log with ALPA_DEBUG_PRINT_AS_STRATEGY=1, but it is too complex for me to understand. The detailed log is attached: test.log

Where might the problem come from? Are there any tricks for debugging this?

Thank you all!
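The error message indicates the node has heterogeneous GPUs (an A100 and an A800), and the compiled executable is pinned to the device type it was built for. A minimal sketch of a homogeneity check, using the device kinds quoted in the error above (the helper name is hypothetical, not part of Alpa or JAX):

```python
def gpus_are_homogeneous(device_kinds):
    """Return True when every visible GPU reports the same device kind."""
    return len(set(device_kinds)) <= 1

# Device kinds as reported in the error message above.
kinds = ["NVIDIA A100 80GB PCIe", "NVIDIA A800 80GB PCIe"]
print(gpus_are_homogeneous(kinds))  # False: mixed GPU types on one node
```

With JAX installed, the same check can be run against the real hardware via `[d.device_kind for d in jax.devices()]`.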

Please describe the expected behavior

System information and environment

To Reproduce

  1. run the code
  2. See error

Screenshots


Code snippet to reproduce the problem

#!/usr/bin/env python
from alpa import parallelize, ShardParallel
from alpa.testing import get_mlp_train_state_and_step

# Build a small MLP training state and step function for the smoke test.
state, batch, train_step = get_mlp_train_state_and_step(batch_size=128,
                                                        hidden_size=128,
                                                        num_layers=4)

# Run the step with intra-operator (shard) parallelism and 2 micro-batches.
p_train_step = parallelize(train_step,
                           method=ShardParallel(num_micro_batches=2))
actual_output = p_train_step(state, batch)

Additional information

Backtrace from GDB: in the function devices_equivalent, we can see that device_ordinal_a (1) is not equal to device_ordinal_b (0).

In frame 1, LocalExecutable::ValidateExecutionOptions, run_options.device_ordinal() is initially -1, meaning it has not been set. It is then assigned run_options.stream()->parent()->device_ordinal(), which is 1 and does not match build_options_.device_ordinal(), which is 0.

So an error is thrown.
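The check described above can be sketched in Python (a simplified model of the logic in local_client.cc; the names mirror the XLA code, but this is not the actual implementation):

```python
def validate_execution_options(run_device_ordinal, stream_device_ordinal,
                               build_device_ordinal):
    """Simplified sketch of LocalExecutable::ValidateExecutionOptions."""
    if run_device_ordinal == -1:
        # Not set explicitly: fall back to the stream's device ordinal.
        run_device_ordinal = stream_device_ordinal
    # Stands in for the Backend::devices_equivalent check.
    if run_device_ordinal != build_device_ordinal:
        raise RuntimeError(
            f"executable is built for device CUDA:{build_device_ordinal}; "
            f"cannot run it on device CUDA:{run_device_ordinal}")
    return run_device_ordinal

try:
    # -1 (unset), stream on device 1, executable built for device 0.
    validate_execution_options(-1, 1, 0)
except RuntimeError as e:
    print(e)  # the ordinal mismatch seen in the backtrace
```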

#0  xla::Backend::devices_equivalent (this=0x3073c40, device_ordinal_a=1, device_ordinal_b=0)                                                                                                 
    at external/org_tensorflow/tensorflow/compiler/xla/service/backend.cc:193                                                                                                                 
#1  0x00007ff7a3ed7a20 in xla::LocalExecutable::ValidateExecutionOptions (this=0x32df7b00,                                                                                                    
    run_options=..., backend=...)                                                                                                                                                             
    at external/org_tensorflow/tensorflow/compiler/xla/client/local_client.cc:88                                                                                                              
#2  0x00007ff7a3ed8177 in xla::LocalExecutable::RunHelper (this=0x32df7b00,                                                                                                                   
    argument_shapes=..., run_options=...)                                                                                                                                                     
    at external/org_tensorflow/tensorflow/compiler/xla/client/local_client.cc:149                                                                                                             
#3  0x00007ff7a3ed99cd in xla::LocalExecutable::RunAsync (this=0x32df7b00,                                                                                                                    
    argument_host_shapes=..., arguments=std::vector of length 0, capacity 0, run_options=...)                                                                                                 
    at external/org_tensorflow/tensorflow/compiler/xla/client/local_client.cc:292                                                                                                             
#4  0x00007ff7a3eda0c0 in xla::LocalExecutable::RunAsync (this=0x32df7b00,                                                                                                                    
    arguments=std::vector of length 0, capacity 0, run_options=...)                                                                                                                           
    at external/org_tensorflow/tensorflow/compiler/xla/client/local_client.cc:332                                                                                                             
#5  0x00007ff7a338d871 in xla::PjRtStreamExecutorExecutable::EnqueueExecution(absl::lts_20220623::Span<xla::PjRtBuffer* const>, int, int, int, xla::RunId const&, xla::ExecuteOptions const&, xla::PjRtDevice*, std::vector<xla::PjRtStreamExecutorBuffer::ScopedHold, std::allocator<xla::PjRtStreamExecutorBuffer::ScopedHold> >*, std::shared_ptr<xla::DeviceAssignment>, std::vector<std
::function<void ()>, std::allocator<std::function<void ()> > >&) const (this=0x3270fb10, 
    argument_handles=..., replica=0, partition=1, executable_idx=0, run_id=..., options=...,                                                                                                  
    device=0x3199220, device_buffers=0x7ff2bf7fc8e0,                                                                                                                                          
    device_assignment=std::shared_ptr<xla::DeviceAssignment> (use count 3, weak count 0) = {...}, compute_callbacks=std::vector of length 0, capacity 0)
    at external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2030             
#6  0x00007ff7a338f07a in xla::PjRtStreamExecutorExecutable::ExecuteHelper (this=0x3270fb10,   
    argument_handles=..., replica=0, partition=1, run_id=..., options=...,                   
    fill_future=false, device=0x3199220)                                                         
    at external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153                                                                                               
#7  0x00007ff7a338fe7f in operator() (__closure=0x347cbe60)                                    
    at external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2270
#8  0x00007ff7a3399ee2 in std::__invoke_impl<void, xla::PjRtStreamExecutorExecutable::Execute(absl::lts_20220623::Span<const std::vector<xla::PjRtBuffer*> >, const xla::ExecuteOptions&, std:
:optional<std::vector<xla::PjRtFuture<tsl::Status> > >&)::<lambda()>&>(std::__invoke_other, struct {...} &) (__f=...) at /usr/include/c++/10/bits/invoke.h:60                                 
#9  0x00007ff7a33983ea in std::__invoke_r<void, xla::PjRtStreamExecutorExecutable::Execute(absl::lts_20220623::Span<const std::vector<xla::PjRtBuffer*> >, const xla::ExecuteOptions&, std::optional<std::vector<xla::PjRtFuture<tsl::Status> > >&)::<lambda()>&>(struct {...} &) (
    __fn=...) at /usr/include/c++/10/bits/invoke.h:110
#10 0x00007ff7a33961d3 in std::_Function_handler<void(), xla::PjRtStreamExecutorExecutable::Execute(absl::lts_20220623::Span<const std::vector<xla::PjRtBuffer*> >, const xla::ExecuteOptions&
, std::optional<std::vector<xla::PjRtFuture<tsl::Status> > >&)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/10/bits/std_function.h:291
#11 0x00007ff7a0719526 in std::function<void ()>::operator()() const (this=0x7ff2bf7fddb0)                                                                                                    
    at /usr/include/c++/10/bits/std_function.h:622                                                                                                                                            
#12 0x00007ff7a3e87436 in xla::WorkerThread::WorkLoop (this=0x3196590)                                                                                                                        
    at external/org_tensorflow/tensorflow/compiler/xla/pjrt/worker_thread.cc:50                                                                                                               
#13 0x00007ff7a3e86fcb in operator() (__closure=0x3196880)                                                                                                                                    
    at external/org_tensorflow/tensorflow/compiler/xla/pjrt/worker_thread.cc:22                                                                                                               
#14 0x00007ff7a3e877d2 in std::__invoke_impl<void, xla::WorkerThread::WorkerThread(tsl::Env*, const string&)::<lambda()>&>(std::__invoke_other, struct {...} &) (__f=...)
    at /usr/include/c++/10/bits/invoke.h:60                                                    
#15 0x00007ff7a3e876c2 in std::__invoke_r<void, xla::WorkerThread::WorkerThread(tsl::Env*, const string&)::<lambda()>&>(struct {...} &) (__fn=...) at /usr/include/c++/10/bits/invoke.h:110
#16 0x00007ff7a3e875b0 in std::_Function_handler<void(), xla::WorkerThread::WorkerThread(tsl::Env*, const string&)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /usr/include/c++/10/bits/std_function.h:291                                                                                                                                            
#17 0x00007ff7a0719526 in std::function<void ()>::operator()() const (this=0x3196880)                                                                                                         
    at /usr/include/c++/10/bits/std_function.h:622                                                                                                                                            
#18 0x00007ff7ac1ea332 in tsl::(anonymous namespace)::PThread::ThreadFn (                                                                                                                     
    params_arg=0x3196860)
    at external/org_tensorflow/tensorflow/tsl/platform/default/env.cc:93                                                                                                                      
#19 0x00007ff8ff258609 in start_thread (arg=<optimized out>) at pthread_create.c:477                                                                                                          
#20 0x00007ff8ff392133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95                    
weinanliu commented 1 year ago

The bug is from JAX/XLA. CUDA_VISIBLE_DEVICES=0 can bypass it.
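Note that CUDA_VISIBLE_DEVICES has to take effect before CUDA is initialized. From inside Python that means setting it before the first jax/alpa import, for example:

```python
import os

# Must run before jax/alpa (and hence CUDA) are first imported, so that
# only GPU 0 (the device the executable is built for) is visible.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Only after this point:
# import alpa  # CUDA now sees a single, homogeneous device
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Alternatively, export the variable in the shell before launching the script.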

https://github.com/google/jax/issues/1325
