LLNL / lbann

Livermore Big Artificial Neural Network Toolkit
http://software.llnl.gov/lbann/

How to generate prototext? #1677

Open gitosu67 opened 3 years ago

gitosu67 commented 3 years ago

I am trying to run the sample file provided in the repo: https://github.com/LLNL/lbann/blob/develop/applications/vision/lenet.py using the command: mpiexec lbann --model=lenet.prototext --reader=https://github.com/LLNL/lbann/tree/develop/applications/vision/data/mnist/data_reader.prototext. Now I want to generate lenet.prototext from the given lenet.py. Is this possible, or am I missing something here? I just want to train the provided LeNet on the MNIST dataset.

If I just run python3 lenet.py, I get an error: RuntimeError: could not detect job scheduler.

timmoon10 commented 3 years ago

Try replacing the lbann.contrib.launcher.run call with lbann.proto.save_prototext:

https://github.com/LLNL/lbann/blob/9c94701e30b83a76c252e1a0b4df97b2b7d11021/python/lbann/proto.py#L7

Something like:

lbann.proto.save_prototext(prototext_file,
                           trainer=trainer,
                           model=model,
                           data_reader=data_reader,
                           optimizer=opt)

The Python frontend assumes you are running LBANN on a system that uses SLURM or LSF job managers. We should add a fallback for MPI.
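
For reference, a minimal sketch of what that change could look like at the bottom of lenet.py (the variable names follow the ones already constructed in the script, and 'lenet.prototext' is an arbitrary output path chosen for this example):

# Before (roughly): submits a batch job via the detected scheduler,
# which is what raises "could not detect job scheduler" off-cluster.
#   lbann.contrib.launcher.run(trainer, model, data_reader, opt,
#                              job_name=args.job_name, **kwargs)

# After: serialize the experiment to a single prototext file instead.
lbann.proto.save_prototext('lenet.prototext',
                           trainer=trainer,
                           model=model,
                           data_reader=data_reader,
                           optimizer=opt)

You can then pass the generated file to the lbann executable, e.g. mpiexec lbann --prototext=lenet.prototext.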

gitosu67 commented 3 years ago

@timmoon10 Yes, that works. I generated exp.prototext by adding the call above to the provided lenet.py, and I am now running LBANN as: mpiexec lbann --prototext=exp.prototext.

The training seems to run, but it has been stuck here for an hour now:

[0] Epoch : stats formated [tr/v/te] iter/epoch = [844/94/157]
global MB = [  64/  64/  64] global last MB = [  48  /  48  /  16  ]
local MB = [  64/  64/  64]  local last MB = [  48+0/  48+0/  16+0]

Is this expected, or am I doing something wrong? I am not using a GPU in this case, but there is no progress indicator of any sort, so I am not sure whether the model is actually training.

timmoon10 commented 3 years ago

An hour seems really excessive for LeNet. I suspect something is hanging. It's odd, since it should just run with one MPI rank if you don't pass in extra arguments.

Can you add lbann.CallbackDebug at the following line?

https://github.com/LLNL/lbann/blob/1b1e3198853566f7417a1dd2477d2e6c4217e6e7/applications/vision/lenet.py#L79

This callback prints a message at the beginning and end of every layer's execution, which can give us an idea of where it's hanging.
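
As a rough sketch, the callbacks argument at that line would become something like the following (the first three callbacks are the ones the script already configures there to print the model description, metrics, and times; only the last one is new):

callbacks=[lbann.CallbackPrintModelDescription(),
           lbann.CallbackPrint(),
           lbann.CallbackTimer(),
           lbann.CallbackDebug()]  # added: reports as each layer starts and finishes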

gitosu67 commented 3 years ago

@timmoon10 @benson31 I noticed that line is already there in the file. I have attached my log, which contains everything that gets printed to the console. I terminated the process because it gets stuck after starting the epoch and takes too long. log.txt

Another question: how do I run the LBANN framework on a GPU? I have now installed it using: spack install lbann+cuda+nccl~al and loaded the CUDA modules, but when I run with the prototext file it does not seem to use CUDA, since the epochs still take a long time. Is there anything else that needs to be done for the process to run on a GPU?

timmoon10 commented 3 years ago

I don't see the debug callback in the log. At the line I gave you, we configure the model with three callbacks to print the model description, metrics, and times. Can you add a fourth callback (lbann.CallbackDebug) to the list? Also, it looks like you're running three instances of LBANN at the same time, each one running with 1 MPI rank? I don't think it should cause problems (other than mangling the log file), but I'm wondering if something is misconfigured.

When we move on to running with GPUs, can you try building with cuDNN and Aluminum enabled? cuDNN is required for GPU support and Aluminum is highly recommended for GPU communication.

gitosu67 commented 3 years ago

@timmoon10 To run on a GPU I am building with: spack install lbann@0.101+cuda+nccl ^aluminum+cuda@0.4.0 ^hydrogen+cuda@1.4.0 ^conduit~fortran ^hwloc@1.9

After installing and loading the modules using spack load lbann-(packagename) (also loading aluminum, hydrogen, cuda, and cudnn via spack), I get a different error:


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
terminate called after throwing an instance of 'hydrogen::CUDAError'
  what():  Assertion
    h_check_cuda_error_code__ == cudaSuccess
in function
    void hydrogen::gpu::SetDevice(int)
failed!
{
    File: /tmp/pbstmp.11449746/jrodmanu/spack-stage/spack-stage-hydrogen-1.4.0-csf4wubjm674hlj6emasox7emmnwbqfl/spack-src/src/hydrogen/device/CUDA.cpp
    Line: 91
    Mesg: CUDA error detected in command: "cudaSetDevice(device_id)"
    Error Code: 3
    Error Name: cudaErrorInitializationError
    Error Mesg: initialization error
}

timmoon10 commented 3 years ago

I'm not too familiar with the build system and the main developer is on vacation for the rest of the week, but I'll give it a shot. Is your Spack environment or system environment configured correctly? It looks like CUDA is picking up the wrong Nvidia driver, so maybe you need to add /usr/lib or /usr/lib64 to your LD_LIBRARY_PATH before running LBANN. If that doesn't fix it, I would try getting a simple "hello world" CUDA program to work with the CUDA installation in your Spack environment.

Pinging @benson31 and @bvanessen.

gitosu67 commented 3 years ago

I did, but I am still getting the same error. It might be a version problem; this is what shows up when I do module list:

1) xalt/latest                                       9) lbann-0.101-gcc-8.4.0-3uaigsn
  2) gcc-compatibility/8.4.0                          10) hydrogen-1.4.0-gcc-8.4.0-smil2g7
  3) intel/19.0.5                                     11) hydrogen-1.4.0-gcc-8.4.0-csf4wub
  4) modules/sp2020                                   12) hydrogen-1.4.0-gcc-8.4.0-cssa7de
  5) lbann-0.101-gcc-8.4.0-pdo7mw4                    13) openmpi/4.0.3
  6) hydrogen-1.4.0-gcc-8.4.0-vhazpqq                 14) aluminum-0.4.0-gcc-8.4.0-rrqoi7d
  7) aluminum-0.4.0-gcc-8.4.0-vpq3wyz                 15) nccl-2.7.8-1-gcc-8.4.0-47lyinw
  8) cudnn-8.0.4.30-11.0-linux-x64-gcc-8.4.0-n2fy4nf  16) cuda/10.2.89

I have experimented with different versions of lbann, so there are quite a few modules for it, and I have loaded all of them. Could that be causing a problem?

I tested a sample 'hello world' in CUDA and it works:

#include <cstdio>

__global__ void cuda_hello(){
    printf("Hello World from GPU!\n");
}

int main() {
    cuda_hello<<<1,1>>>();
    // Flush device-side printf output before exiting; without this the
    // kernel may not have run (or printed) by the time main returns.
    cudaDeviceSynchronize();
    return 0;
}

timmoon10 commented 3 years ago

Your setup looks sensible to me. In my workflow I build the dependencies in a Spack environment and build LBANN with CMake, and I just need to load one modulefile before running LBANN:

. ${spack_root}/share/spack/setup-env.sh
spack env activate -p lbann-dev-power9le
module use ${module_dir}
module load lbann-0.102.0

It's different from your setup, since I'm using a modulefile produced by LBANN rather than by Spack, so I'm not sure how applicable this is.

If we want to wait on debugging the GPU build issues until @benson31 gets back, we can try working out the hang in the non-GPU version instead.