LambdaColdStorage / lambda-tensorflow-benchmark

BSD 3-Clause "New" or "Revised" License

I'm not able to run the benchmarks...running into a few issues #25

Open src-r-r opened 1 year ago

src-r-r commented 1 year ago

I'm using the Lambda Labs Docker images (https://github.com/lambdal/lambda-stack-dockerfiles/) to build the image.

This is the command I run:

sudo nvidia-docker run -v /dev/nvidia0:/dev/nvidia0 -v $(pwd)/lambda-tensorflow-benchmark:/dockerx -v $(pwd)/run_nvidia_benchmark.sh:/run_nvidia_benchmark.sh -it --gpus 1 lambda-stack:20.04 /run_nvidia_benchmark.sh

run_nvidia_benchmark.sh contains the following:

#!/bin/bash

export DEBIAN_FRONTEND=noninteractive
export LTB=lambda-tensorflow-benchmark
export WORKDIR=${WORKDIR:-/dockerx}

apt update && apt install -y 'libcudart10.1' kmod
pip install matplotlib tensorflow-gpu==2.3.1

git config --global --add safe.directory ${WORKDIR}
cd ${WORKDIR}

TF_XLA_FLAGS=--tf_xla_auto_jit=2 \
    ./batch_benchmark.sh 1 1 \
    1 100 \
    2 \
    config/config_resnet50_replicated_fp32_train_syn

However, this is the error I get:

Processor-NVIDIA_A100-SXM4-80GB
modinfo: ERROR: Module alias nvidia not found.
Batchsize for VRAM size '80GB' not optimized
--optimizer=sgd --model=resnet50 --num_gpus=1 --batch_size= --variable_update=replicated --distortions=false --num_batches=100 --data_name=imagenet --all_reduce_spec=nccl
2022-11-13 01:52:00.900420: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
FATAL Flags parsing error: flag --batch_size=: invalid literal for int() with base 10: ''
Pass --helpshort or --helpfull to see help on flags.

From digging into the bash code, it seems that $batch_size is set by evaluating $model as a variable name:

  # approximately line 163 of benchmark.sh
  eval batch_size=\$$model

where $model is set by the loop in run_benchmark_all...

run_benchmark_all() {
  for model in $MODELS; do
    for num_gpus in $(seq ${MAX_NUM_GPU} -1 ${MIN_NUM_GPU}); do
      for iter in $(seq 1 $ITERATIONS); do
        run_benchmark
        sleep 10
      done
    done
  done
}
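
That indirection can be reproduced in isolation. In the sketch below the variable names mirror benchmark.sh, but the value (resnet50=256) is illustrative, not the benchmark's actual default:

```shell
#!/bin/bash
# Per-model batch size, normally defined by sourcing the config file
resnet50=256

model=resnet50
eval batch_size=\$$model          # expands to: batch_size=$resnet50
echo "--batch_size=$batch_size"   # -> --batch_size=256

# If the model variable was never defined (e.g. the config wasn't read),
# the eval silently yields an empty string:
model=resnet101                   # no $resnet101 variable exists here
eval batch_size=\$$model
echo "--batch_size=$batch_size"   # -> --batch_size=   (matching the flag in the log)
```

An empty $batch_size then reaches the command line as a bare --batch_size=, which is exactly the "invalid literal for int()" error above.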

So it seems that MODELS is not being read from the config. I'll keep investigating, but if you notice anything I'm doing wrong, please let me know.
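
One way to test that theory is to check what sourcing the config actually defines. The snippet below simulates this with a throwaway config file; the real file is config/config_resnet50_replicated_fp32_train_syn, and the contents here (MODELS plus a per-model batch-size variable) are an assumption about its structure, not its actual contents:

```shell
#!/bin/bash
# Stand-in config for illustration only; the real config presumably
# defines MODELS and one batch-size variable per model
cat > /tmp/demo_config <<'EOF'
MODELS="resnet50"
resnet50=256
EOF

. /tmp/demo_config
echo "MODELS=${MODELS:-<unset>}"       # -> MODELS=resnet50
echo "resnet50=${resnet50:-<unset>}"   # -> resnet50=256
```

Dropping the same two echo lines into batch_benchmark.sh just before run_benchmark_all runs would show whether the config is being sourced at all, e.g. whether the relative config path still resolves after cd ${WORKDIR}.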

I've verified that Docker can detect the GPUs:

$ sudo docker run --gpus 1 lambda-stack:20.04 python3 -c "import torch; print(torch.cuda.device_count())"
1