Processor-NVIDIA_A100-SXM4-80GB
modinfo: ERROR: Module alias nvidia not found.
Batchsize for VRAM size '80GB' not optimized
--optimizer=sgd --model=resnet50 --num_gpus=1 --batch_size= --variable_update=replicated --distortions=false --num_batches=100 --data_name=imagenet --all_reduce_spec=nccl
2022-11-13 01:52:00.900420: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
FATAL Flags parsing error: flag --batch_size=: invalid literal for int() with base 10: ''
Pass --helpshort or --helpfull to see help on flags.
From digging in the bash code, it seems that $batch_size is derived from $model by indirect expansion:
# approximately line 163 of benchmark.sh
eval batch_size=\$$model
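For context, `eval batch_size=\$$model` is bash indirection: it looks up a variable whose *name* is the value of $model. A minimal sketch of the mechanism (the variable names and values here are illustrative, not taken from benchmark.sh):

```shell
#!/usr/bin/env bash
# Illustrative sketch of the indirection used in benchmark.sh (values made up).
resnet50=192               # per-model batch size, normally defined by the config

model=resnet50
eval batch_size=\$$model   # expands to: batch_size=$resnet50
echo "batch_size=$batch_size"    # -> batch_size=192

# If the per-model variable was never defined (e.g. the config was not loaded),
# the indirection silently yields an empty string instead of failing:
model=vgg16
eval batch_size=\$$model
echo "batch_size='$batch_size'"  # -> batch_size=''
```

That silent empty expansion is exactly what would produce the `--batch_size=` with no value seen in the failing command line above.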
$model in turn is set by the loop in run_benchmark_all:
run_benchmark_all() {
  for model in $MODELS; do
    for num_gpus in $(seq ${MAX_NUM_GPU} -1 ${MIN_NUM_GPU}); do
      for iter in $(seq 1 $ITERATIONS); do
        run_benchmark
        sleep 10
      done
    done
  done
}
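One way to test that hypothesis is to check, just before run_benchmark is invoked, whether $MODELS and the per-model batch-size variable are actually set. A hypothetical debug snippet (the variable names mirror the loop above; this is not part of the upstream script):

```shell
#!/usr/bin/env bash
# Hypothetical debug check; drop this just before the run_benchmark call.
if [ -z "$MODELS" ]; then
  echo "MODELS is empty -- config was not sourced?" >&2
fi
for model in $MODELS; do
  # ${!model} is bash indirect expansion, equivalent to: eval batch_size=\$$model
  if [ -z "${!model}" ]; then
    echo "no batch size defined for model '$model'" >&2
  fi
done
```

Worth noting: if $MODELS itself were empty, the for loop would never run and run_benchmark would not be called at all. Since the error shows run_benchmark *was* called with an empty --batch_size=, the per-model variable (e.g. $resnet50) being unset seems the more likely culprit, which would also fit the "Batchsize for VRAM size '80GB' not optimized" warning above.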
So then somehow MODELS is not being read from the config. I'll be investigating further, but if you happen to know anything I'm doing wrong, then please say so.
I'm using the lambdalabs docker images to build (https://github.com/lambdal/lambda-stack-dockerfiles/).
This is the command I run:
run_nvidia_benchmark.sh
I've ensured that docker can detect the gpus: