Closed: Levaru closed this issue 2 years ago
Thanks for reaching out!
This returned the following error: ...
Sorry, this was due to a hard-coded data directory mapping from the host filesystem to the docker container guest filesystem ($HOME/data -> /app/data/).
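To illustrate what such a mapping does (a minimal hypothetical sketch, not DAFNe code; the helper name and the /home/user/data stand-in for $HOME/data are assumptions):

```python
import os

def container_path(host_path, data_dir="/home/user/data", mount="/app/data"):
    """Translate a host path under the data directory to its location inside
    the container, mirroring a $HOME/data -> /app/data/ volume mount."""
    rel = os.path.relpath(host_path, data_dir)
    return os.path.join(mount, rel)

print(container_path("/home/user/data/dota/DOTA1_train1024.json"))
# -> /app/data/dota/DOTA1_train1024.json
```

If the data directory is hard-coded, the mapping silently breaks for anyone whose dataset lives elsewhere, which is exactly what the --data-dir flag now avoids.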
I've just fixed it in https://github.com/steven-lang/DAFNe/commit/f6c7a749ddd4603a937cc4b91fe72fcbd865b07b. You can now pass --data-dir to the run.py script to specify where your DOTA dataset is stored, so that the docker mapping is correct.
Please let me know if this fixed your issue.
I'm not really familiar with docker, do I need to run this command from inside the docker image?
No, the idea is that the docker image simply specifies the complete environment necessary to run the code, including all dependencies. The run.py script then starts a new docker container from this docker image and runs the experiment with the settings given via the run.py arguments.
As described in the README, you can also run this without docker but need to ensure that the environment is correctly set up (CUDA, packages in requirements.txt, etc.).
But this just returns another error: ...
This looks like you are running the script with Python 2, which does not support type annotations.
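For reference, even a minimal script with function annotations fails to parse under Python 2 (a hypothetical snippet, not from the DAFNe codebase):

```python
# Function annotations are Python 3 syntax; Python 2 raises a SyntaxError
# on this line before any code runs.
def greet(name: str) -> str:
    return "Hello, " + name

print(greet("DAFNe"))  # -> Hello, DAFNe (under Python 3)
```

Running the same file with `python2` would abort immediately with `SyntaxError: invalid syntax` at the annotated signature.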
But anyway, I recommend using the run.py script :-)
only --gpus 0 would work, or else it will return an unknown device error
What is the output of nvidia-smi on your machine, as well as when you run ./tools/run.py --gpus 0,1 --no-run nvidia-smi (the second command starts a container and runs nvidia-smi inside the container to see if the container has access to the specified GPUs)?
I've just fixed it in f6c7a74. You can now pass --data-dir to the run.py script to specify where your DOTA dataset is stored, so that the docker mapping is correct.
Please let me know if this fixed your issue.
Thank you for the very quick update! Sadly it didn't work, I'm still getting the same error. Is the DOTA1_train1024.json supposed to be generated or do I have to provide it myself? Just FYI, I'm helping a colleague set this up; I don't know much about DL or datasets, etc.
What is the output of nvidia-smi on your machine, as well as when you run ./tools/run.py --gpus 0,1 --no-run nvidia-smi (the second command starts a container and runs nvidia-smi inside the container to see if the container has access to the specified GPUs)?
The output when running nvidia-smi:
Wed Oct 27 10:54:38 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3070 On | 00000000:41:00.0 On | N/A |
| 0% 39C P8 19W / 240W | 371MiB / 7981MiB | 4% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1790 G /usr/lib/xorg/Xorg 18MiB |
| 0 N/A N/A 2203 G /usr/bin/gnome-shell 72MiB |
| 0 N/A N/A 2917 G /usr/lib/xorg/Xorg 135MiB |
| 0 N/A N/A 3059 G /usr/bin/gnome-shell 30MiB |
| 0 N/A N/A 3776 G ...AAAAAAAAA= --shared-files 109MiB |
+-----------------------------------------------------------------------------+
The output when running ./tools/run.py --gpus 0,1 --no-run nvidia-smi:
docker info | grep 'Runtimes.*nvidia'
WARNING: No swap limit support
id -u
id -g
docker info | grep 'rootless'
WARNING: No swap limit support
docker run --shm-size=1024m --interactive -t --rm --name dafne_default-1635324857 --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0,1 --user 1000:1000 --volume /home/ceres/git/DAFNe:/app/dafne:z --volume /home/ceres/data:/app/data/:z --volume /home/ceres/models:/app/models/:z --volume /home/ceres/results:/app/results/:z --volume /home/ceres/.torch/detectron2:/app/.torch/detectron2:z -e DAFNE_DATA_DIR=/app/data -e PYTHONPATH=./ -e FVCORE_CACHE=/app/.torch -e EMAIL_CREDENTIALS=/app/dafne/.mail dafne nvidia-smi
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: 1: unknown device: unknown.
And the same command but with --gpus set only to 0:
docker info | grep 'Runtimes.*nvidia'
WARNING: No swap limit support
id -u
id -g
docker info | grep 'rootless'
WARNING: No swap limit support
docker run --shm-size=1024m --interactive -t --rm --name dafne_default-1635324863 --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 --user 1000:1000 --volume /home/ceres/git/DAFNe:/app/dafne:z --volume /home/ceres/data:/app/data/:z --volume /home/ceres/models:/app/models/:z --volume /home/ceres/results:/app/results/:z --volume /home/ceres/.torch/detectron2:/app/.torch/detectron2:z -e DAFNE_DATA_DIR=/app/data -e PYTHONPATH=./ -e FVCORE_CACHE=/app/.torch -e EMAIL_CREDENTIALS=/app/dafne/.mail dafne nvidia-smi
[Omitted license]
Wed Oct 27 08:54:24 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3070 On | 00000000:41:00.0 On | N/A |
| 0% 40C P0 39W / 240W | 370MiB / 7981MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Sadly it didn't work, I'm still getting the same error. Is the DOTA1_train1024.json supposed to be generated or do I have to provide it myself?
Oh I see. You first need to download and pre-process the DOTA dataset. You can find instructions on the DOTA dataset homepage.
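As a hypothetical sanity check after pre-processing (the function and the flat directory layout are assumptions for illustration, not taken from the DAFNe docs), you could verify that the annotation file the error complains about actually exists and parses:

```python
import json
import os

def has_annotations(data_dir, name="DOTA1_train1024.json"):
    """Return True if the expected annotation file exists and is valid JSON."""
    path = os.path.join(data_dir, name)
    if not os.path.isfile(path):
        return False
    with open(path) as f:
        json.load(f)  # raises ValueError on a corrupt/partial file
    return True
```

An empty data directory would return False here, which matches the missing-file error you are seeing before the dataset is prepared.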
The output when running nvidia-smi: ...
Okay, so your system seems to only have a single GPU (GeForce RTX 3070, index 0). Therefore, running ./tools/run.py --gpus 0,1 tries to address GPUs with index 0 and 1. Since the GPU with index 1 does not exist, you see the error message device error: 1: unknown device: unknown.
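In essence (a minimal sketch, not DAFNe code), the failure boils down to requesting a device index that the driver never enumerated:

```python
# Hypothetical sketch: validate requested GPU indices against the ones
# nvidia-smi actually reports (here, a single RTX 3070 with index 0).
available = [0]
requested = [0, 1]  # corresponds to --gpus 0,1

missing = [gpu for gpu in requested if gpu not in available]
print(missing)  # -> [1], the index behind "device error: 1: unknown device"
```

On a single-GPU machine, any index other than 0 ends up in `missing`, which is why only --gpus 0 works.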
And the same command but with --gpus set only to 0: ...
This shows that your docker setup works (the container has access to the single GPU in your system) and the experiments could in principle be started (once you have also downloaded and pre-processed the dataset).
Ah ok, I understand it now! I was confused about the GPU count, thinking that it maybe meant something like cores or similar. Thank you very much for the help.
I built the docker image with
docker build -t dafne .
After that I tried running the command from the readme (only --gpus 0 would work, or else it will return an unknown device error):
./tools/run.py --gpus 0 --config-file ./configs/dota-1.0/1024.yaml
This returned the following error:
I'm not really familiar with docker, do I need to run this command from inside the docker image? I tried to run the command without docker from the readme:
NVIDIA_VISIBLE_DEVICES=0 ./tools/plain_train_net.py --num-gpus 1 --config-file ./configs/dota-1.0/1024.yaml
But this just returns another error: