Closed: Levaru closed this issue 2 years ago
Thanks for reaching out!
This returned the following error: ...
Sorry, this was due to a hard-coded data directory mapping from the host filesystem to the docker container guest filesystem ($HOME/data -> /app/data/).
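To illustrate what such a mapping does (a minimal hypothetical sketch, not DAFNe code; the helper name and the /home/user/data stand-in for $HOME/data are assumptions):

```python
import os

def container_path(host_path, data_dir="/home/user/data", mount="/app/data"):
    """Translate a host path under the data directory to its location inside
    the container, mirroring a $HOME/data -> /app/data/ volume mount."""
    rel = os.path.relpath(host_path, data_dir)
    return os.path.join(mount, rel)

print(container_path("/home/user/data/dota/DOTA1_train1024.json"))
# -> /app/data/dota/DOTA1_train1024.json
```

If the data directory is hard-coded, the mapping silently breaks for anyone whose dataset lives elsewhere, which is exactly what the --data-dir flag now avoids.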
I've just fixed it in https://github.com/steven-lang/DAFNe/commit/f6c7a749ddd4603a937cc4b91fe72fcbd865b07b. You can now pass --data-dir to the run.py script to specify where your DOTA dataset is stored, so that the docker mapping is correct.
Please let me know if this fixed your issue.
I'm not really familiar with docker, do I need to run this command from inside the docker image?
No, the idea is that the docker image simply specifies the complete environment necessary to run the code, including all dependencies. The run.py script then starts a new docker container from this docker image and runs the experiment with the settings given via the run.py arguments.
As described in the README, you can also run this without docker but need to ensure that the environment is correctly set up (CUDA, packages in requirements.txt, etc.).
But this just returns another error: ...
This looks like you are running the script with Python 2, which does not support type annotations.
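For reference, even a minimal script with function annotations fails to parse under Python 2 (a hypothetical snippet, not from the DAFNe codebase):

```python
# Function annotations are Python 3 syntax; Python 2 raises a SyntaxError
# on this line before any code runs.
def greet(name: str) -> str:
    return "Hello, " + name

print(greet("DAFNe"))  # -> Hello, DAFNe (under Python 3)
```

Running the same file with `python2` would abort immediately with `SyntaxError: invalid syntax` at the annotated signature.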
But anyway, I recommend using the run.py script :-)
only --gpus 0 would work, or else it will return an unknown device error
What is the output of nvidia-smi on your machine, as well as when you run ./tools/run.py --gpus 0,1 --no-run nvidia-smi (the second command starts a container and runs nvidia-smi inside the container to see if the container has access to the specified GPUs)?
I've just fixed it in f6c7a74. You can now pass --data-dir to the run.py script to specify where your DOTA dataset is stored, so that the docker mapping is correct.
Please let me know if this fixed your issue.
Thank you for the very quick update! Sadly it didn't work, I'm still getting the same error. Is the DOTA1_train1024.json supposed to be generated or do I have to provide it myself? Just FYI, I'm helping a colleague set this up; I don't know much about DL or datasets, etc.
What is the output of nvidia-smi on your machine, as well as when you run ./tools/run.py --gpus 0,1 --no-run nvidia-smi (the second command starts a container and runs nvidia-smi inside the container to see if the container has access to the specified GPUs)?
The output when running nvidia-smi:
Wed Oct 27 10:54:38 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3070 On | 00000000:41:00.0 On | N/A |
| 0% 39C P8 19W / 240W | 371MiB / 7981MiB | 4% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1790 G /usr/lib/xorg/Xorg 18MiB |
| 0 N/A N/A 2203 G /usr/bin/gnome-shell 72MiB |
| 0 N/A N/A 2917 G /usr/lib/xorg/Xorg 135MiB |
| 0 N/A N/A 3059 G /usr/bin/gnome-shell 30MiB |
| 0 N/A N/A 3776 G ...AAAAAAAAA= --shared-files 109MiB |
+-----------------------------------------------------------------------------+
The output when running ./tools/run.py --gpus 0,1 --no-run nvidia-smi:
docker info | grep 'Runtimes.*nvidia'
WARNING: No swap limit support
id -u
id -g
docker info | grep 'rootless'
WARNING: No swap limit support
docker run --shm-size=1024m --interactive -t --rm --name dafne_default-1635324857 --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0,1 --user 1000:1000 --volume /home/ceres/git/DAFNe:/app/dafne:z --volume /home/ceres/data:/app/data/:z --volume /home/ceres/models:/app/models/:z --volume /home/ceres/results:/app/results/:z --volume /home/ceres/.torch/detectron2:/app/.torch/detectron2:z -e DAFNE_DATA_DIR=/app/data -e PYTHONPATH=./ -e FVCORE_CACHE=/app/.torch -e EMAIL_CREDENTIALS=/app/dafne/.mail dafne nvidia-smi
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: 1: unknown device: unknown.
And the same command but with --gpus set only to 0:
docker info | grep 'Runtimes.*nvidia'
WARNING: No swap limit support
id -u
id -g
docker info | grep 'rootless'
WARNING: No swap limit support
docker run --shm-size=1024m --interactive -t --rm --name dafne_default-1635324863 --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 --user 1000:1000 --volume /home/ceres/git/DAFNe:/app/dafne:z --volume /home/ceres/data:/app/data/:z --volume /home/ceres/models:/app/models/:z --volume /home/ceres/results:/app/results/:z --volume /home/ceres/.torch/detectron2:/app/.torch/detectron2:z -e DAFNE_DATA_DIR=/app/data -e PYTHONPATH=./ -e FVCORE_CACHE=/app/.torch -e EMAIL_CREDENTIALS=/app/dafne/.mail dafne nvidia-smi
[Omitted license]
Wed Oct 27 08:54:24 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 460.27.04 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3070 On | 00000000:41:00.0 On | N/A |
| 0% 40C P0 39W / 240W | 370MiB / 7981MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Sadly it didn't work, I'm still getting the same error. Is the DOTA1_train1024.json supposed to be generated or do I have to provide it myself?
Oh I see. You first need to download and pre-process the DOTA dataset. You can find instructions on the DOTA dataset homepage.
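As a hypothetical sanity check after pre-processing (the function and the flat directory layout are assumptions for illustration, not taken from the DAFNe docs), you could verify that the annotation file the error complains about actually exists and parses:

```python
import json
import os

def has_annotations(data_dir, name="DOTA1_train1024.json"):
    """Return True if the expected annotation file exists and is valid JSON."""
    path = os.path.join(data_dir, name)
    if not os.path.isfile(path):
        return False
    with open(path) as f:
        json.load(f)  # raises ValueError on a corrupt/partial file
    return True
```

An empty data directory would return False here, which matches the missing-file error you are seeing before the dataset is prepared.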
The output when running nvidia-smi: ...
Okay, so your system seems to only have a single GPU (GeForce RTX 3070, index 0). Therefore, running ./tools/run.py --gpus 0,1 tries to address GPUs with index 0 and 1. Since the GPU with index 1 does not exist, you see the error message device error: 1: unknown device: unknown.
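In essence (a minimal sketch, not DAFNe code), the failure boils down to requesting a device index that the driver never enumerated:

```python
# Hypothetical sketch: validate requested GPU indices against the ones
# nvidia-smi actually reports (here, a single RTX 3070 with index 0).
available = [0]
requested = [0, 1]  # corresponds to --gpus 0,1

missing = [gpu for gpu in requested if gpu not in available]
print(missing)  # -> [1], the index behind "device error: 1: unknown device"
```

On a single-GPU machine, any index other than 0 ends up in `missing`, which is why only --gpus 0 works.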
And the same command but with --gpus set only to 0: ...
This shows that your docker setup works (the container has access to the single GPU in your system) and the experiments could in principle be started (once you have also downloaded and pre-processed the dataset).
Ah ok, I understand it now! I was confused about the GPU count, thinking that it maybe meant something like cores or similar. Thank you very much for the help.
I built the docker image with
docker build -t dafne .
After that I tried running the command from the readme (only --gpus 0 would work, or else it will return an unknown device error):
./tools/run.py --gpus 0 --config-file ./configs/dota-1.0/1024.yaml
This returned the following error:
I'm not really familiar with docker, do I need to run this command from inside the docker image? I tried to run the command without docker from the readme:
NVIDIA_VISIBLE_DEVICES=0 ./tools/plain_train_net.py --num-gpus 1 --config-file ./configs/dota-1.0/1024.yaml
But this just returns another error: