OFA-Sys / ONE-PEACE

A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Apache License 2.0

Does the VGGSound test only work with multiple GPUs? #41

Closed. paapu88 closed this issue 1 year ago.

paapu88 commented 1 year ago

Dear developers, I'm running the script:

cd one_peace/run_scripts/vggsound
bash evaluate.sh

but I get errors related to torch.distributed.elastic.multiprocessing.errors.ChildFailedError.

Question: I have a single-GPU machine. Is it possible to run the script with a single GPU? Best regards, Markus

logicwong commented 1 year ago

Hi, we support single-GPU evaluation. You can set CUDA_VISIBLE_DEVICES=0 and GPUS_PER_NODE=1 in evaluate.sh.
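As a rough sanity check for these two settings, the sketch below compares the number of processes requested through GPUS_PER_NODE (which evaluate.sh passes to torchrun as --nproc_per_node) with the number of devices listed in CUDA_VISIBLE_DEVICES. The helper names `visible_gpu_count` and `check_launch_settings` are hypothetical and not part of ONE-PEACE, and the fallback of one process when GPUS_PER_NODE is unset is an assumption made for illustration.

```python
import os


def visible_gpu_count(env=None):
    """Return how many devices CUDA_VISIBLE_DEVICES lists, or None if it is unset."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None  # unset: no restriction is applied by this variable
    return len([device for device in value.split(",") if device.strip()])


def check_launch_settings(env=None):
    """Hypothetical helper: compare GPUS_PER_NODE with the visible devices.

    Returns (processes, visible) and raises ValueError when more processes
    are requested than devices are visible.
    """
    env = os.environ if env is None else env
    # Assumption: fall back to a single process when GPUS_PER_NODE is unset.
    processes = int(env.get("GPUS_PER_NODE", "1"))
    visible = visible_gpu_count(env)
    if visible is not None and processes > visible:
        raise ValueError(
            f"GPUS_PER_NODE={processes} but only {visible} device(s) are visible"
        )
    return processes, visible


if __name__ == "__main__":
    print(check_launch_settings())
```

Running this in the same shell session after exporting the two variables should report one process and one visible device for the single-GPU setup described above.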

paapu88 commented 1 year ago

Yes, I had already done that, on an Amazon p3.2xlarge instance with a Tesla V100. My modified one_peace/run_scripts/vggsound/evaluate.sh:

#!/usr/bin/env bash

# The port for communication. Note that if you want to run multiple tasks on the same machine,
# you need to specify different port numbers.
export MASTER_PORT=6081
#export CUDA_VISIBLE_DEVICES=1,2,3
#export GPUS_PER_NODE=3
export CUDA_VISIBLE_DEVICES=0
export GPUS_PER_NODE=1

config_dir=../../run_scripts
path=../../checkpoints/finetune_vggsound.pt
task_name=vggsound
model_name=one_peace_classify
selected_cols=uniq_id,audio,text,duration
results_path=../../results/vggsound

data=/data/vggsound/vggsound_test.tsv
gen_subset='test'
torchrun --nproc_per_node=${GPUS_PER_NODE} --master_port=${MASTER_PORT} ../../evaluate.py \
    --config-dir=${config_dir} \
    --config-name=evaluate \
    common_eval.path=${path} \
    common_eval.results_path=${results_path} \
    task._name=${task_name} \
    model._name=${model_name} \
    dataset.gen_subset=${gen_subset} \
    dataset.batch_size=4 \
    common.bf16=false common.memory_efficient_bf16=false \
    common_eval.model_overrides="{'task': {'_name': '${task_name}', 'data': '${data}', 'selected_cols': '${selected_cols}', 'bpe_dir': '../../utils/BPE'}}"

It crashes with:

INFO:one_peace.data.tsv_reader:loaded /data/vggsound/vggsound_test.tsv
Traceback (most recent call last):
  File "/home/ubuntu/git/one_peace_unikie/one_peace/run_scripts/vggsound/../../evaluate.py", line 207, in <module>
    cli_main()
  File "/home/ubuntu/git/one_peace_unikie/one_peace/run_scripts/vggsound/../../evaluate.py", line 203, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/data/venvs/onepeace/lib/python3.10/site-packages/fairseq/distributed/utils.py", line 356, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/data/venvs/onepeace/lib/python3.10/site-packages/fairseq/distributed/utils.py", line 330, in distributed_main
    main(cfg, **kwargs)
  File "/home/ubuntu/git/one_peace_unikie/one_peace/run_scripts/vggsound/../../evaluate.py", line 134, in main
    for sample in progress:
  File "/data/venvs/onepeace/lib/python3.10/site-packages/fairseq/logging/progress_bar.py", line 272, in __iter__
    for i, obj in enumerate(self.iterable, start=self.n):
  File "/data/venvs/onepeace/lib/python3.10/site-packages/fairseq/data/iterators.py", line 57, in __next__
    x = next(self._itr)
  File "/data/venvs/onepeace/lib/python3.10/site-packages/fairseq/data/iterators.py", line 744, in __next__
    raise item
  File "/data/venvs/onepeace/lib/python3.10/site-packages/fairseq/data/iterators.py", line 674, in run
    for item in self._source:
  File "/data/venvs/onepeace/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/data/venvs/onepeace/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/data/venvs/onepeace/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/data/venvs/onepeace/lib/python3.10/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
soundfile.LibsndfileError: <exception str() failed>
[2023-11-23 08:58:54,065] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1423) of binary: /data/venvs/onepeace/bin/python
Traceback (most recent call last):
  File "/data/venvs/onepeace/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/venvs/onepeace/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/data/venvs/onepeace/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/data/venvs/onepeace/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/data/venvs/onepeace/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/venvs/onepeace/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../../evaluate.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-23_08:58:54
  host      : ip-172-31-24-92.eu-central-1.compute.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1423)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

logicwong commented 1 year ago

That's a bit strange. Could you try adding dataset.num_workers=0?
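A related way to narrow this down is to find out which sample fails to load. The sketch below wraps any map-style dataset (an object with `__len__` and `__getitem__`) and re-raises loading errors with the sample's index or description attached. `SampleReportingDataset` and its `describe` callback are hypothetical names, not part of ONE-PEACE or fairseq, and wiring it into the evaluation task is left open. Because the log above shows that formatting the original exception failed (`<exception str() failed>`), the wrapper only uses the exception's type name and keeps the original exception as the chained cause.

```python
class SampleReportingDataset:
    """Hypothetical wrapper that names the failing sample when loading raises."""

    def __init__(self, dataset, describe=None):
        self.dataset = dataset
        # describe maps an index to a readable label, e.g. the audio path.
        self.describe = describe or (lambda index: f"index {index}")

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        try:
            return self.dataset[index]
        except Exception as exc:
            # Avoid str(exc): only the type name goes into the new message.
            raise RuntimeError(
                f"failed to load sample ({self.describe(index)}): "
                f"{type(exc).__name__}"
            ) from exc


if __name__ == "__main__":
    class Demo:
        """Toy dataset whose second item cannot be loaded."""

        def __len__(self):
            return 2

        def __getitem__(self, index):
            if index == 1:
                raise OSError("unreadable file")
            return index

    wrapped = SampleReportingDataset(Demo(), describe=lambda i: f"file_{i}.flac")
    print(wrapped[0])
    try:
        wrapped[1]
    except RuntimeError as error:
        print(error)
```

The resulting RuntimeError carries a plain string message, so the name of the failing sample appears in the log even when the underlying exception cannot be printed.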

paapu88 commented 1 year ago

I tried that, but the error did not change. I also tried a bigger machine with 4 GPUs (an Amazon p3.8xlarge), both with your original settings and with one more GPU: same result.

Could 64 GB of GPU RAM still be too little?

I am using Ubuntu 22.04 and Python 3.10. pip freeze reports the following:

(onepeace) ubuntu@ip-172-31-38-68:~/git/one_peace_unikie/one_peace/run_scripts/vggsound$ pip freeze
antlr4-python3-runtime==4.8
audioread==3.0.1
bitarray==2.8.3
certifi==2023.11.17
cffi==1.16.0
charset-normalizer==2.1.1
colorama==0.4.6
Cython==3.0.5
decorator==5.1.1
einops==0.6.1
fairseq @ file:///data/git/ONE-PEACE/fairseq
filelock==3.13.1
fsspec==2023.10.0
huggingface-hub==0.19.4
hydra-core==1.0.7
idna==3.4
iopath==0.1.10
Jinja2==3.1.2
joblib==1.3.2
lazy_loader==0.3
librosa==0.10.0
llvmlite==0.41.1
lxml==4.9.3
MarkupSafe==2.1.3
mpmath==1.3.0
msgpack==1.0.7
networkx==3.2.1
numba==0.58.1
numpy==1.26.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
omegaconf==2.0.6
opencv-python==4.7.0.72
packaging==23.2
Pillow==8.4.0
platformdirs==4.0.0
pooch==1.8.0
portalocker==2.8.2
protobuf==3.20.3
pycparser==2.21
pydub==0.25.1
PyYAML==6.0.1
regex==2023.10.3
requests==2.28.1
sacrebleu==2.3.2
scikit-learn==1.0.2
scipy==1.11.4
soundfile==0.12.1
soxr==0.3.7
sympy==1.12
tabulate==0.9.0
tensorboardX==2.6
threadpoolctl==3.2.0
timm==0.6.11
torch==2.1.0
torchvision==0.16.1
tqdm==4.66.1
triton==2.1.0
typing_extensions==4.8.0
urllib3==1.26.18
wget==3.2
xformers==0.0.22.post7

logicwong commented 1 year ago

This looks like the issue at https://github.com/bastibe/python-soundfile/issues/360. Does that help? Alternatively, you can manually check whether the downloaded audio file can be read correctly:

import soundfile as sf
sf.read(path, dtype="float32")

paapu88 commented 1 year ago

import soundfile as sf
sound = sf.read('/data/vggsound/audio/test/Hd5M86oGZdw_000677.flac', dtype="float32")
print(sound)

This resulted in:

(array([-0.0055542 , -0.01080322, -0.01672363, ..., -0.0262146 ,
       -0.01751709, -0.01742554], dtype=float32), 16000)

So it looks ok to me.
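One file reading correctly does not rule out an unreadable file elsewhere in the test set. A minimal sketch for checking every file is shown below. It assumes the TSV has a header row containing a column named `audio` (suggested by selected_cols in evaluate.sh, but not confirmed in this thread) and that the paths in that column can be opened as written. The helper names `audio_paths_from_tsv`, `find_unreadable`, and `read_with_soundfile` are hypothetical.

```python
import csv


def read_with_soundfile(path):
    """Read one file the same way as the manual check in this thread."""
    import soundfile as sf  # imported lazily so the other helpers work without it

    return sf.read(path, dtype="float32")


def audio_paths_from_tsv(tsv_path, column="audio"):
    """Yield one column of a tab-separated file that has a header row.

    Assumption: the header names the audio column "audio".
    """
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row[column]


def find_unreadable(paths, reader=read_with_soundfile):
    """Try to read every path and return (path, exception type name) for failures."""
    failures = []
    for path in paths:
        try:
            reader(path)
        except Exception as exc:
            # Record only the type name; the exception text may not be printable.
            failures.append((path, type(exc).__name__))
    return failures


if __name__ == "__main__":
    import sys

    # Usage: python check_audio.py /data/vggsound/vggsound_test.tsv
    for path, error_name in find_unreadable(audio_paths_from_tsv(sys.argv[1])):
        print(f"{error_name}\t{path}")
```

If this prints nothing for the test TSV, every listed file can be read by soundfile. Any printed path is a candidate for a corrupt or missing download.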

For my part, I am giving up on this issue.

As far as I'm concerned, you can close this. Hopefully you can provide the demos I was hoping for in another ticket.