Closed: paapu88 closed this issue 1 year ago
Hi, we support single-GPU evaluation. You can set CUDA_VISIBLE_DEVICES=0 and GPUS_PER_NODE=1 in evaluate.sh.
Yes, I had already done that, but it still fails on an Amazon Tesla V100 (a p3.2xlarge instance). My modified one_peace/run_scripts/vggsound/evaluate.sh:
#!/usr/bin/env bash
# The port for communication. Note that if you want to run multiple tasks on the same machine,
# you need to specify different port numbers.
export MASTER_PORT=6081
#export CUDA_VISIBLE_DEVICES=1,2,3
#export GPUS_PER_NODE=3
export CUDA_VISIBLE_DEVICES=0
export GPUS_PER_NODE=1
config_dir=../../run_scripts
path=../../checkpoints/finetune_vggsound.pt
task_name=vggsound
model_name=one_peace_classify
selected_cols=uniq_id,audio,text,duration
results_path=../../results/vggsound
data=/data/vggsound/vggsound_test.tsv
gen_subset='test'
torchrun --nproc_per_node=${GPUS_PER_NODE} --master_port=${MASTER_PORT} ../../evaluate.py \
--config-dir=${config_dir} \
--config-name=evaluate \
common_eval.path=${path} \
common_eval.results_path=${results_path} \
task._name=${task_name} \
model._name=${model_name} \
dataset.gen_subset=${gen_subset} \
dataset.batch_size=4 \
common.bf16=false common.memory_efficient_bf16=false \
common_eval.model_overrides="{'task': {'_name': '${task_name}', 'data': '${data}', 'selected_cols': '${selected_cols}', 'bpe_dir': '../../utils/BPE'}}"
Crashes with:
INFO:one_peace.data.tsv_reader:loaded /data/vggsound/vggsound_test.tsv
Traceback (most recent call last):
File "/home/ubuntu/git/one_peace_unikie/one_peace/run_scripts/vggsound/../../evaluate.py", line 207, in <module>
cli_main()
File "/home/ubuntu/git/one_peace_unikie/one_peace/run_scripts/vggsound/../../evaluate.py", line 203, in cli_main
distributed_utils.call_main(cfg, main)
File "/data/venvs/onepeace/lib/python3.10/site-packages/fairseq/distributed/utils.py", line 356, in call_main
distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
File "/data/venvs/onepeace/lib/python3.10/site-packages/fairseq/distributed/utils.py", line 330, in distributed_main
main(cfg, **kwargs)
File "/home/ubuntu/git/one_peace_unikie/one_peace/run_scripts/vggsound/../../evaluate.py", line 134, in main
for sample in progress:
File "/data/venvs/onepeace/lib/python3.10/site-packages/fairseq/logging/progress_bar.py", line 272, in __iter__
for i, obj in enumerate(self.iterable, start=self.n):
File "/data/venvs/onepeace/lib/python3.10/site-packages/fairseq/data/iterators.py", line 57, in __next__
x = next(self._itr)
File "/data/venvs/onepeace/lib/python3.10/site-packages/fairseq/data/iterators.py", line 744, in __next__
raise item
File "/data/venvs/onepeace/lib/python3.10/site-packages/fairseq/data/iterators.py", line 674, in run
for item in self._source:
File "/data/venvs/onepeace/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/data/venvs/onepeace/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/data/venvs/onepeace/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/data/venvs/onepeace/lib/python3.10/site-packages/torch/_utils.py", line 694, in reraise
raise exception
soundfile.LibsndfileError: <exception str() failed>
[2023-11-23 08:58:54,065] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1423) of binary: /data/venvs/onepeace/bin/python
Traceback (most recent call last):
File "/data/venvs/onepeace/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/data/venvs/onepeace/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/data/venvs/onepeace/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/data/venvs/onepeace/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/data/venvs/onepeace/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/venvs/onepeace/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
../../evaluate.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-11-23_08:58:54
host : ip-172-31-24-92.eu-central-1.compute.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1423)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
It's a bit strange. Could you try adding dataset.num_workers=0?
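With dataset.num_workers=0 the audio decoding happens in the main process, which at least removes the worker re-raise step (torch/_utils.py reraise) from the traceback. A rough, self-contained sketch of that idea outside the fairseq pipeline; the dataset class and path below are illustrative only, not the actual ONE-PEACE loader:
# Minimal sketch: decode audio with soundfile inside a DataLoader.
# With num_workers=0 a broken file raises soundfile.LibsndfileError
# directly in the main process instead of being re-raised from a worker.
import soundfile as sf
from torch.utils.data import DataLoader, Dataset

class AudioDataset(Dataset):
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        wav, sr = sf.read(self.paths[idx], dtype="float32")
        return wav

# Illustrative path; point this at one of your downloaded test files.
paths = ["/data/vggsound/audio/test/example.flac"]
loader = DataLoader(AudioDataset(paths), batch_size=1, num_workers=0)
for batch in loader:
    print(batch.shape)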
Tried that, with no change in the error. I also tried a bigger machine with 4 GPUs (an Amazon p3.8xlarge), both with your original settings and with one more GPU: same result.
Could 64 GB of GPU RAM still be too little?
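(As a sanity check on the memory question, a minimal sketch that prints what the visible GPU actually reports; it assumes PyTorch can see the card:)
import torch

# Print the name and total memory of the GPU the job can actually see.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB")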
I'm using Ubuntu 22.04 and Python 3.10; pip freeze says the following:
(onepeace) ubuntu@ip-172-31-38-68:~/git/one_peace_unikie/one_peace/run_scripts/vggsound$ pip freeze
antlr4-python3-runtime==4.8
audioread==3.0.1
bitarray==2.8.3
certifi==2023.11.17
cffi==1.16.0
charset-normalizer==2.1.1
colorama==0.4.6
Cython==3.0.5
decorator==5.1.1
einops==0.6.1
fairseq @ file:///data/git/ONE-PEACE/fairseq
filelock==3.13.1
fsspec==2023.10.0
huggingface-hub==0.19.4
hydra-core==1.0.7
idna==3.4
iopath==0.1.10
Jinja2==3.1.2
joblib==1.3.2
lazy_loader==0.3
librosa==0.10.0
llvmlite==0.41.1
lxml==4.9.3
MarkupSafe==2.1.3
mpmath==1.3.0
msgpack==1.0.7
networkx==3.2.1
numba==0.58.1
numpy==1.26.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
omegaconf==2.0.6
opencv-python==4.7.0.72
packaging==23.2
Pillow==8.4.0
platformdirs==4.0.0
pooch==1.8.0
portalocker==2.8.2
protobuf==3.20.3
pycparser==2.21
pydub==0.25.1
PyYAML==6.0.1
regex==2023.10.3
requests==2.28.1
sacrebleu==2.3.2
scikit-learn==1.0.2
scipy==1.11.4
soundfile==0.12.1
soxr==0.3.7
sympy==1.12
tabulate==0.9.0
tensorboardX==2.6
threadpoolctl==3.2.0
timm==0.6.11
torch==2.1.0
torchvision==0.16.1
tqdm==4.66.1
triton==2.1.0
typing_extensions==4.8.0
urllib3==1.26.18
wget==3.2
xformers==0.0.22.post7
This looks like https://github.com/bastibe/python-soundfile/issues/360, could that help you? Alternatively, you can manually check whether the downloaded audio files can be read correctly:
import soundfile as sf
sf.read(path, dtype="float32")
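If single files look fine, a rough sketch like the following can scan every row of the evaluation TSV and list the files libsndfile cannot decode. It assumes the TSV has a header row and that its audio column holds file paths; adjust it if the paths are relative to an audio root or the layout differs:
# Scan vggsound_test.tsv and report every audio file soundfile cannot read.
import csv
import soundfile as sf

bad = []
with open("/data/vggsound/vggsound_test.tsv", newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        path = row["audio"]
        try:
            sf.read(path, dtype="float32")
        except Exception as exc:  # collect every file libsndfile rejects
            bad.append((path, repr(exc)))

print(f"{len(bad)} unreadable files")
for path, err in bad[:20]:
    print(path, err)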
import soundfile as sf
sound = sf.read('/data/vggsound/audio/test/Hd5M86oGZdw_000677.flac', dtype="float32")
print(sound)
which resulted in:
(array([-0.0055542 , -0.01080322, -0.01672363, ..., -0.0262146 ,
-0.01751709, -0.01742554], dtype=float32), 16000)
So it looks ok to me.
For my part, I'm giving up on this issue.
As far as I'm concerned you can close it, and hopefully provide the demos I was hoping for in the other ticket.
Dear developers, I'm running the evaluation script but getting errors related to torch.distributed.elastic.multiprocessing.errors.ChildFailedError.
Question: I have a single-GPU machine. Is it possible to run the script with a single GPU? br. Markus