Closed: ThomaswellY closed this issue 2 months ago
Hi, it is possible to train on a single GPU; just set --nproc_per_node=1.
Alex
Thanks for your help! But I just want to execute Retrieval.py directly so that I can debug it line by line. Even when I set distributed to 'False', when it reaches train_stats = train(model, train_loader, optimizer, tokenizer, epoch, warmup_steps, device, lr_scheduler, config), an error is reported on the last line of concat_all_gather:
@torch.no_grad()
def concat_all_gather(tensor):
    """
    Performs all_gather operation on the provided tensors.
    Warning: torch.distributed.all_gather has no gradient.
    """
    tensors_gather = [torch.ones_like(tensor) for _ in range(torch.distributed.get_world_size())]
The details were: Default process group has not been initialized, please make sure to call init_process_group. So I guess I should change something to skip the default process group initialization.
Could you please provide more info about your environment? I've just tried the code and I do not get any error. My launch script is:
source activate pytorch-GAN
python -m torch.distributed.run --nproc_per_node=1 --rdzv_endpoint=127.0.0.1:29501 \
Retrieval.py \
--config configs/PS_cuhk_pedes.yaml \
--output_dir output/cuhk-pedes/ \
--eval_mAP \
--checkpoint /home/user/projects/MARS/checkpoint/ALBEF.pth
My conda env has the following packages (not all of them are mandatory; it is just a test env full of packages). Also, since this env has the newer transformers package installed, I had to replace each tokenizer_class with processor_class in the xbert.py file (see the sketch after the package list below).
Package Version
------------------------ ----------
absl-py 2.1.0
accelerate 0.18.0
aiohappyeyeballs 2.4.0
aiohttp 3.10.5
aiosignal 1.3.1
annotated-types 0.6.0
antlr4-python3-runtime 4.9.3
anyio 4.3.0
archspec 0.2.1
asttokens 2.4.1
async-timeout 4.0.3
attrs 24.2.0
audioread 3.0.1
av 12.0.0
backcall 0.2.0
beartype 0.16.4
blis 0.7.11
blobfile 2.1.1
boltons 23.0.0
Brotli 1.0.9
cachetools 5.3.2
catalogue 2.0.10
certifi 2024.2.2
cffi 1.15.1
charset-normalizer 2.0.4
clean-fid 0.1.35
click 8.1.7
clip 1.0
cloudpathlib 0.16.0
comm 0.1.4
conda 23.11.0
conda-libmamba-solver 23.11.0
conda-package-handling 2.2.0
conda_package_streaming 0.9.0
confection 0.1.4
contourpy 1.2.0
cryptography 41.0.3
cycler 0.12.1
cymem 2.0.8
datasets 2.1.0
debugpy 1.6.7
decorator 4.4.2
diffusers 0.18.2
dill 0.3.8
distro 1.8.0
docker-pycreds 0.4.0
dominate 2.9.1
einops 0.6.1
ema-pytorch 0.3.1
en-core-web-sm 3.7.1
entrypoints 0.4
exceptiongroup 1.2.0
executing 2.0.1
fairscale 0.4.13
filelock 3.13.1
fonttools 4.45.1
frozenlist 1.4.1
fsspec 2023.10.0
ftfy 6.1.3
gitdb 4.0.11
GitPython 3.1.43
gmpy2 2.1.2
google-auth 2.27.0
google-auth-oauthlib 1.2.0
groq 0.5.0
grpcio 1.60.0
h11 0.14.0
h5py 3.8.0
httpcore 1.0.5
httpx 0.27.0
huggingface-hub 0.13.3
idna 3.4
imageio 2.34.1
imageio-ffmpeg 0.5.1
importlib-metadata 6.3.0
importlib-resources 6.1.1
ipykernel 6.26.0
ipython 8.12.0
jedi 0.19.1
Jinja2 3.1.2
joblib 1.4.0
jsonpatch 1.32
jsonpointer 2.1
jupyter-client 7.3.4
jupyter_core 5.5.0
kiwisolver 1.4.5
langcodes 3.4.0
language_data 1.2.0
lazy_loader 0.4
libmambapy 1.5.3
librosa 0.9.2
lightning-utilities 0.9.0
llvmlite 0.42.0
lxml 4.9.4
marisa-trie 1.1.0
Markdown 3.5.2
MarkupSafe 2.1.1
matplotlib 3.5.2
matplotlib-inline 0.1.6
menuinst 2.0.0
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
more-itertools 10.2.0
moviepy 1.0.3
mpi4py 3.1.4
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
murmurhash 1.0.10
natsort 8.4.0
nest-asyncio 1.5.8
networkx 3.1
ninja 1.11.1.1
nltk 3.8.1
numba 0.59.1
numpy 1.23.0
nvidia-cublas-cu11 11.10.3.66
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.5.0.96
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
oauthlib 3.2.2
omegaconf 2.3.0
openai-whisper 20231117
opencv-python 4.8.1.78
packaging 23.1
pandas 1.4.1
parso 0.8.3
pathtools 0.1.2
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.1.0
pip 23.3
platformdirs 4.1.0
pluggy 1.0.0
pooch 1.8.2
preshed 3.0.9
proglog 0.1.10
progressbar33 2.4
promise 2.3
prompt-toolkit 3.0.41
protobuf 3.20.3
psutil 5.9.1
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 17.0.0
pyasn1 0.5.1
pyasn1-modules 0.3.0
pyav 12.0.5
pycocoevalcap 1.2
pycocotools 2.0.7
pycosat 0.6.6
pycparser 2.21
pycryptodomex 3.20.0
pydantic 2.7.1
pydantic_core 2.18.2
Pygments 2.17.2
pyOpenSSL 23.2.0
pyparsing 3.1.1
PySocks 1.7.1
python-dateutil 2.8.2
pytorch-fid 0.3.0
pytorch-lightning 2.1.3
pytz 2024.1
PyWavelets 1.6.0
PyYAML 6.0.1
pyzmq 23.0.0
regex 2023.10.3
requests 2.31.0
requests-oauthlib 1.3.1
resampy 0.4.2
responses 0.18.0
rsa 4.9
ruamel.yaml 0.17.21
ruamel.yaml.clib 0.2.6
ruamel-yaml-conda 0.17.21
sacremoses 0.1.1
safetensors 0.4.0
scikit-image 0.19.3
scikit-learn 1.2.2
scikit-video 1.1.11
scipy 1.8.0
seaborn 0.13.2
semantic-version 2.10.0
sentencepiece 0.2.0
sentry-sdk 2.13.0
setproctitle 1.3.3
setuptools 68.0.0
setuptools-rust 1.9.0
shortuuid 1.0.13
six 1.16.0
smart-open 6.4.0
smmap 5.0.1
sniffio 1.3.1
soundfile 0.12.1
spacy 3.7.4
spacy-legacy 3.0.12
spacy-loggers 1.0.5
srsly 2.4.8
ssr-eval 0.0.6
stack-data 0.6.2
sympy 1.11.1
tensorboard 2.15.1
tensorboard-data-server 0.7.2
tensorboardX 2.2
thinc 8.2.3
threadpoolctl 3.5.0
tifffile 2024.5.10
tiktoken 0.7.0
timm 0.9.16
tokenizers 0.13.3
tomli 2.0.1
torch 1.13.1
torchaudio 0.13.1
torchfile 0.1.0
torchlibrosa 0.1.0
torchmetrics 0.11.4
torchvision 0.14.1
tornado 6.1
tqdm 4.63.1
traitlets 5.14.0
transformers 4.27.0
triton 2.1.0
typer 0.9.4
typing_extensions 4.7.1
tzdata 2024.1
urllib3 1.26.18
wandb 0.12.14
wasabi 1.1.2
Wave 0.0.2
wcwidth 0.2.12
weasel 0.3.4
Werkzeug 3.0.1
wheel 0.41.2
whisper 1.1.10
xxhash 3.5.0
yacs 0.1.8
yarl 1.9.4
zipp 3.17.0
zstandard 0.19.0
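For reference, a minimal, hypothetical sketch of the xbert.py change mentioned above (the path and the exact occurrences are assumptions; back up the file before running it):
from pathlib import Path

# Replace the old `tokenizer_class` keyword used by older transformers versions
# with the `processor_class` keyword expected by the newer release.
xbert_path = Path("models/xbert.py")  # assumed location inside the repo
source = xbert_path.read_text()
xbert_path.write_text(source.replace("tokenizer_class", "processor_class"))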
Thanks for your kind help~ I'm sure that starting training with torch.distributed works, no matter whether on a single GPU or on multiple GPUs. But that standard way runs the program in the background, so I can't debug it line by line. Since I'm not familiar with your code, I prefer to simply run the main script (Retrieval.py in your project) in VSCode, so that I can observe the details just by starting the debugger and setting breakpoints. I set the args in launch.json as below:
{
"version": "0.2.0",
"configurations": [
{
"name":"launch T2I",
"type": "debugpy",
"request": "launch",
"program": "${workspaceFolder}/RaSa/Retrieval.py",
"args": [
"--config" , "${workspaceFolder}/RaSa/configs/PS_cuhk_pedes.yaml",
"--output_dir" , "${workspaceFolder}/RaSa/output/cuhk-pedes/train",
"--checkpoint" , "${workspaceFolder}/models/ALBEF/ALBEF.pth",
"--eval_mAP",
"--distributed" , "false"
],
"env":{
"CUDA_VISIBLE_DEVICES":"1",
// "DISPLAY":"localhost:10.0"
},
"justMyCode": true,
},
]
}
I think this should be enough to run Retrieval.py, but I just run into the bug below, which seems to mean that I can only start training via torch.distributed.run, and that really confuses me. The detailed error is:
Exception has occurred: RuntimeError
Default process group has not been initialized, please make sure to call init_process_group.
File "/media/data1/yanghao/RaSa/models/model_person_search.py", line 278, in concat_allgather
for in range(torch.distributed.get_world_size())]
File "/media/data1/yanghao/RaSa/models/model_person_search.py", line 218, in _dequeue_and_enqueue
image_feats = concat_all_gather(image_feat)
File "/media/data1/yanghao/RaSa/models/model_person_search.py", line 103, in forward
self._dequeue_and_enqueue(image_feat_m, text_feat_m, idx)
File "/media/data1/yanghao/RaSa/Retrieval.py", line 49, in train
loss_cl, loss_pitm, loss_mlm, loss_prd, loss_mrtd = model(image1, image2, text_input1, text_input2,
File "/media/data1/yanghao/RaSa/Retrieval.py", line 264, in main
train_stats = train(model, train_loader, optimizer, tokenizer, epoch, warmup_steps, device, lr_scheduler,
File "/media/data1/yanghao/RaSa/Retrieval.py", line 331, in
Now it is clear: if you want to debug the code line by line, you have to modify it. The concat_all_gather method only works when the script is launched with torch.distributed.run. To debug it otherwise, you have to comment out every line of code connected to PyTorch distributed data parallel, which could be a lot of work!
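As an alternative to commenting everything out, here is a minimal sketch (not part of the released code; the gather-and-concat part simply follows the usual MoCo-style implementation): guard concat_all_gather so that it falls back to the local tensor when no process group has been initialized, which lets a plain python Retrieval.py session run under the debugger:
import torch

@torch.no_grad()
def concat_all_gather(tensor):
    """
    Performs all_gather operation on the provided tensors.
    Warning: torch.distributed.all_gather has no gradient.
    """
    # Fall back to the local tensor when the script is not launched via torch.distributed.run.
    if not (torch.distributed.is_available() and torch.distributed.is_initialized()):
        return tensor
    tensors_gather = [torch.ones_like(tensor)
                      for _ in range(torch.distributed.get_world_size())]
    torch.distributed.all_gather(tensors_gather, tensor, async_op=False)
    return torch.cat(tensors_gather, dim=0)
Another option is to initialize a one-process group at the start of main, for example torch.distributed.init_process_group(backend="gloo", init_method="tcp://127.0.0.1:29501", rank=0, world_size=1), so that the distributed calls keep working; other DDP-specific code paths in the repo may still need similar guards.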
Alex
Thanks! Even though your code is based on RaSa, it is still beautifully written. I imagine that in the initial stage of writing the code, for easier debugging and inspection, torch.distributed was probably not involved yet. If so, may I ask whether you have any plans to make that early version of the code public? Thanks~
I am sorry, but currently I do not have a debug-ready version of the code, and we do not plan to publish one soon, as we are working on other projects.
Alex
Hi, thanks for your awesome work! I was experimenting with your code and found that there is no single-GPU training shell script available. So I executed Retrieval.py to debug it, and found that the process group necessarily needs to be initialized. Does that mean single-GPU training is not supported by your original code? Is there a quick way to start training on one GPU?
Sincerely looking forward to your reply~