Closed: ThomaswellY closed this issue 2 months ago
Hi, it is possible to train on a single GPU; just set --nproc_per_node=1.
Alex
Thanks for your help! But I just want to execute Retrieval.py directly so that I can debug it line by line. Even when I set distributed to 'False', when it reaches train_stats = train(model, train_loader, optimizer, tokenizer, epoch, warmup_steps, device, lr_scheduler, config), an error is reported on the last line of concat_all_gather:
@torch.no_grad()
def concat_all_gather(tensor):
    """
    Performs all_gather operation on the provided tensors.
    Warning: torch.distributed.all_gather has no gradient.
    """
    tensors_gather = [torch.ones_like(tensor) for _ in range(torch.distributed.get_world_size())]
The details were: Default process group has not been initialized, please make sure to call init_process_group. So I guess I should change something to skip the default process group initialization.
Could you please provide more info about your environment? I've just tried the code and I do not get any error. My launch script is:
source activate pytorch-GAN
python -m torch.distributed.run --nproc_per_node=1 --rdzv_endpoint=127.0.0.1:29501 \
Retrieval.py \
--config configs/PS_cuhk_pedes.yaml \
--output_dir output/cuhk-pedes/ \
--eval_mAP \
--checkpoint /home/user/projects/MARS/checkpoint/ALBEF.pth
My conda env has the following packages (not all of them are mandatory; it is just a test env full of packages). Also, since this env has the newer transformers package installed, I had to replace each tokenizer_class with processor_class in the xbert.py file (see the sketch after the package list below).
Package Version
------------------------ ----------
absl-py 2.1.0
accelerate 0.18.0
aiohappyeyeballs 2.4.0
aiohttp 3.10.5
aiosignal 1.3.1
annotated-types 0.6.0
antlr4-python3-runtime 4.9.3
anyio 4.3.0
archspec 0.2.1
asttokens 2.4.1
async-timeout 4.0.3
attrs 24.2.0
audioread 3.0.1
av 12.0.0
backcall 0.2.0
beartype 0.16.4
blis 0.7.11
blobfile 2.1.1
boltons 23.0.0
Brotli 1.0.9
cachetools 5.3.2
catalogue 2.0.10
certifi 2024.2.2
cffi 1.15.1
charset-normalizer 2.0.4
clean-fid 0.1.35
click 8.1.7
clip 1.0
cloudpathlib 0.16.0
comm 0.1.4
conda 23.11.0
conda-libmamba-solver 23.11.0
conda-package-handling 2.2.0
conda_package_streaming 0.9.0
confection 0.1.4
contourpy 1.2.0
cryptography 41.0.3
cycler 0.12.1
cymem 2.0.8
datasets 2.1.0
debugpy 1.6.7
decorator 4.4.2
diffusers 0.18.2
dill 0.3.8
distro 1.8.0
docker-pycreds 0.4.0
dominate 2.9.1
einops 0.6.1
ema-pytorch 0.3.1
en-core-web-sm 3.7.1
entrypoints 0.4
exceptiongroup 1.2.0
executing 2.0.1
fairscale 0.4.13
filelock 3.13.1
fonttools 4.45.1
frozenlist 1.4.1
fsspec 2023.10.0
ftfy 6.1.3
gitdb 4.0.11
GitPython 3.1.43
gmpy2 2.1.2
google-auth 2.27.0
google-auth-oauthlib 1.2.0
groq 0.5.0
grpcio 1.60.0
h11 0.14.0
h5py 3.8.0
httpcore 1.0.5
httpx 0.27.0
huggingface-hub 0.13.3
idna 3.4
imageio 2.34.1
imageio-ffmpeg 0.5.1
importlib-metadata 6.3.0
importlib-resources 6.1.1
ipykernel 6.26.0
ipython 8.12.0
jedi 0.19.1
Jinja2 3.1.2
joblib 1.4.0
jsonpatch 1.32
jsonpointer 2.1
jupyter-client 7.3.4
jupyter_core 5.5.0
kiwisolver 1.4.5
langcodes 3.4.0
language_data 1.2.0
lazy_loader 0.4
libmambapy 1.5.3
librosa 0.9.2
lightning-utilities 0.9.0
llvmlite 0.42.0
lxml 4.9.4
marisa-trie 1.1.0
Markdown 3.5.2
MarkupSafe 2.1.1
matplotlib 3.5.2
matplotlib-inline 0.1.6
menuinst 2.0.0
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
more-itertools 10.2.0
moviepy 1.0.3
mpi4py 3.1.4
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
murmurhash 1.0.10
natsort 8.4.0
nest-asyncio 1.5.8
networkx 3.1
ninja 1.11.1.1
nltk 3.8.1
numba 0.59.1
numpy 1.23.0
nvidia-cublas-cu11 11.10.3.66
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.5.0.96
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
oauthlib 3.2.2
omegaconf 2.3.0
openai-whisper 20231117
opencv-python 4.8.1.78
packaging 23.1
pandas 1.4.1
parso 0.8.3
pathtools 0.1.2
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.1.0
pip 23.3
platformdirs 4.1.0
pluggy 1.0.0
pooch 1.8.2
preshed 3.0.9
proglog 0.1.10
progressbar33 2.4
promise 2.3
prompt-toolkit 3.0.41
protobuf 3.20.3
psutil 5.9.1
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 17.0.0
pyasn1 0.5.1
pyasn1-modules 0.3.0
pyav 12.0.5
pycocoevalcap 1.2
pycocotools 2.0.7
pycosat 0.6.6
pycparser 2.21
pycryptodomex 3.20.0
pydantic 2.7.1
pydantic_core 2.18.2
Pygments 2.17.2
pyOpenSSL 23.2.0
pyparsing 3.1.1
PySocks 1.7.1
python-dateutil 2.8.2
pytorch-fid 0.3.0
pytorch-lightning 2.1.3
pytz 2024.1
PyWavelets 1.6.0
PyYAML 6.0.1
pyzmq 23.0.0
regex 2023.10.3
requests 2.31.0
requests-oauthlib 1.3.1
resampy 0.4.2
responses 0.18.0
rsa 4.9
ruamel.yaml 0.17.21
ruamel.yaml.clib 0.2.6
ruamel-yaml-conda 0.17.21
sacremoses 0.1.1
safetensors 0.4.0
scikit-image 0.19.3
scikit-learn 1.2.2
scikit-video 1.1.11
scipy 1.8.0
seaborn 0.13.2
semantic-version 2.10.0
sentencepiece 0.2.0
sentry-sdk 2.13.0
setproctitle 1.3.3
setuptools 68.0.0
setuptools-rust 1.9.0
shortuuid 1.0.13
six 1.16.0
smart-open 6.4.0
smmap 5.0.1
sniffio 1.3.1
soundfile 0.12.1
spacy 3.7.4
spacy-legacy 3.0.12
spacy-loggers 1.0.5
srsly 2.4.8
ssr-eval 0.0.6
stack-data 0.6.2
sympy 1.11.1
tensorboard 2.15.1
tensorboard-data-server 0.7.2
tensorboardX 2.2
thinc 8.2.3
threadpoolctl 3.5.0
tifffile 2024.5.10
tiktoken 0.7.0
timm 0.9.16
tokenizers 0.13.3
tomli 2.0.1
torch 1.13.1
torchaudio 0.13.1
torchfile 0.1.0
torchlibrosa 0.1.0
torchmetrics 0.11.4
torchvision 0.14.1
tornado 6.1
tqdm 4.63.1
traitlets 5.14.0
transformers 4.27.0
triton 2.1.0
typer 0.9.4
typing_extensions 4.7.1
tzdata 2024.1
urllib3 1.26.18
wandb 0.12.14
wasabi 1.1.2
Wave 0.0.2
wcwidth 0.2.12
weasel 0.3.4
Werkzeug 3.0.1
wheel 0.41.2
whisper 1.1.10
xxhash 3.5.0
yacs 0.1.8
yarl 1.9.4
zipp 3.17.0
zstandard 0.19.0
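For reference, a minimal, hypothetical sketch of the xbert.py change mentioned above (the path and the exact occurrences are assumptions; back up the file before running it):
from pathlib import Path

# Replace the old `tokenizer_class` keyword used by older transformers versions
# with the `processor_class` keyword expected by the newer release.
xbert_path = Path("models/xbert.py")  # assumed location inside the repo
source = xbert_path.read_text()
xbert_path.write_text(source.replace("tokenizer_class", "processor_class"))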
Thanks for your kind help~ I'm sure that starting training with torch.distributed works, no matter whether on a single GPU or on multiple GPUs. But that standard way runs the program in the background, so I can't debug it line by line. Since I'm not familiar with your code, I prefer to simply run the main script (Retrieval.py in your project) in VSCode, so that I can observe the details just by starting the debugger and setting breakpoints. I set the args in launch.json as below:
{
"version": "0.2.0",
"configurations": [
{
"name":"launch T2I",
"type": "debugpy",
"request": "launch",
"program": "${workspaceFolder}/RaSa/Retrieval.py",
"args": [
"--config" , "${workspaceFolder}/RaSa/configs/PS_cuhk_pedes.yaml",
"--output_dir" , "${workspaceFolder}/RaSa/output/cuhk-pedes/train",
"--checkpoint" , "${workspaceFolder}/models/ALBEF/ALBEF.pth",
"--eval_mAP",
"--distributed" , "false"
],
"env":{
"CUDA_VISIBLE_DEVICES":"1",
// "DISPLAY":"localhost:10.0"
},
"justMyCode": true,
},
]
}
I think this should be enough to run Retrieval.py, but I just run into the bug below, which seems to mean that I can only start training via torch.distributed.run, and that really confuses me. The detailed error is:
Exception has occurred: RuntimeError
Default process group has not been initialized, please make sure to call init_process_group.
File "/media/data1/yanghao/RaSa/models/model_person_search.py", line 278, in concat_allgather
for in range(torch.distributed.get_world_size())]
File "/media/data1/yanghao/RaSa/models/model_person_search.py", line 218, in _dequeue_and_enqueue
image_feats = concat_all_gather(image_feat)
File "/media/data1/yanghao/RaSa/models/model_person_search.py", line 103, in forward
self._dequeue_and_enqueue(image_feat_m, text_feat_m, idx)
File "/media/data1/yanghao/RaSa/Retrieval.py", line 49, in train
loss_cl, loss_pitm, loss_mlm, loss_prd, loss_mrtd = model(image1, image2, text_input1, text_input2,
File "/media/data1/yanghao/RaSa/Retrieval.py", line 264, in main
train_stats = train(model, train_loader, optimizer, tokenizer, epoch, warmup_steps, device, lr_scheduler,
File "/media/data1/yanghao/RaSa/Retrieval.py", line 331, in
Now it is clear: if you want to debug the code line by line, you have to modify it. The concat_all_gather method only works when the script is launched with torch.distributed.run. To debug it otherwise, you have to comment out every line of code connected to PyTorch distributed data parallel, which could be a lot of work!
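As an alternative to commenting everything out, here is a minimal sketch (not part of the released code; the gather-and-concat part simply follows the usual MoCo-style implementation): guard concat_all_gather so that it falls back to the local tensor when no process group has been initialized, which lets a plain python Retrieval.py session run under the debugger:
import torch

@torch.no_grad()
def concat_all_gather(tensor):
    """
    Performs all_gather operation on the provided tensors.
    Warning: torch.distributed.all_gather has no gradient.
    """
    # Fall back to the local tensor when the script is not launched via torch.distributed.run.
    if not (torch.distributed.is_available() and torch.distributed.is_initialized()):
        return tensor
    tensors_gather = [torch.ones_like(tensor)
                      for _ in range(torch.distributed.get_world_size())]
    torch.distributed.all_gather(tensors_gather, tensor, async_op=False)
    return torch.cat(tensors_gather, dim=0)
Another option is to initialize a one-process group at the start of main, for example torch.distributed.init_process_group(backend="gloo", init_method="tcp://127.0.0.1:29501", rank=0, world_size=1), so that the distributed calls keep working; other DDP-specific code paths in the repo may still need similar guards.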
Alex
Thanks! Even though your code is based on RaSa, it is still beautifully written. I imagine that in the initial stage of writing the code, for easier debugging and inspection, torch.distributed was probably not involved yet. If so, may I ask whether you have any plans to make that early version of the code public? Thanks~
I am sorry, but currently I do not have a debug-ready version of the code, and we do not plan to publish one soon, as we are working on other projects.
Alex
Hi, thanks for your awesome work! I was experimenting with your code and found that there is no single-GPU training shell script available. So I executed Retrieval.py to debug it, and found that the process group necessarily needs to be initialized. Does that mean single-GPU training is not supported by your original code? Is there a quick way to start training on one GPU?
Sincerely looking forward to your reply~