OML-Team / open-metric-learning

Metric learning and retrieval pipelines, models and zoo.
https://open-metric-learning.readthedocs.io/en/latest/index.html
Apache License 2.0

DDP doesn't work with python < 3.8 #347

Closed PapaMadeleine2022 closed 9 months ago

PapaMadeleine2022 commented 1 year ago

Update: we found that the problem occurs on Python 3.7; on Python 3.8 it works well.

Hello, when I set devices: 1, the model trains well. But when I set devices: 2, devices: 4, or devices: [0,1,2,3], it shows this error:

...
  File "/conda/envs/py3.7/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 237, in _reload_evaluation_dataloaders
    self.trainer.reset_val_dataloader()
  File "/conda/envs/py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1966, in reset_val_dataloader
    RunningStage.VALIDATING, model=pl_module
  File "/conda/envs/py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 372, in _reset_eval_dataloader
    dataloaders = self._request_dataloader(mode, model=model)
  File "/conda/envs/py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 459, in _request_dataloader
    dataloader = source.dataloader()
  File "/conda/envs/py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 532, in dataloader
    return self.instance.trainer._call_lightning_module_hook(self.name, pl_module=self.instance)
  File "/conda/envs/py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/conda/envs/py3.7/lib/python3.7/site-packages/oml/lightning/modules/ddp.py", line 34, in val_dataloader
    return self._patch_loaders("val") if self.loaders_val else super(ModuleDDP, self).val_dataloader()
  File "/conda/envs/py3.7/lib/python3.7/site-packages/pytorch_lightning/core/hooks.py", line 599, in val_dataloader
    raise MisconfigurationException("`val_dataloader` must be implemented to be used with the Lightning Trainer")
pytorch_lightning.utilities.exceptions.MisconfigurationException: `val_dataloader` must be implemented to be used with the Lightning Trainer
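
Reading the last two frames: the expression at `ddp.py` line 34 only patches the loaders when `self.loaders_val` is truthy, and otherwise defers to the default Lightning hook, which raises. So in the failing process `loaders_val` was apparently empty or unset. A minimal sketch of that fallback in plain Python follows; `BaseHooks`, `DDPModuleSketch`, `MisconfigurationError` and `try_val_dataloader` are hypothetical stand-ins, not the real Lightning or OML classes, and the body of `_patch_loaders` is a placeholder.

```python
class MisconfigurationError(Exception):
    """Hypothetical stand-in for Lightning's MisconfigurationException."""


class BaseHooks:
    """Hypothetical stand-in for the default Lightning hooks."""

    def val_dataloader(self):
        # The default hook raises when no subclass provides a loader.
        raise MisconfigurationError("`val_dataloader` must be implemented")


class DDPModuleSketch(BaseHooks):
    """Hypothetical stand-in mirroring the fallback seen at ddp.py line 34."""

    def __init__(self, loaders_val=None):
        self.loaders_val = loaders_val

    def _patch_loaders(self, mode):
        # Placeholder: the real method is not shown in the traceback.
        return self.loaders_val

    def val_dataloader(self):
        # Patch the loaders only if they are set; otherwise defer to the base hook.
        if self.loaders_val:
            return self._patch_loaders("val")
        return super().val_dataloader()


def try_val_dataloader(module):
    """Return the loaders, or a short error string if the base hook raised."""
    try:
        return module.val_dataloader()
    except MisconfigurationError as err:
        return f"error: {err}"
```

This only shows why the exception surfaces; it does not explain why the validation loaders would be missing in a multi-device run.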

My envs are:

pytorch-lightning         1.6.5
torch                     1.13.0
torchmetrics              0.11.4
torchvision               0.14.0
open-metric-learning      0.3.13

AlekseySh commented 1 year ago

I don't quite understand which code you are running. Is it validate_cub.py?

PapaMadeleine2022 commented 1 year ago

> I don't really understand what code are you running. Is it validate_cub.py?

train_cub.py

AlekseySh commented 1 year ago

What is your accelerator? Please try both: gpu with `devices: 2` and cpu with `devices: 2`.

PapaMadeleine2022 commented 1 year ago

> What is your accelerator? Please try both: gpu with `devices: 2` and cpu with `devices: 2`.

accelerator: gpu

AlekseySh commented 1 year ago

Does it work with cpu and multiple devices?

PapaMadeleine2022 commented 1 year ago

It does not seem to work with cpu and multiple devices.

AlekseySh commented 1 year ago

Hmm, that's weird, because it works for me. Could you clear your env and install the latest version of OML?

I'm pretty sure the problem is in the environment or the library versions. So, if the idea above doesn't work, you can try running it in Docker. You can pull a ready-to-use image from Docker Hub; see the installation section.

PS. Are you on Linux?

PapaMadeleine2022 commented 1 year ago

@AlekseySh Yes, on Linux. My pip list output is:


absl-py                   1.4.0
aiohttp                   3.8.4
aiosignal                 1.3.1
albumentations            1.3.0
antlr4-python3-runtime    4.9.3
anyio                     3.6.2
argon2-cffi               21.3.0
argon2-cffi-bindings      21.2.0
arrow                     1.2.3
asn1crypto                1.5.1
async-timeout             4.0.2
asynctest                 0.13.0
attrs                     22.2.0
backcall                  0.2.0
backports.cached-property 1.0.2
beautifulsoup4            4.11.2
bleach                    6.0.0
boto3                     1.26.94
botocore                  1.29.94
bravado                   11.0.3
bravado-core              5.17.1
cached-property           1.5.2
cachetools                5.3.0
certifi                   2022.12.7
cffi                      1.15.1
chardet                   3.0.4
charset-normalizer        3.1.0
click                     8.1.3
click-plugins             1.1.1
cligj                     0.7.2
colorama                  0.4.6
cPython                   0.0.6
ctranslate2               3.9.0
cycler                    0.11.0
debugpy                   1.6.6
decorator                 5.1.1
deepl                     1.14.0
defusedxml                0.7.1
dnspython                 2.3.0
editdistance              0.6.2
einops                    0.6.0
entrypoints               0.4
exceptiongroup            1.1.1
faiss                     1.5.3
fastjsonschema            2.16.3
filelock                  3.10.0
Fiona                     1.9.1
fonttools                 4.38.0
fqdn                      1.5.1
freetype-py               2.3.0
frozenlist                1.3.3
fsspec                    2023.1.0
future                    0.18.3
gdown                     4.6.4
geopandas                 0.10.2
gitdb                     4.0.10
GitPython                 3.1.31
google-auth               2.16.2
google-auth-oauthlib      0.4.6
googletrans               4.0.0rc1
grad-cam                  1.4.6
grpcio                    1.51.3
h11                       0.9.0
h2                        3.2.0
hpack                     3.0.0
hstspreload               2023.1.1
httpcore                  0.9.1
httpx                     0.13.3
huggingface-hub           0.13.2
hydra-core                1.2.0
hyperframe                5.2.0
idna                      2.10
ImageHash                 4.3.1
imageio                   2.26.0
importlib-metadata        6.0.0
importlib-resources       5.12.0
iniconfig                 2.0.0
ipykernel                 6.16.2
ipython                   7.34.0
ipython-genutils          0.2.0
ipywidgets                8.0.4
isoduration               20.11.0
jedi                      0.18.2
Jinja2                    3.1.2
jmespath                  1.0.1
joblib                    1.2.0
jsonpointer               2.3
jsonref                   1.1.0
jsonschema                4.17.3
jupyter                   1.0.0
jupyter_client            7.4.9
jupyter-console           6.6.3
jupyter_core              4.12.0
jupyter-server            1.23.6
jupyterlab-pygments       0.2.2
jupyterlab-widgets        3.0.5
kiwisolver                1.4.4
kornia                    0.6.10
langid                    1.1.6
Markdown                  3.4.1
MarkupSafe                2.1.2
matplotlib                3.5.3
matplotlib-inline         0.1.6
mistune                   2.0.5
monotonic                 1.6
msgpack                   1.0.5
multidict                 6.0.4
munch                     2.5.0
nbclassic                 0.5.3
nbclient                  0.7.2
nbconvert                 7.2.10
nbformat                  5.7.3
neptune-client            0.16.18
nest-asyncio              1.5.6
networkx                  2.6.3
notebook                  6.2.0
notebook_shim             0.2.2
numpy                     1.21.6
nvidia-cublas-cu11        11.10.3.66
nvidia-cuda-nvrtc-cu11    11.7.99
nvidia-cuda-runtime-cu11  11.7.99
nvidia-cudnn-cu11         8.5.0.96
oauthlib                  3.2.2
omegaconf                 2.2.3
open-metric-learning      0.3.13
opencv-python             4.7.0.72
opencv-python-headless    4.7.0.72
oscrypto                  1.3.0
packaging                 23.0
pandas                    1.3.5
pandocfilters             1.5.0
parso                     0.8.3
pexpect                   4.8.0
pickleshare               0.7.5
Pillow                    9.4.0
pip                       22.3.1
pkgutil_resolve_name      1.3.10
pluggy                    1.0.0
prometheus-client         0.16.0
prompt-toolkit            3.0.38
protobuf                  3.20.1
psutil                    5.9.4
ptyprocess                0.7.0
pyasn1                    0.4.8
pyasn1-modules            0.2.8
pyclipper                 1.3.0.post4
pycparser                 2.21
pydensecrf                1.0rc2
pyDeprecate               0.3.2
Pygments                  2.14.0
PyJWT                     2.6.0
pymongo                   4.3.3
pyparsing                 3.0.9
pyproj                    3.2.1
pyrsistent                0.19.3
PySocks                   1.7.1
pytest                    7.2.2
python-dateutil           2.8.2
python-dotenv             0.21.1
pytorch-lightning         1.6.5
pytorch-metric-learning   2.0.1
pytz                      2022.7.1
PyWavelets                1.3.0
PyYAML                    6.0
pyzmq                     25.0.1
qtconsole                 5.4.1
QtPy                      2.3.0
qudida                    0.0.4
regex                     2022.10.31
requests                  2.28.2
requests-oauthlib         1.3.1
rfc3339-validator         0.1.4
rfc3986                   1.5.0
rfc3987                   1.3.8
rsa                       4.9
s3transfer                0.6.0
scikit-image              0.19.3
scikit-learn              1.0.2
scipy                     1.7.3
Send2Trash                1.8.0
sentencepiece             0.1.97
setuptools                65.5.1
shapely                   2.0.1
simplejson                3.18.4
six                       1.16.0
smmap                     5.0.0
sniffio                   1.3.0
soupsieve                 2.4
swagger-spec-validator    3.0.3
tensorboard               2.11.2
tensorboard-data-server   0.6.1
tensorboard-plugin-wit    1.8.1
tensorboardX              2.6
terminado                 0.17.1
threadpoolctl             3.1.0
tifffile                  2021.11.2
tinycss2                  1.2.1
tokenizers                0.13.2
tomli                     2.0.1
torch                     1.13.0
torch-summary             1.4.5
torchmetrics              0.11.4
torchvision               0.14.0
tornado                   6.2
tqdm                      4.65.0
traitlets                 5.9.0
transformers              4.27.1
ttach                     0.0.3
typing_extensions         4.5.0
uri-template              1.2.0
urllib3                   1.26.15
validators                0.20.0
wcwidth                   0.2.6
webcolors                 1.12
webencodings              0.5.1
websocket-client          1.5.1
websockets                10.4
Werkzeug                  2.2.3
wheel                     0.38.4
widgetsnbextension        4.0.5
yarl                      1.8.2
zipp                      3.15.0

And I still cannot get the latest version, 0.3.14, with pip install -U open-metric-learning in the Python 3.7 env.

AlekseySh commented 1 year ago

Weird. Can you get it on Python 3.8?

PapaMadeleine2022 commented 1 year ago

Weird. Python 3.8 works well. Could you add a requirements.txt or a doc describing the required environment?

AlekseySh commented 1 year ago

With 3.7 you used pytorch-lightning 1.6.5, right? What is your Lightning version when you use 3.8, @PapaMadeleine2022?

AlekseySh commented 1 year ago

My guess is that different Python versions may lead to different Lightning versions, which may be the cause of the error.
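
One way to check this is to print the interpreter and package versions in each environment and compare them. A minimal sketch, assuming the standard-library `importlib.metadata` (available from Python 3.8) or its `importlib-metadata` backport on Python 3.7; `collect_versions` is a hypothetical helper name, and the package names are the ones listed earlier in the thread.

```python
import sys

try:
    from importlib import metadata  # standard library from Python 3.8
except ImportError:
    # Python 3.7: requires the importlib-metadata backport to be installed.
    import importlib_metadata as metadata


def collect_versions(packages):
    """Return the Python version plus the installed version of each package.

    Packages that are not installed are reported as None.
    """
    report = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    names = [
        "open-metric-learning",
        "pytorch-lightning",
        "torch",
        "torchmetrics",
        "torchvision",
    ]
    for key, value in collect_versions(names).items():
        print(f"{key}: {value}")
```

Running this in both the 3.7 and the 3.8 environment and diffing the output would show whether any of these packages differ between the two.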

PapaMadeleine2022 commented 1 year ago

> With 3.7 you used pytorch-lightning 1.6.5, right? What is your lightning's version when you use 3.8, @PapaMadeleine2022 ?

The Lightning version is still 1.6.5 when I use Python 3.8.

PapaMadeleine2022 commented 1 year ago

@AlekseySh The Lightning version is still 1.6.5 when I use Python 3.8.

AlekseySh commented 1 year ago

We need help here :) Anyone who wants to work on the issue is welcome

AlekseySh commented 9 months ago

We no longer support python 3.7

PapaMadeleine2022 commented 8 months ago

DDP doesn't work with Python 3.8 either!!! When I set devices: 1, the model trains well. But when I set devices: 2, devices: 4, or devices: [0,1,2,3], it shows this error:

Traceback (most recent call last):
  File "/workdir/xxx/metric-learning/open-metric-learning-release.0.4.0/pipelines/features_extraction/extractor_sateDiff/train_cub.py", line 10, in main_hydra
    extractor_training_pipeline(cfg)
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/oml/lightning/pipelines/train.py", line 142, in extractor_training_pipeline
    trainer.fit(model=pl_module)
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 199, in run
    self.on_run_start(*args, **kwargs)
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 209, in on_run_start
    self.epoch_loop.val_loop._reload_evaluation_dataloaders()
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 237, in _reload_evaluation_dataloaders
    self.trainer.reset_val_dataloader()
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1965, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._data_connector._reset_eval_dataloader(
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 372, in _reset_eval_dataloader
    dataloaders = self._request_dataloader(mode, model=model)
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 459, in _request_dataloader
    dataloader = source.dataloader()
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 532, in dataloader
    return self.instance.trainer._call_lightning_module_hook(self.name, pl_module=self.instance)
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/oml/lightning/modules/ddp.py", line 34, in val_dataloader
    return self._patch_loaders("val") if self.loaders_val else super(ModuleDDP, self).val_dataloader()
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/core/hooks.py", line 599, in val_dataloader
    raise MisconfigurationException("`val_dataloader` must be implemented to be used with the Lightning Trainer")
pytorch_lightning.utilities.exceptions.MisconfigurationException: `val_dataloader` must be implemented to be used with the Lightning Trainer

envs: python3.8 open-metric-learning==0.4.0

It is weird!

AlekseySh commented 8 months ago

@PapaMadeleine2022 Hey! First of all, we significantly updated OML and its requirements (so it works with Lightning and PyTorch > 2.0), so please update your OML.