Closed PapaMadeleine2022 closed 9 months ago
I don't really understand what code you are running. Is it validate_cub.py?
train_cub.py
What is your accelerator? Please try both: gpu with devices: 2 and cpu with devices: 2.
accelerator: gpu
Does it work with cpu and multiple devices?
It does not seem to work with cpu and multiple devices.
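For anyone reproducing this, the accelerator/devices combinations discussed above can be enumerated with a small helper. This is only a sketch: it assumes that accelerator and devices are top-level keys in the pipeline config and that train_cub.py accepts Hydra-style key=value overrides on the command line. The helper name build_commands is hypothetical and not part of OML.

```python
from itertools import product

# Script name taken from the thread; run it from the pipeline directory.
SCRIPT = "train_cub.py"


def build_commands(accelerators=("gpu", "cpu"), devices=(1, 2)):
    """Return one command (as an argument list) per accelerator/devices pair.

    Assumption: `accelerator` and `devices` are top-level config keys that
    can be overridden with Hydra's `key=value` syntax.
    """
    return [
        ["python", SCRIPT, f"accelerator={acc}", f"devices={n}"]
        for acc, n in product(accelerators, devices)
    ]


if __name__ == "__main__":
    for cmd in build_commands():
        print(" ".join(cmd))
```

Running each printed command in turn shows which combinations fail, which helps separate a DDP problem from a GPU-specific one.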
Hmm, it's weird, because it works for me. Could you clear your env and install the latest version of OML?
I'm pretty sure the problem is in the environment or the library versions. So, if the idea above doesn't work, you can try running it in Docker. You can pull a ready-to-use image from Docker Hub; see the installation section.
PS. Are you on Linux?
@AlekseySh Yes, on Linux.
My env, from pip list, is:
absl-py 1.4.0
aiohttp 3.8.4
aiosignal 1.3.1
albumentations 1.3.0
antlr4-python3-runtime 4.9.3
anyio 3.6.2
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
arrow 1.2.3
asn1crypto 1.5.1
async-timeout 4.0.2
asynctest 0.13.0
attrs 22.2.0
backcall 0.2.0
backports.cached-property 1.0.2
beautifulsoup4 4.11.2
bleach 6.0.0
boto3 1.26.94
botocore 1.29.94
bravado 11.0.3
bravado-core 5.17.1
cached-property 1.5.2
cachetools 5.3.0
certifi 2022.12.7
cffi 1.15.1
chardet 3.0.4
charset-normalizer 3.1.0
click 8.1.3
click-plugins 1.1.1
cligj 0.7.2
colorama 0.4.6
cPython 0.0.6
ctranslate2 3.9.0
cycler 0.11.0
debugpy 1.6.6
decorator 5.1.1
deepl 1.14.0
defusedxml 0.7.1
dnspython 2.3.0
editdistance 0.6.2
einops 0.6.0
entrypoints 0.4
exceptiongroup 1.1.1
faiss 1.5.3
fastjsonschema 2.16.3
filelock 3.10.0
Fiona 1.9.1
fonttools 4.38.0
fqdn 1.5.1
freetype-py 2.3.0
frozenlist 1.3.3
fsspec 2023.1.0
future 0.18.3
gdown 4.6.4
geopandas 0.10.2
gitdb 4.0.10
GitPython 3.1.31
google-auth 2.16.2
google-auth-oauthlib 0.4.6
googletrans 4.0.0rc1
grad-cam 1.4.6
grpcio 1.51.3
h11 0.9.0
h2 3.2.0
hpack 3.0.0
hstspreload 2023.1.1
httpcore 0.9.1
httpx 0.13.3
huggingface-hub 0.13.2
hydra-core 1.2.0
hyperframe 5.2.0
idna 2.10
ImageHash 4.3.1
imageio 2.26.0
importlib-metadata 6.0.0
importlib-resources 5.12.0
iniconfig 2.0.0
ipykernel 6.16.2
ipython 7.34.0
ipython-genutils 0.2.0
ipywidgets 8.0.4
isoduration 20.11.0
jedi 0.18.2
Jinja2 3.1.2
jmespath 1.0.1
joblib 1.2.0
jsonpointer 2.3
jsonref 1.1.0
jsonschema 4.17.3
jupyter 1.0.0
jupyter_client 7.4.9
jupyter-console 6.6.3
jupyter_core 4.12.0
jupyter-server 1.23.6
jupyterlab-pygments 0.2.2
jupyterlab-widgets 3.0.5
kiwisolver 1.4.4
kornia 0.6.10
langid 1.1.6
Markdown 3.4.1
MarkupSafe 2.1.2
matplotlib 3.5.3
matplotlib-inline 0.1.6
mistune 2.0.5
monotonic 1.6
msgpack 1.0.5
multidict 6.0.4
munch 2.5.0
nbclassic 0.5.3
nbclient 0.7.2
nbconvert 7.2.10
nbformat 5.7.3
neptune-client 0.16.18
nest-asyncio 1.5.6
networkx 2.6.3
notebook 6.2.0
notebook_shim 0.2.2
numpy 1.21.6
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
oauthlib 3.2.2
omegaconf 2.2.3
open-metric-learning 0.3.13
opencv-python 4.7.0.72
opencv-python-headless 4.7.0.72
oscrypto 1.3.0
packaging 23.0
pandas 1.3.5
pandocfilters 1.5.0
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.4.0
pip 22.3.1
pkgutil_resolve_name 1.3.10
pluggy 1.0.0
prometheus-client 0.16.0
prompt-toolkit 3.0.38
protobuf 3.20.1
psutil 5.9.4
ptyprocess 0.7.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pyclipper 1.3.0.post4
pycparser 2.21
pydensecrf 1.0rc2
pyDeprecate 0.3.2
Pygments 2.14.0
PyJWT 2.6.0
pymongo 4.3.3
pyparsing 3.0.9
pyproj 3.2.1
pyrsistent 0.19.3
PySocks 1.7.1
pytest 7.2.2
python-dateutil 2.8.2
python-dotenv 0.21.1
pytorch-lightning 1.6.5
pytorch-metric-learning 2.0.1
pytz 2022.7.1
PyWavelets 1.3.0
PyYAML 6.0
pyzmq 25.0.1
qtconsole 5.4.1
QtPy 2.3.0
qudida 0.0.4
regex 2022.10.31
requests 2.28.2
requests-oauthlib 1.3.1
rfc3339-validator 0.1.4
rfc3986 1.5.0
rfc3987 1.3.8
rsa 4.9
s3transfer 0.6.0
scikit-image 0.19.3
scikit-learn 1.0.2
scipy 1.7.3
Send2Trash 1.8.0
sentencepiece 0.1.97
setuptools 65.5.1
shapely 2.0.1
simplejson 3.18.4
six 1.16.0
smmap 5.0.0
sniffio 1.3.0
soupsieve 2.4
swagger-spec-validator 3.0.3
tensorboard 2.11.2
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorboardX 2.6
terminado 0.17.1
threadpoolctl 3.1.0
tifffile 2021.11.2
tinycss2 1.2.1
tokenizers 0.13.2
tomli 2.0.1
torch 1.13.0
torch-summary 1.4.5
torchmetrics 0.11.4
torchvision 0.14.0
tornado 6.2
tqdm 4.65.0
traitlets 5.9.0
transformers 4.27.1
ttach 0.0.3
typing_extensions 4.5.0
uri-template 1.2.0
urllib3 1.26.15
validators 0.20.0
wcwidth 0.2.6
webcolors 1.12
webencodings 0.5.1
websocket-client 1.5.1
websockets 10.4
Werkzeug 2.2.3
wheel 0.38.4
widgetsnbextension 4.0.5
yarl 1.8.2
zipp 3.15.0
And I still cannot get the latest 0.3.14 with pip install -U open-metric-learning in a python3.7 env.
Weird. Can you get it on python 3.8?
Weird. python3.8 works well. Would you add a requirements.txt or a doc about the required envs?
With 3.7 you used pytorch-lightning 1.6.5, right? What is your lightning's version when you use 3.8, @PapaMadeleine2022 ?
I have a guess that different python versions may lead to different lightning versions, which may be a cause of the error
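One way to check this guess is to print the same short version report in both environments and compare the two outputs. The sketch below assumes the distribution names shown in the pip list above. The helper name collect_versions is hypothetical, and on python 3.7 it relies on the importlib-metadata backport being installed.

```python
import sys

try:
    # Standard library on Python 3.8+.
    from importlib.metadata import PackageNotFoundError, version
except ImportError:
    # Python 3.7: requires the `importlib-metadata` backport.
    from importlib_metadata import PackageNotFoundError, version

# Distribution names as they appear in `pip list`.
PACKAGES = (
    "open-metric-learning",
    "pytorch-lightning",
    "torch",
    "torchvision",
    "torchmetrics",
)


def collect_versions(packages=PACKAGES):
    """Map each package name to its installed version (None if missing)."""
    report = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ver in collect_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

If the two reports differ only in the python line, the library versions are probably not what separates the working environment from the failing one.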
@AlekseySh The lightning's version is still 1.6.5 when I use python3.8
We need help here :) Anyone who wants to work on the issue is welcome
We no longer support python 3.7
DDP doesn't work with python3.8!!! When I set devices: 1, it trains the model well. But when I set devices: 2 or devices: 4 or devices: [0,1,2,3], it shows this error:
Traceback (most recent call last):
  File "/workdir/xxx/metric-learning/open-metric-learning-release.0.4.0/pipelines/features_extraction/extractor_sateDiff/train_cub.py", line 10, in main_hydra
    extractor_training_pipeline(cfg)
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/oml/lightning/pipelines/train.py", line 142, in extractor_training_pipeline
    trainer.fit(model=pl_module)
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 199, in run
    self.on_run_start(*args, **kwargs)
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 209, in on_run_start
    self.epoch_loop.val_loop._reload_evaluation_dataloaders()
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 237, in _reload_evaluation_dataloaders
    self.trainer.reset_val_dataloader()
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1965, in reset_val_dataloader
    self.num_val_batches, self.val_dataloaders = self._data_connector._reset_eval_dataloader(
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 372, in _reset_eval_dataloader
    dataloaders = self._request_dataloader(mode, model=model)
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 459, in _request_dataloader
    dataloader = source.dataloader()
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 532, in dataloader
    return self.instance.trainer._call_lightning_module_hook(self.name, pl_module=self.instance)
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/oml/lightning/modules/ddp.py", line 34, in val_dataloader
    return self._patch_loaders("val") if self.loaders_val else super(ModuleDDP, self).val_dataloader()
  File "/workdir/anaconda3/envs/py3.8/lib/python3.8/site-packages/pytorch_lightning/core/hooks.py", line 599, in val_dataloader
    raise MisconfigurationException("`val_dataloader` must be implemented to be used with the Lightning Trainer")
pytorch_lightning.utilities.exceptions.MisconfigurationException: `val_dataloader` must be implemented to be used with the Lightning Trainer
envs: python3.8 open-metric-learning==0.4.0
It is weird!
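The last two frames of the traceback suggest how the exception is reached: the conditional in ddp.py falls through to the base val_dataloader hook whenever self.loaders_val is falsy, and that base hook raises. The sketch below reproduces only this control flow with stand-in classes. BaseModule and DDPLikeModule are hypothetical names, not OML or Lightning code, and the sketch does not explain why loaders_val would be empty when several devices are used.

```python
class BaseModule:
    """Stand-in for the Lightning base class, whose hook raises by default."""

    def val_dataloader(self):
        raise NotImplementedError("`val_dataloader` must be implemented")


class DDPLikeModule(BaseModule):
    """Stand-in mirroring the conditional shown in the traceback."""

    def __init__(self, loaders_val=None):
        self.loaders_val = loaders_val

    def _patch_loaders(self, mode):
        # Placeholder for whatever patching the real module performs.
        return [f"patched-{mode}-{loader}" for loader in self.loaders_val]

    def val_dataloader(self):
        # Same shape as the line in the traceback: a falsy `loaders_val`
        # (None or an empty list) falls through to the raising base hook.
        if self.loaders_val:
            return self._patch_loaders("val")
        return super().val_dataloader()


if __name__ == "__main__":
    print(DDPLikeModule(["loader0"]).val_dataloader())
    try:
        DDPLikeModule(None).val_dataloader()
    except NotImplementedError as err:
        print("raised:", err)
```

So a useful thing to check in the failing run is whether the validation loaders are actually set on the module before trainer.fit is called.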
@PapaMadeleine2022 Hey! First of all, we significantly updated OML and its requirements (so it works with Lightning and PyTorch > 2.0), so please update your OML.
upd: we understood the problem occurs on python 3.7, for 3.8 it works well
Hello, when I set devices: 1, it trains the model well. But when I set devices: 2 or devices: 4 or devices: [0,1,2,3], it shows an error. My envs are: