Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Symlink last checkpoint will fail on Windows due to permission error #18900

Closed aweinmann closed 1 year ago

aweinmann commented 1 year ago

Bug description

Hi, with Lightning v2.1 on Windows, creating the symlink for the "last" checkpoint fails with a permission error. This is because creating a (soft) symlink on Windows requires admin permissions by default.

What version are you seeing the problem on?

v2.1

How to reproduce the bug

Train any model with the ModelCheckpoint callback (save_last=True and save_top_k!=0) on Windows without admin permissions.

Error messages and logs

lightning\pytorch\callbacks\model_checkpoint.py", line 388, in _link_checkpoint
    os.symlink(filepath, linkpath)
OSError: [WinError 1314] A required privilege is not held by the client

Environment

Current environment

* CUDA:
    - GPU:
        - NVIDIA GeForce RTX 3080 Ti Laptop GPU
    - available: True
    - version: 12.1
* Lightning:
    - lightning: 2.1.0
    - lightning-cloud: 0.5.37
    - lightning-utilities: 0.9.0
    - pytorch-lightning: 2.1.0
    - pytorch-optimizer: 2.11.1
    - torch: 2.1.0+cu121
    - torchaudio: 2.1.0+cu121
    - torchmetrics: 1.2.0
    - torchvision: 0.16.0+cu121
* Packages:
    - absl-py: 1.4.0
    - aiohttp: 3.8.5
    - aiosignal: 1.3.1
    - alembic: 1.11.1
    - ansicon: 1.89.0
    - antlr4-python3-runtime: 4.9.3
    - anyio: 3.7.1
    - arrow: 1.2.3
    - asttokens: 2.2.1
    - async-timeout: 4.0.2
    - attrs: 23.1.0
    - autopage: 0.5.1
    - av: 10.0.0
    - backcall: 0.2.0
    - beautifulsoup4: 4.12.2
    - black: 23.7.0
    - blessed: 1.20.0
    - cachetools: 5.3.1
    - certifi: 2022.12.7
    - cfgv: 3.3.1
    - charset-normalizer: 2.1.1
    - click: 8.1.6
    - cliff: 4.3.0
    - cmaes: 0.10.0
    - cmd2: 2.4.3
    - colorama: 0.4.6
    - colorlog: 6.7.0
    - comm: 0.1.3
    - contourpy: 1.1.0
    - croniter: 1.3.15
    - cycler: 0.11.0
    - dateutils: 0.6.12
    - debugpy: 1.6.7
    - decorator: 5.1.1
    - deepdiff: 6.3.1
    - distlib: 0.3.7
    - exceptiongroup: 1.1.2
    - executing: 1.2.0
    - fastapi: 0.100.0
    - fastcore: 1.5.29
    - filelock: 3.12.2
    - fonttools: 4.41.0
    - frozenlist: 1.4.0
    - fsspec: 2023.6.0
    - furl: 2.1.3
    - google-auth: 2.22.0
    - google-auth-oauthlib: 1.0.0
    - greenlet: 2.0.2
    - grpcio: 1.56.2
    - h11: 0.14.0
    - h5py: 3.9.0
    - hydra-colorlog: 1.2.0
    - hydra-core: 1.3.2
    - hydra-joblib-launcher: 1.2.0
    - hydra-optuna-sweeper: 1.2.0
    - identify: 2.5.25
    - idna: 3.4
    - imageio: 2.31.6
    - importlib-metadata: 6.8.0
    - iniconfig: 2.0.0
    - inquirer: 3.1.3
    - ipykernel: 6.24.0
    - ipython: 8.14.0
    - itsdangerous: 2.1.2
    - jedi: 0.18.2
    - jinja2: 3.1.2
    - jinxed: 1.2.0
    - joblib: 1.3.1
    - jsonschema: 4.18.4
    - jsonschema-specifications: 2023.7.1
    - jupyter-client: 8.3.0
    - jupyter-core: 5.3.1
    - kiwisolver: 1.4.4
    - lightning: 2.1.0
    - lightning-cloud: 0.5.37
    - lightning-utilities: 0.9.0
    - mako: 1.2.4
    - markdown: 3.4.3
    - markdown-it-py: 3.0.0
    - markupsafe: 2.1.2
    - matplotlib: 3.8.0
    - matplotlib-inline: 0.1.6
    - mdurl: 0.1.2
    - mplcyberpunk: 0.7.0
    - mpmath: 1.2.1
    - multidict: 6.0.4
    - mypy-extensions: 1.0.0
    - nest-asyncio: 1.5.6
    - networkx: 3.0
    - nodeenv: 1.8.0
    - numpy: 1.26.0
    - oauthlib: 3.2.2
    - omegaconf: 2.3.0
    - openexr: 1.3.8
    - optuna: 2.10.1
    - ordered-set: 4.1.0
    - orderedmultidict: 1.0.1
    - packaging: 23.1
    - pandas: 2.1.2
    - parso: 0.8.3
    - pathlib2: 2.3.7.post1
    - pathspec: 0.11.2
    - pbr: 5.11.1
    - pefile: 2023.2.7
    - pickleshare: 0.7.5
    - pillow: 9.3.0
    - pip: 23.3.1
    - platformdirs: 3.9.1
    - pluggy: 1.2.0
    - pre-commit: 3.5.0
    - prettytable: 3.8.0
    - prompt-toolkit: 3.0.39
    - protobuf: 4.23.4
    - psutil: 5.9.5
    - pure-eval: 0.2.2
    - pyasn1: 0.5.0
    - pyasn1-modules: 0.3.0
    - pydantic: 1.10.11
    - pygments: 2.15.1
    - pyjwt: 2.4.0
    - pyparsing: 3.0.9
    - pyperclip: 1.8.2
    - pyreadline3: 3.4.1
    - pyroexr: 0.2.0
    - pyrootutils: 1.0.4
    - pytest: 7.4.3
    - python-dateutil: 2.8.2
    - python-dotenv: 1.0.0
    - python-editor: 1.0.4
    - python-multipart: 0.0.6
    - pytorch-lightning: 2.1.0
    - pytorch-optimizer: 2.11.1
    - pytz: 2023.3
    - pywin32: 306
    - pyyaml: 6.0.1
    - pyzmq: 25.1.0
    - readchar: 4.0.5
    - referencing: 0.30.0
    - requests: 2.28.1
    - requests-oauthlib: 1.3.1
    - rich: 13.6.0
    - rpds-py: 0.9.2
    - rsa: 4.9
    - scipy: 1.11.1
    - seaborn: 0.13.0
    - setuptools: 65.5.0
    - six: 1.16.0
    - sniffio: 1.3.0
    - soupsieve: 2.4.1
    - sqlalchemy: 2.0.19
    - stack-data: 0.6.2
    - starlette: 0.27.0
    - starsessions: 1.3.0
    - stevedore: 5.1.0
    - sympy: 1.11.1
    - tensorboard: 2.15.0
    - tensorboard-data-server: 0.7.1
    - tomli: 2.0.1
    - torch: 2.1.0+cu121
    - torchaudio: 2.1.0+cu121
    - torchmetrics: 1.2.0
    - torchvision: 0.16.0+cu121
    - tornado: 6.3.2
    - tqdm: 4.66.1
    - traitlets: 5.9.0
    - typing-extensions: 4.7.1
    - tzdata: 2023.3
    - urllib3: 1.26.13
    - uvicorn: 0.23.1
    - virtualenv: 20.24.1
    - wcwidth: 0.2.6
    - websocket-client: 1.6.1
    - websockets: 11.0.3
    - werkzeug: 2.3.6
    - wheel: 0.40.0
    - yarl: 1.9.2
    - zipp: 3.16.2
* System:
    - OS: Windows
    - architecture:
        - 64bit
        - WindowsPE
    - processor: Intel64 Family 6 Model 154 Stepping 3, GenuineIntel
    - python: 3.10.11
    - release: 10
    - version: 10.0.22621

More info

No response

cc @carmocca @awaelchli

awaelchli commented 1 year ago

Hey @aweinmann, I'm sorry about that; we shouldn't have pushed this feature in at the last minute. Our CI runners must have extra permissions that allow symlink creation; otherwise we would have seen the permission error in our Windows test suite too.
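
To tell whether a given environment (a CI runner, for example) allows symlink creation, one option is to probe for it directly. The following is only a sketch using the standard library; `can_create_symlink` is a hypothetical helper name and is not part of Lightning.

```python
import os
import tempfile


def can_create_symlink() -> bool:
    """Return True if this process is able to create a symlink.

    Hypothetical helper for illustration; it probes inside a throwaway
    temporary directory so nothing is left behind.
    """
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target.txt")
        link = os.path.join(tmp, "link.txt")
        with open(target, "w") as f:
            f.write("probe")
        try:
            os.symlink(target, link)
        except OSError:
            # On Windows without the required privilege this is WinError 1314.
            return False
        return os.path.islink(link)


if __name__ == "__main__":
    print("symlinks allowed:", can_create_symlink())
```

A test suite could use such a probe to skip or to explicitly exercise the no-symlink path, rather than depending on how the runner happens to be configured.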

Since it is not possible to reliably create a symlink on Windows without elevated permissions, we will have to guard the symlink creation and fall back to saving a copy of the file when permissions are not granted :(
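
The guard-and-fall-back idea described above could look roughly like the sketch below. It is not the actual Lightning fix: `link_or_copy_checkpoint` is a hypothetical name, the code assumes both paths refer to regular files on the local filesystem, and it treats any `OSError` from `os.symlink` as "symlinks not permitted".

```python
import os
import shutil


def link_or_copy_checkpoint(filepath: str, linkpath: str) -> str:
    """Make `linkpath` refer to the checkpoint at `filepath`.

    Tries a symlink first and falls back to a full copy if the OS refuses.
    Returns "symlink" or "copy" so the caller can see which path was taken.
    Hypothetical helper for illustration only.
    """
    # Remove a stale link or copy left over from a previous save.
    if os.path.islink(linkpath) or os.path.isfile(linkpath):
        os.remove(linkpath)
    try:
        os.symlink(filepath, linkpath)
        return "symlink"
    except OSError:
        # On Windows without the required privilege: WinError 1314.
        shutil.copy(filepath, linkpath)
        return "copy"
```

The trade-off is that the copy uses extra disk space and write time for every "last" checkpoint, but training no longer crashes when symlinks are unavailable.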