Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0
28.35k stars, 3.39k forks

Access denied to save model checkpoint on AWS S3. #18805

Closed: celsofranssa closed this issue 1 year ago

celsofranssa commented 1 year ago

Bug description

Saving a model checkpoint to AWS S3 fails with an AccessDenied error on the CreateMultipartUpload operation.

What version are you seeing the problem on?

v2.0

How to reproduce the bug

Run a PyTorch Lightning project with cloud-based checkpoints on AWS SageMaker:

from pytorch_lightning import Trainer

# `default_root_dir` is the default path used for logs and checkpoints
trainer = Trainer(default_root_dir="s3://my_bucket/data/")
trainer.fit(model)  # `model` is any LightningModule

Error messages and logs

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/s3fs/core.py", line 113, in _error_wrapper
    return await func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/client.py", line 383, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the CreateMultipartUpload operation: Access Denied

Environment

Current environment

* CUDA:
  - GPU:
    - NVIDIA GeForce RTX 3090
  - available: True
  - version: 11.7
* Lightning:
  - lightning-utilities: 0.9.0
  - pytorch-lightning: 2.0.9.post0
  - pytorch-lightning-bolts: 0.3.2.post1
  - torch: 2.0.1
  - torchmetrics: 1.0.3
* Packages:
  - absl-py: 1.0.0
  - aiobotocore: 2.5.4
  - aiohttp: 3.8.1
  - aioitertools: 0.11.0
  - aiosignal: 1.2.0
  - antlr4-python3-runtime: 4.9.3
  - appdirs: 1.4.4
  - async-timeout: 4.0.2
  - attrs: 23.1.0
  - boto3: 1.28.63
  - botocore: 1.31.63
  - cachetools: 5.0.0
  - certifi: 2021.10.8
  - charset-normalizer: 2.0.11
  - click: 8.0.3
  - cloudpickle: 2.2.1
  - cmake: 3.27.6
  - contextlib2: 21.6.0
  - dill: 0.3.7
  - docker-pycreds: 0.4.0
  - filelock: 3.4.2
  - frozenlist: 1.3.0
  - fsspec: 2023.9.2
  - future: 0.18.2
  - gitdb: 4.0.10
  - gitpython: 3.1.37
  - google-auth: 2.6.0
  - google-pasta: 0.2.0
  - grpcio: 1.59.0
  - huggingface-hub: 0.18.0
  - hydra-core: 1.3.2
  - idna: 3.3
  - importlib-metadata: 4.10.1
  - importlib-resources: 5.4.0
  - jinja2: 3.1.2
  - jmespath: 1.0.1
  - joblib: 1.3.2
  - jsonschema: 4.19.1
  - jsonschema-specifications: 2023.7.1
  - lightning-utilities: 0.9.0
  - lit: 17.0.2
  - markdown: 3.3.6
  - markupsafe: 2.1.3
  - mpmath: 1.3.0
  - multidict: 6.0.2
  - multiprocess: 0.70.15
  - networkx: 3.1
  - numpy: 1.24.4
  - nvidia-cublas-cu11: 11.10.3.66
  - nvidia-cuda-cupti-cu11: 11.7.101
  - nvidia-cuda-nvrtc-cu11: 11.7.99
  - nvidia-cuda-runtime-cu11: 11.7.99
  - nvidia-cudnn-cu11: 8.5.0.96
  - nvidia-cufft-cu11: 10.9.0.58
  - nvidia-curand-cu11: 10.2.10.91
  - nvidia-cusolver-cu11: 11.4.0.1
  - nvidia-cusparse-cu11: 11.7.4.91
  - nvidia-nccl-cu11: 2.14.3
  - nvidia-nvtx-cu11: 11.7.91
  - oauthlib: 3.2.0
  - omegaconf: 2.3.0
  - packaging: 21.3
  - pandas: 2.0.3
  - pathos: 0.3.1
  - pathtools: 0.1.2
  - pillow: 9.0.1
  - pip: 20.0.2
  - pkg-resources: 0.0.0
  - pkgutil-resolve-name: 1.3.10
  - platformdirs: 3.11.0
  - pox: 0.3.3
  - ppft: 1.7.6.7
  - protobuf: 3.19.4
  - psutil: 5.9.5
  - pyasn1: 0.4.8
  - pyasn1-modules: 0.2.8
  - pydeprecate: 0.3.0
  - pyparsing: 3.0.7
  - python-dateutil: 2.8.2
  - pytorch-lightning: 2.0.9.post0
  - pytorch-lightning-bolts: 0.3.2.post1
  - pytz: 2023.3.post1
  - pyyaml: 6.0.1
  - referencing: 0.30.2
  - regex: 2022.1.18
  - requests: 2.31.0
  - requests-oauthlib: 1.3.1
  - rpds-py: 0.10.6
  - rsa: 4.8
  - s3fs: 2023.9.2
  - s3transfer: 0.7.0
  - sacremoses: 0.0.47
  - safetensors: 0.4.0
  - sagemaker: 2.192.0
  - schema: 0.7.5
  - scikit-learn: 1.3.1
  - scipy: 1.10.1
  - sentry-sdk: 1.32.0
  - setproctitle: 1.3.3
  - setuptools: 44.0.0
  - six: 1.16.0
  - smdebug-rulesconfig: 1.0.1
  - smmap: 5.0.1
  - sympy: 1.12
  - tblib: 1.7.0
  - tensorboard-data-server: 0.7.1
  - tensorboard-plugin-wit: 1.8.1
  - threadpoolctl: 3.2.0
  - tokenizers: 0.13.3
  - torch: 2.0.1
  - torchmetrics: 1.0.3
  - tqdm: 4.66.1
  - transformers: 4.31.0
  - triton: 2.0.0
  - typing-extensions: 4.0.1
  - tzdata: 2023.3
  - urllib3: 1.26.17
  - wandb: 0.15.12
  - werkzeug: 2.0.3
  - wheel: 0.34.2
  - wrapt: 1.15.0
  - yarl: 1.7.2
  - zipp: 3.7.0
* System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor: x86_64
  - python: 3.8.10
  - release: 5.15.0-84-generic
  - version: #93~20.04.1-Ubuntu SMP Wed Sep 6 16:15:40 UTC 2023

More info

No response

celsofranssa commented 1 year ago

I resolved this issue by passing the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY credentials.
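For anyone hitting the same error, here is a minimal sketch of a pre-flight check that fails fast when those two variables are not set in the training process. botocore (which s3fs uses through aiobotocore) reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment as part of its standard credential lookup. The helper names `missing_aws_credentials` and `require_aws_credentials` are hypothetical, not part of Lightning or s3fs, and the check only confirms the variables are present. It cannot tell whether the credentials actually grant write access to the bucket.

```python
import os

# The two variables named in the fix above.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(env=None):
    """Return the names of required AWS variables that are unset or empty.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    (Hypothetical helper, not a Lightning or s3fs API.)
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


def require_aws_credentials(env=None):
    """Raise RuntimeError before training starts if a variable is missing."""
    missing = missing_aws_credentials(env)
    if missing:
        raise RuntimeError(
            "Missing AWS credentials in the environment: " + ", ".join(missing)
        )
```

Calling `require_aws_credentials()` just before constructing the `Trainer` with an `s3://` `default_root_dir` surfaces a missing variable at startup rather than at the first checkpoint upload. If the variables are set and the upload is still denied, the permissions attached to those credentials for the target bucket are the next thing to check.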