Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

`CSVLogger` fails if `save_dir` is an s3 path #16196

Closed turian closed 1 year ago

turian commented 1 year ago

Bug description

Cloud checkpoints are cool! But I also want CSVLogger to periodically write to cloud storage. This doesn't work.

Related bug: Lightning-AI/pytorch-lightning#16195. See 'More info' at the bottom of this issue.

There are some related issues: https://github.com/Lightning-AI/lightning/pull/14325 https://github.com/Lightning-AI/lightning/issues/5935 https://github.com/Lightning-AI/lightning/issues/11769 https://github.com/Lightning-AI/lightning/issues/15539 https://github.com/Lightning-AI/lightning/issues/2318 https://github.com/Lightning-AI/lightning/issues/2161 but I haven't found this specifically.

How to reproduce the bug

Here is a Google Colab that replicates this and a related bug. I share the code for both because it's easier to configure the AWS credentials and see both bugs simultaneously.

Copying and pasting the most important bit (but see the colab for a full minimal replication):

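from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger
from torch.utils.data import DataLoader

# Assumption: the Colab defines BoringModel, RandomDataset and BORING_BUCKET
# (the s3:// bucket URL). In pytorch-lightning 1.8 the demo classes should be
# importable from the location below.
from pytorch_lightning.demos.boring_classes import BoringModel, RandomDataset

# Note: this excerpt is the WandbLogger variant from the Colab. The traceback
# below comes from the variant whose default_root_dir ends in "csvloggertest/".
def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    logger = WandbLogger(
        project="boringbug",
        log_model="all",
    )

    model = BoringModel()
    trainer = Trainer(
        limit_train_batches=1,
        limit_val_batches=1,
        limit_test_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        logger=logger,
        # Logs and checkpoints go under this remote prefix.
        default_root_dir=f"{BORING_BUCKET}/wandbtest/",
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.test(model, dataloaders=test_data)

run()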

Error messages and logs

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.utilities.rank_zero:`Trainer(limit_train_batches=1)` was configured so 1 batch per epoch will be used.
INFO:pytorch_lightning.utilities.rank_zero:`Trainer(limit_val_batches=1)` was configured so 1 batch will be used.
INFO:pytorch_lightning.utilities.rank_zero:`Trainer(limit_test_batches=1)` was configured so 1 batch will be used.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     37         else:
---> 38             return trainer_fn(*args, **kwargs)
     39 

16 frames
/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py in _fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    644         )
--> 645         self._run(model, ckpt_path=self.ckpt_path)
    646 

/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py in _run(self, model, ckpt_path)
   1085 
-> 1086         self._log_hyperparams()
   1087 

/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py in _log_hyperparams(self)
   1155             logger.log_graph(self.lightning_module)
-> 1156             logger.save()
   1157 

/usr/local/lib/python3.8/dist-packages/lightning_utilities/core/rank_zero.py in wrapped_fn(*args, **kwargs)
     23         if rank == 0:
---> 24             return fn(*args, **kwargs)
     25         return None

/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loggers/csv_logs.py in save(self)
    206         super().save()
--> 207         self.experiment.save()
    208 

/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loggers/csv_logs.py in save(self)
     86         hparams_file = os.path.join(self.log_dir, self.NAME_HPARAMS_FILE)
---> 87         save_hparams_to_yaml(hparams_file, self.hparams)
     88 

/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/saving.py in save_hparams_to_yaml(config_yaml, hparams, use_omegaconf)
    378     if not fs.isdir(os.path.dirname(config_yaml)):
--> 379         raise RuntimeError(f"Missing folder: {os.path.dirname(config_yaml)}.")
    380 

RuntimeError: Missing folder: s3://boringbucketjpt/csvloggerdoesntwork/lightning_logs/version_1.

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-11-025edeeafe89> in <module>
     24     trainer.test(model, dataloaders=test_data)
     25 
---> 26 run()

<ipython-input-11-025edeeafe89> in run()
     21         default_root_dir = f"{BORING_BUCKET}/csvloggertest/"
     22     )
---> 23     trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
     24     trainer.test(model, dataloaders=test_data)
     25 

/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    601             raise TypeError(f"`Trainer.fit()` requires a `LightningModule`, got: {model.__class__.__qualname__}")
    602         self.strategy._lightning_module = model
--> 603         call._call_and_handle_interrupt(
    604             self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    605         )

/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/call.py in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     60         trainer._call_callback_hooks("on_exception", exception)
     61         for logger in trainer.loggers:
---> 62             logger.finalize("failed")
     63         trainer._teardown()
     64         # teardown might access the stage so we reset it after

/usr/local/lib/python3.8/dist-packages/lightning_utilities/core/rank_zero.py in wrapped_fn(*args, **kwargs)
     22             raise RuntimeError("The `rank_zero_only.rank` needs to be set before use")
     23         if rank == 0:
---> 24             return fn(*args, **kwargs)
     25         return None
     26 

/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loggers/csv_logs.py in finalize(self, status)
    213             # initialized there
    214             return
--> 215         self.save()
    216 
    217     @property

/usr/local/lib/python3.8/dist-packages/lightning_utilities/core/rank_zero.py in wrapped_fn(*args, **kwargs)
     22             raise RuntimeError("The `rank_zero_only.rank` needs to be set before use")
     23         if rank == 0:
---> 24             return fn(*args, **kwargs)
     25         return None
     26 

/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loggers/csv_logs.py in save(self)
    205     def save(self) -> None:
    206         super().save()
--> 207         self.experiment.save()
    208 
    209     @rank_zero_only

/usr/local/lib/python3.8/dist-packages/pytorch_lightning/loggers/csv_logs.py in save(self)
     85         """Save recorded hparams and metrics into files."""
     86         hparams_file = os.path.join(self.log_dir, self.NAME_HPARAMS_FILE)
---> 87         save_hparams_to_yaml(hparams_file, self.hparams)
     88 
     89         if not self.metrics:

/usr/local/lib/python3.8/dist-packages/pytorch_lightning/core/saving.py in save_hparams_to_yaml(config_yaml, hparams, use_omegaconf)
    377     fs = get_filesystem(config_yaml)
    378     if not fs.isdir(os.path.dirname(config_yaml)):
--> 379         raise RuntimeError(f"Missing folder: {os.path.dirname(config_yaml)}.")
    380 
    381     # convert Namespace or AD to dict

RuntimeError: Missing folder: s3://boringbucketjpt/csvloggerdoesntwork/lightning_logs/version_1.
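
The last frame of the traceback shows the check that fails: `save_hparams_to_yaml` raises when `fs.isdir(os.path.dirname(config_yaml))` is false. The sketch below mimics that check against a toy in-memory store. `ToyObjectStore`, `save_hparams_sketch` and the bucket name are hypothetical and stand in for neither s3fs nor Lightning; the assumption is only that an object store reports a "directory" once at least one key exists under that prefix.

```python
import os


class ToyObjectStore:
    """Hypothetical stand-in for an object store: a flat mapping of keys to bytes."""

    def __init__(self):
        self.objects = {}

    def isdir(self, path):
        # A "directory" exists only if at least one key lives under the prefix.
        prefix = path.rstrip("/") + "/"
        return any(key.startswith(prefix) for key in self.objects)

    def write(self, path, data):
        self.objects[path] = data


def save_hparams_sketch(fs, config_yaml, payload):
    # Same shape as the check in the traceback: refuse to write
    # when the parent "folder" is not reported as a directory.
    parent = os.path.dirname(config_yaml)
    if not fs.isdir(parent):
        raise RuntimeError(f"Missing folder: {parent}.")
    fs.write(config_yaml, payload)


store = ToyObjectStore()
target = "s3://bucket/lightning_logs/version_1/hparams.yaml"

# On a fresh prefix nothing has been written yet, so the check fails.
try:
    save_hparams_sketch(store, target, b"{}")
    first_attempt = "ok"
except RuntimeError as err:
    first_attempt = str(err)

# Once any object exists under the prefix, the same check passes.
store.write("s3://bucket/lightning_logs/version_1/metrics.csv", b"")
save_hparams_sketch(store, target, b"{}")
```

Under this model, a check-before-write on a brand-new `version_N` prefix can never succeed unless something creates the prefix first, which is consistent with the error above.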

Environment

* CUDA:
    - GPU:
        - Tesla T4
    - available:         True
    - version:           11.6
* Lightning:
    - lightning-utilities: 0.5.0
    - pytorch-lightning: 1.8.6
    - torch:             1.13.0+cu116
    - torchaudio:        0.13.0+cu116
    - torchmetrics:      0.11.0
    - torchsummary:      1.5.1
    - torchtext:         0.14.0
    - torchvision:       0.14.0+cu116
* Packages:
    - absl-py:           1.3.0
    - aeppl:             0.0.33
    - aesara:            2.7.9
    - aiobotocore:       2.4.2
    - aiohttp:           3.8.3
    - aioitertools:      0.11.0
    - aiosignal:         1.3.1
    - alabaster:         0.7.12
    - albumentations:    1.2.1
    - altair:            4.2.0
    - appdirs:           1.4.4
    - arviz:             0.12.1
    - astor:             0.8.1
    - astropy:           4.3.1
    - astunparse:        1.6.3
    - async-timeout:     4.0.2
    - atari-py:          0.2.9
    - atomicwrites:      1.4.1
    - attrs:             22.1.0
    - audioread:         3.0.0
    - autograd:          1.5
    - awscli:            1.25.60
    - babel:             2.11.0
    - backcall:          0.2.0
    - beautifulsoup4:    4.6.3
    - bleach:            5.0.1
    - blis:              0.7.9
    - bokeh:             2.3.3
    - boto3:             1.24.59
    - botocore:          1.27.59
    - branca:            0.6.0
    - bs4:               0.0.1
    - cachecontrol:      0.12.11
    - cachetools:        5.2.0
    - catalogue:         2.0.8
    - certifi:           2022.12.7
    - cffi:              1.15.1
    - cftime:            1.6.2
    - chardet:           3.0.4
    - charset-normalizer: 2.1.1
    - click:             7.1.2
    - clikit:            0.6.2
    - cloudpickle:       1.5.0
    - cmake:             3.22.6
    - cmdstanpy:         1.0.8
    - colorama:          0.3.7
    - colorcet:          3.0.1
    - colorlover:        0.3.0
    - community:         1.0.0b1
    - confection:        0.0.3
    - cons:              0.4.5
    - contextlib2:       0.5.5
    - convertdate:       2.4.0
    - crashtest:         0.3.1
    - crcmod:            1.7
    - cryptography:      38.0.4
    - cufflinks:         0.17.3
    - cupy-cuda11x:      11.0.0
    - cvxopt:            1.3.0
    - cvxpy:             1.2.2
    - cycler:            0.11.0
    - cymem:             2.0.7
    - cython:            0.29.32
    - daft:              0.0.4
    - dask:              2022.2.1
    - datascience:       0.17.5
    - db-dtypes:         1.0.5
    - debugpy:           1.0.0
    - decorator:         4.4.2
    - defusedxml:        0.7.1
    - descartes:         1.1.0
    - dill:              0.3.6
    - distributed:       2022.2.1
    - dlib:              19.24.0
    - dm-tree:           0.1.7
    - dnspython:         2.2.1
    - docker-pycreds:    0.4.0
    - docutils:          0.16
    - dopamine-rl:       1.0.5
    - earthengine-api:   0.1.335
    - easydict:          1.10
    - ecos:              2.0.10
    - editdistance:      0.5.3
    - en-core-web-sm:    3.4.1
    - entrypoints:       0.4
    - ephem:             4.1.3
    - et-xmlfile:        1.1.0
    - etils:             0.9.0
    - etuples:           0.3.8
    - fa2:               0.3.5
    - fastai:            2.7.10
    - fastcore:          1.5.27
    - fastdownload:      0.0.7
    - fastdtw:           0.3.4
    - fastjsonschema:    2.16.2
    - fastprogress:      1.0.3
    - fastrlock:         0.8.1
    - feather-format:    0.4.1
    - filelock:          3.8.2
    - firebase-admin:    5.3.0
    - fix-yahoo-finance: 0.0.22
    - flask:             1.1.4
    - flatbuffers:       1.12
    - folium:            0.12.1.post1
    - frozenlist:        1.3.3
    - fsspec:            2022.11.0
    - future:            0.16.0
    - gast:              0.4.0
    - gdal:              2.2.2
    - gdown:             4.4.0
    - gensim:            3.6.0
    - geographiclib:     1.52
    - geopy:             1.17.0
    - gin-config:        0.5.0
    - gitdb:             4.0.10
    - gitpython:         3.1.29
    - glob2:             0.7
    - google:            2.0.3
    - google-api-core:   2.8.2
    - google-api-python-client: 1.12.11
    - google-auth:       2.15.0
    - google-auth-httplib2: 0.0.4
    - google-auth-oauthlib: 0.4.6
    - google-cloud-bigquery: 3.3.6
    - google-cloud-bigquery-storage: 2.16.2
    - google-cloud-core: 2.3.2
    - google-cloud-datastore: 2.9.0
    - google-cloud-firestore: 2.7.2
    - google-cloud-language: 2.6.1
    - google-cloud-storage: 2.5.0
    - google-cloud-translate: 3.8.4
    - google-colab:      1.0.0
    - google-crc32c:     1.5.0
    - google-pasta:      0.2.0
    - google-resumable-media: 2.4.0
    - googleapis-common-protos: 1.57.0
    - googledrivedownloader: 0.4
    - graphviz:          0.10.1
    - greenlet:          2.0.1
    - grpcio:            1.51.1
    - grpcio-status:     1.48.2
    - gspread:           3.4.2
    - gspread-dataframe: 3.0.8
    - gym:               0.25.2
    - gym-notices:       0.0.8
    - h5py:              3.1.0
    - heapdict:          1.0.1
    - hijri-converter:   2.2.4
    - holidays:          0.17.2
    - holoviews:         1.14.9
    - html5lib:          1.0.1
    - httpimport:        0.5.18
    - httplib2:          0.17.4
    - httpstan:          4.6.1
    - humanize:          0.5.1
    - hyperopt:          0.1.2
    - idna:              2.10
    - imageio:           2.9.0
    - imagesize:         1.4.1
    - imbalanced-learn:  0.8.1
    - imblearn:          0.0
    - imgaug:            0.4.0
    - importlib-metadata: 5.1.0
    - importlib-resources: 5.10.1
    - imutils:           0.5.4
    - inflect:           2.1.0
    - intel-openmp:      2022.2.1
    - intervaltree:      2.1.0
    - ipykernel:         5.3.4
    - ipython:           7.9.0
    - ipython-genutils:  0.2.0
    - ipython-sql:       0.3.9
    - ipywidgets:        7.7.1
    - itsdangerous:      1.1.0
    - jax:               0.3.25
    - jaxlib:            0.3.25+cuda11.cudnn805
    - jieba:             0.42.1
    - jinja2:            2.11.3
    - jmespath:          0.9.3
    - joblib:            1.2.0
    - jpeg4py:           0.1.4
    - jsonschema:        4.3.3
    - jupyter-client:    6.1.12
    - jupyter-console:   6.1.0
    - jupyter-core:      5.1.0
    - jupyterlab-widgets: 3.0.4
    - kaggle:            1.5.12
    - kapre:             0.3.7
    - keras:             2.9.0
    - keras-preprocessing: 1.1.2
    - keras-vis:         0.4.1
    - kiwisolver:        1.4.4
    - korean-lunar-calendar: 0.3.1
    - langcodes:         3.3.0
    - libclang:          14.0.6
    - librosa:           0.8.1
    - lightgbm:          2.2.3
    - lightning-utilities: 0.5.0
    - llvmlite:          0.39.1
    - lmdb:              0.99
    - locket:            1.0.0
    - logical-unification: 0.4.5
    - lunarcalendar:     0.0.9
    - lxml:              4.9.2
    - markdown:          3.4.1
    - markupsafe:        2.0.1
    - marshmallow:       3.19.0
    - matplotlib:        3.2.2
    - matplotlib-venn:   0.11.7
    - minikanren:        1.0.3
    - missingno:         0.5.1
    - mistune:           0.8.4
    - mizani:            0.7.3
    - mkl:               2019.0
    - mlxtend:           0.14.0
    - more-itertools:    9.0.0
    - moviepy:           0.2.3.5
    - mpmath:            1.2.1
    - msgpack:           1.0.4
    - multidict:         6.0.3
    - multipledispatch:  0.6.0
    - multitasking:      0.0.11
    - murmurhash:        1.0.9
    - music21:           5.5.0
    - natsort:           5.5.0
    - nbconvert:         5.6.1
    - nbformat:          5.7.0
    - netcdf4:           1.6.2
    - networkx:          2.8.8
    - nibabel:           3.0.2
    - nltk:              3.7
    - notebook:          5.7.16
    - numba:             0.56.4
    - numexpr:           2.8.4
    - numpy:             1.21.6
    - oauth2client:      4.1.3
    - oauthlib:          3.2.2
    - okgrade:           0.4.3
    - olefile:           0.45.1
    - opencv-contrib-python: 4.6.0.66
    - opencv-python:     4.6.0.66
    - opencv-python-headless: 4.6.0.66
    - openpyxl:          3.0.10
    - opt-einsum:        3.3.0
    - osqp:              0.6.2.post0
    - packaging:         21.3
    - palettable:        3.3.0
    - pandas:            1.3.5
    - pandas-datareader: 0.9.0
    - pandas-gbq:        0.17.9
    - pandas-profiling:  1.4.1
    - pandocfilters:     1.5.0
    - panel:             0.12.1
    - param:             1.12.3
    - parso:             0.8.3
    - partd:             1.3.0
    - pastel:            0.2.1
    - pathlib:           1.0.1
    - pathtools:         0.1.2
    - pathy:             0.10.1
    - patsy:             0.5.3
    - pep517:            0.13.0
    - pexpect:           4.8.0
    - pickleshare:       0.7.5
    - pillow:            7.1.2
    - pip:               21.1.3
    - pip-tools:         6.2.0
    - platformdirs:      2.6.0
    - plotly:            5.5.0
    - plotnine:          0.8.0
    - pluggy:            0.7.1
    - pooch:             1.6.0
    - portpicker:        1.3.9
    - prefetch-generator: 1.0.3
    - preshed:           3.0.8
    - prettytable:       3.5.0
    - progressbar2:      3.38.0
    - prometheus-client: 0.15.0
    - promise:           2.3
    - prompt-toolkit:    2.0.10
    - prophet:           1.1.1
    - proto-plus:        1.22.1
    - protobuf:          3.19.6
    - psutil:            5.4.8
    - psycopg2:          2.9.5
    - ptyprocess:        0.7.0
    - py:                1.11.0
    - pyarrow:           9.0.0
    - pyasn1:            0.4.8
    - pyasn1-modules:    0.2.8
    - pycocotools:       2.0.6
    - pycparser:         2.21
    - pyct:              0.4.8
    - pydantic:          1.10.2
    - pydata-google-auth: 1.4.0
    - pydot:             1.3.0
    - pydot-ng:          2.0.0
    - pydotplus:         2.0.2
    - pydrive:           1.3.1
    - pyemd:             0.5.1
    - pyerfa:            2.0.0.1
    - pygments:          2.6.1
    - pygobject:         3.26.1
    - pylev:             1.4.0
    - pymc:              4.1.4
    - pymeeus:           0.5.12
    - pymongo:           4.3.3
    - pymystem3:         0.2.0
    - pyopengl:          3.1.6
    - pyopenssl:         22.1.0
    - pyparsing:         3.0.9
    - pyrsistent:        0.19.2
    - pysimdjson:        3.2.0
    - pysndfile:         1.3.8
    - pysocks:           1.7.1
    - pystan:            3.3.0
    - pytest:            3.6.4
    - python-apt:        0.0.0
    - python-dateutil:   2.8.2
    - python-louvain:    0.16
    - python-slugify:    7.0.0
    - python-utils:      3.4.5
    - pytorch-lightning: 1.8.6
    - pytz:              2022.6
    - pyviz-comms:       2.2.1
    - pywavelets:        1.4.1
    - pyyaml:            5.4.1
    - pyzmq:             23.2.1
    - qdldl:             0.1.5.post2
    - qudida:            0.0.4
    - regex:             2022.6.2
    - requests:          2.23.0
    - requests-oauthlib: 1.3.1
    - resampy:           0.4.2
    - roman:             2.0.0
    - rpy2:              3.5.5
    - rsa:               4.7.2
    - s3fs:              2022.11.0
    - s3transfer:        0.6.0
    - scikit-image:      0.18.3
    - scikit-learn:      1.0.2
    - scipy:             1.7.3
    - screen-resolution-extra: 0.0.0
    - scs:               3.2.2
    - seaborn:           0.11.2
    - send2trash:        1.8.0
    - sentry-sdk:        1.9.0
    - setproctitle:      1.3.2
    - setuptools:        57.4.0
    - setuptools-git:    1.2
    - shapely:           2.0.0
    - shortuuid:         1.0.11
    - six:               1.15.0
    - sklearn-pandas:    1.8.0
    - smart-open:        6.3.0
    - smmap:             5.0.0
    - snowballstemmer:   2.2.0
    - sortedcontainers:  2.4.0
    - soundfile:         0.11.0
    - spacy:             3.4.4
    - spacy-legacy:      3.0.10
    - spacy-loggers:     1.0.4
    - sphinx:            1.8.6
    - sphinxcontrib-serializinghtml: 1.1.5
    - sphinxcontrib-websupport: 1.2.4
    - sqlalchemy:        1.4.45
    - sqlparse:          0.4.3
    - srsly:             2.4.5
    - statsmodels:       0.12.2
    - sympy:             1.7.1
    - tables:            3.7.0
    - tabulate:          0.8.10
    - tblib:             1.7.0
    - tenacity:          8.1.0
    - tensorboard:       2.9.1
    - tensorboard-data-server: 0.6.1
    - tensorboard-plugin-wit: 1.8.1
    - tensorboardx:      2.5.1
    - tensorflow:        2.9.2
    - tensorflow-datasets: 4.6.0
    - tensorflow-estimator: 2.9.0
    - tensorflow-gcs-config: 2.9.1
    - tensorflow-hub:    0.12.0
    - tensorflow-io-gcs-filesystem: 0.28.0
    - tensorflow-metadata: 1.12.0
    - tensorflow-probability: 0.17.0
    - termcolor:         2.1.1
    - terminado:         0.13.3
    - testpath:          0.6.0
    - text-unidecode:    1.3
    - textblob:          0.15.3
    - thinc:             8.1.5
    - threadpoolctl:     3.1.0
    - tifffile:          2022.10.10
    - toml:              0.10.2
    - tomli:             2.0.1
    - toolz:             0.12.0
    - torch:             1.13.0+cu116
    - torchaudio:        0.13.0+cu116
    - torchmetrics:      0.11.0
    - torchsummary:      1.5.1
    - torchtext:         0.14.0
    - torchvision:       0.14.0+cu116
    - tornado:           6.0.4
    - tqdm:              4.64.1
    - traitlets:         5.7.1
    - tweepy:            3.10.0
    - typeguard:         2.7.1
    - typer:             0.7.0
    - typing-extensions: 4.4.0
    - tzlocal:           1.5.1
    - uritemplate:       3.0.1
    - urllib3:           1.25.11
    - vega-datasets:     0.9.0
    - wandb:             0.13.7
    - wasabi:            0.10.1
    - wcwidth:           0.2.5
    - webargs:           8.2.0
    - webencodings:      0.5.1
    - werkzeug:          1.0.1
    - wheel:             0.38.4
    - widgetsnbextension: 3.6.1
    - wordcloud:         1.8.2.2
    - wrapt:             1.14.1
    - xarray:            2022.12.0
    - xarray-einstats:   0.4.0
    - xgboost:           0.90
    - xkit:              0.0.0
    - xlrd:              1.2.0
    - xlwt:              1.3.0
    - yarl:              1.8.2
    - yellowbrick:       1.5
    - zict:              2.2.0
    - zipp:              3.11.0
* System:
    - OS:                Linux
    - architecture:
        - 64bit
        - 
    - processor:         x86_64
    - python:            3.8.16
    - version:           #1 SMP Fri Aug 26 08:44:51 UTC 2022

More info

What I really want for Christmas this year, all packaged together:

cc @borda

turian commented 1 year ago

Fixed formatting issues in the description.

ELind77 commented 1 year ago

I also hit this issue, and I noticed something that I think is related. When I run my Trainer with default_root_dir=s3:/bucket/path/, I find an s3: directory in my local working directory! It even has all of the subdirectories created under it.

I happened to have this working in PyCharm so I ran a quick debug session and confirmed that pl.core.saving.save_hparams_to_yaml has the correct fs argument. It does indeed have an s3 file system in there and it's able to see and access the bucket I was trying to write to.

This makes me think that some other subroutine is being handed a local filesystem and is creating the requisite paths locally. Since S3 is a key-value store, you must need some special function to create those empty directories on S3 for that check to pass, right (assuming this worked properly in the past)?
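
A small illustration of how a local `s3:` directory could appear. This is an assumption about the mechanism, not something confirmed from the Lightning source: if any code path hands the remote URL to a local-only call such as `os.makedirs`, a POSIX system treats `s3:` as an ordinary directory name. The URL below is hypothetical.

```python
import os
import tempfile

# Work inside a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as workdir:
    url = "s3://bucket/path"  # hypothetical remote URL

    # A local-only call sees the URL as a relative path, not a remote location.
    os.makedirs(os.path.join(workdir, url))

    created_top_level = os.listdir(workdir)
    nested_exists = os.path.isdir(os.path.join(workdir, "s3:", "bucket", "path"))
```

After this runs, the temporary directory contains a single `s3:` entry with `bucket/path` nested inside it, matching the directory layout described above.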

I'm very new to Lightning (slowly dragging myself away from all of my very old Keras code), so I don't know the code base well enough to just dive in and patch this, but it is definitely a serious irritant to my workflow.

ELind77 commented 1 year ago

Why is this labeled as a feature and not a bug?

ELind77 commented 1 year ago

Thank you @Borda. I appreciate that.

carmocca commented 1 year ago

This is considered a feature and not a bug because fsspec support for the CSV logger is not implemented. If it were implemented but not working properly, we would consider it a bug.

ELind77 commented 1 year ago

Oh, that's interesting. Maybe the documentation needs to be changed then? I was going based on the Remote Filesystems documentation page which has this example at the top of the page:

# `default_root_dir` is the default path used for logs and checkpoints
trainer = Trainer(default_root_dir="s3://my_bucket/data/")
trainer.fit(model)

If my understanding is correct, that example should use the CSVLogger by default, right (as stated in the Trainer API docs)?

Thank you very much for the fast PR though. Maybe it's fixing a docs bug and adding a new feature?

carmocca commented 1 year ago

I understand your confusion now. We recently changed the default logger from TensorBoardLogger to CSVLogger: Lightning-AI/pytorch-lightning#9900. TensorBoardLogger did support fsspec, but CSVLogger didn't. So you are right that the docs are incorrect until Lightning-AI/pytorch-lightning#16880 is merged.