Closed turian closed 1 year ago
Fixed formatting issues in the description.
I also have this issue, and I noticed something that I think is related. When I run my Trainer with `default_root_dir=s3:/bucket/path/`, I find an `s3:` path in my local working directory! It even has all of the subdirectories added.
I happened to have this working in PyCharm, so I ran a quick debug session and confirmed that `pl.core.saving.save_hparams_to_yaml` has the correct `fs` argument. It does indeed have an S3 file system in there, and it's able to see and access the bucket I was trying to write to.
This makes me think that some other subroutine is getting a local filesystem passed and creating the requisite paths. Since S3 is a key-value system you must have some special function to create those empty directories on S3 in order to get that check to pass, right (assuming this worked properly in the past)?
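For what it's worth, a local `s3:` directory is exactly what you would expect if some code path treats the URL as a plain relative path. Below is a minimal sketch of that effect, assuming a POSIX filesystem. `make_dirs_like_local_fs` is a hypothetical helper written for illustration only; it is not Lightning code and does not claim to be the actual culprit.

```python
import os
import tempfile


def make_dirs_like_local_fs(root_dir: str, workdir: str) -> list[str]:
    """Create root_dir under workdir the way a local-only code path would.

    Hypothetical helper: it ignores the URL scheme and treats the whole
    string as a relative path, which is the suspected failure mode.
    """
    os.makedirs(os.path.join(workdir, root_dir), exist_ok=True)
    # Return the top-level entries that now exist in the working directory.
    return sorted(os.listdir(workdir))


with tempfile.TemporaryDirectory() as tmp:
    created = make_dirs_like_local_fs("s3:/bucket/path/", tmp)
    # The nested "bucket/path" directories end up under a local "s3:" folder.
    nested_exists = os.path.isdir(os.path.join(tmp, "s3:", "bucket", "path"))
    print(created, nested_exists)
```

If this is what is happening, the fix would be to route directory creation through the same `fs` object that `save_hparams_to_yaml` already receives, rather than through local `os` calls.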
I'm very new to lightning (slowly dragging myself away from all of my very old keras code) so I don't know the code base well enough to just dive in and patch this but this is definitely a serious irritant to my workflow.
Why is this labeled as a feature and not a bug?
Thank you @Borda. I appreciate that.
This is considered a feature and not a bug because fsspec support for the CSV logger is not implemented. If it were implemented but not working properly, then we would consider it a bug.
Oh, that's interesting. Maybe the documentation needs to be changed then? I was going based on the Remote Filesystems documentation page which has this example at the top of the page:
# `default_root_dir` is the default path used for logs and checkpoints
trainer = Trainer(default_root_dir="s3://my_bucket/data/")
trainer.fit(model)
If my understanding is correct, that example should use the `CSVLogger` by default, right (as stated in the Trainer API docs)?
Thank you very much for the fast PR though. Maybe it's fixing a docs bug and adding a new feature?
I understand your confusion now. We changed the default logger from TensorBoardLogger to CSVLogger recently: Lightning-AI/pytorch-lightning#9900. TensorBoard did support fsspec, but CSVLogger didn't. So you are correct that the docs are incorrect until Lightning-AI/pytorch-lightning#16880 is merged
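To illustrate what "supporting fsspec" could mean for a CSV writer, here is a rough sketch in which the file-opening function is injected instead of hard-coded. `write_metrics_csv` and its `opener` parameter are hypothetical names chosen for this example; this is not how `CSVLogger` is implemented, only one way the idea might look.

```python
import csv
import os
import tempfile


def write_metrics_csv(path, rows, opener=open):
    """Write metric rows as CSV through a caller-supplied opener.

    Hypothetical helper: `opener` defaults to the built-in `open`, so the
    function works locally without any extra dependency.
    """
    # Collect every key that appears in any row so no column is dropped.
    fieldnames = sorted({key for row in rows for key in row})
    with opener(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


rows = [{"step": 0, "loss": 1.25}, {"step": 1, "loss": 0.9}]
with tempfile.TemporaryDirectory() as tmp:
    local_path = os.path.join(tmp, "metrics.csv")
    write_metrics_csv(local_path, rows)
    # Read the file back to confirm what was written.
    with open(local_path, newline="") as f:
        written = list(csv.DictReader(f))
```

Assuming fsspec and an S3 backend are installed, passing `opener=fsspec.open` together with an `s3://` path should be the remote variant of the same call, since `fsspec.open` can also be used in a `with` statement.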
Bug description
Cloud checkpoints are cool! But I also want CSVLogger to periodically write to cloud storage. This doesn't work.
Related bug: Lightning-AI/pytorch-lightning#16195. See 'More info' at the bottom of this issue.
There are some related issues: https://github.com/Lightning-AI/lightning/pull/14325 https://github.com/Lightning-AI/lightning/issues/5935 https://github.com/Lightning-AI/lightning/issues/11769 https://github.com/Lightning-AI/lightning/issues/15539 https://github.com/Lightning-AI/lightning/issues/2318 https://github.com/Lightning-AI/lightning/issues/2161 but I haven't found this specifically.
How to reproduce the bug
Here is a Google Colab that replicates this and a related bug. I share the code for both because it's easier to configure the AWS credentials and see both bugs simultaneously.
Copying and pasting the most important bit (but see the colab for a full minimal replication):
Error messages and logs
Environment
More info
What I really want for Christmas this year, all packaged together:
`trainer.default_root_dir` that also saves checkpoints to s3.
cc @borda