Cloud checkpoints are cool! But once you use the `WandbLogger`, no cloud checkpoints (or anything else, really) are saved to `trainer.default_root_dir`. The model is checkpointed as a Wandb artifact, which is cool, but I also want it in the S3 bucket that `trainer.default_root_dir` points to.
The reason I want this:
Wandb checkpoints are good if you want to go back and find something from six months ago.
However, they are a pain to use if you are in a back-to-back experimental cycle, compared with just remembering the S3 location and using it. Additionally, they are incompatible with @skypilot-org storage, which is a much cleaner idiom / pattern.
Related bug: Lightning-AI/pytorch-lightning#16196. See 'More info' at the bottom of this issue.
Here is a Google Colab that replicates this and a related bug. I share the code for both because it's easier to configure the AWS credentials and see both bugs simultaneously.
Copying and pasting the most important bit (but see the colab for a full minimal replication):
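A minimal sketch of the kind of setup described here, not the colab's exact code. It assumes a `BORING_BUCKET` environment variable holding the bucket URL, a W&B project named `wandbtest`, and `log_model=True`; `root_dir` and `build_trainer` are hypothetical helper names introduced only for illustration.

```python
import os


def root_dir(bucket: str, prefix: str = "wandbtest") -> str:
    """Build the S3 location passed as trainer.default_root_dir."""
    return f"{bucket.rstrip('/')}/{prefix}/"


def build_trainer(bucket: str):
    """Trainer with an S3 default_root_dir plus a WandbLogger (sketch)."""
    # Imported lazily so root_dir() works without Lightning installed.
    import pytorch_lightning as pl
    from pytorch_lightning.loggers import WandbLogger

    # Assumption: project name and log_model=True, which asks the logger
    # to upload checkpoints to W&B as artifacts.
    logger = WandbLogger(project="wandbtest", log_model=True)
    return pl.Trainer(
        default_root_dir=root_dir(bucket),
        logger=logger,
        max_epochs=1,
    )


if __name__ == "__main__":
    # Assumption: BORING_BUCKET looks like "s3://my-bucket".
    bucket = os.environ.get("BORING_BUCKET", "s3://my-bucket")
    print(root_dir(bucket))
    # trainer = build_trainer(bucket)
    # trainer.fit(model)  # model: any LightningModule, e.g. a BoringModel
```

With a setup along these lines, the expectation in this report is that checkpoints would appear under the printed S3 location as well as in Wandb.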
### Error messages and logs
There is no error message, but `{BORING_BUCKET}/wandbtest/` (an S3 location) is empty, and the checkpoint is only in Wandb.
### Environment
CUDA:
GPU:
Tesla T4
available: True
version: 11.6
Lightning:
lightning-utilities: 0.5.0
pytorch-lightning: 1.8.6
torch: 1.13.0+cu116
torchaudio: 0.13.0+cu116
torchmetrics: 0.11.0
torchsummary: 1.5.1
torchtext: 0.14.0
torchvision: 0.14.0+cu116
Packages:
absl-py: 1.3.0
aeppl: 0.0.33
aesara: 2.7.9
aiobotocore: 2.4.2
aiohttp: 3.8.3
aioitertools: 0.11.0
aiosignal: 1.3.1
alabaster: 0.7.12
albumentations: 1.2.1
altair: 4.2.0
appdirs: 1.4.4
arviz: 0.12.1
astor: 0.8.1
astropy: 4.3.1
astunparse: 1.6.3
async-timeout: 4.0.2
atari-py: 0.2.9
atomicwrites: 1.4.1
attrs: 22.1.0
audioread: 3.0.0
autograd: 1.5
awscli: 1.25.60
babel: 2.11.0
backcall: 0.2.0
beautifulsoup4: 4.6.3
bleach: 5.0.1
blis: 0.7.9
bokeh: 2.3.3
boto3: 1.24.59
botocore: 1.27.59
branca: 0.6.0
bs4: 0.0.1
cachecontrol: 0.12.11
cachetools: 5.2.0
catalogue: 2.0.8
certifi: 2022.12.7
cffi: 1.15.1
cftime: 1.6.2
chardet: 3.0.4
charset-normalizer: 2.1.1
click: 7.1.2
clikit: 0.6.2
cloudpickle: 1.5.0
cmake: 3.22.6
cmdstanpy: 1.0.8
colorama: 0.3.7
colorcet: 3.0.1
colorlover: 0.3.0
community: 1.0.0b1
confection: 0.0.3
cons: 0.4.5
contextlib2: 0.5.5
convertdate: 2.4.0
crashtest: 0.3.1
crcmod: 1.7
cryptography: 38.0.4
cufflinks: 0.17.3
cupy-cuda11x: 11.0.0
cvxopt: 1.3.0
cvxpy: 1.2.2
cycler: 0.11.0
cymem: 2.0.7
cython: 0.29.32
daft: 0.0.4
dask: 2022.2.1
datascience: 0.17.5
db-dtypes: 1.0.5
debugpy: 1.0.0
decorator: 4.4.2
defusedxml: 0.7.1
descartes: 1.1.0
dill: 0.3.6
distributed: 2022.2.1
dlib: 19.24.0
dm-tree: 0.1.7
dnspython: 2.2.1
docker-pycreds: 0.4.0
docutils: 0.16
dopamine-rl: 1.0.5
earthengine-api: 0.1.335
easydict: 1.10
ecos: 2.0.10
editdistance: 0.5.3
en-core-web-sm: 3.4.1
entrypoints: 0.4
ephem: 4.1.3
et-xmlfile: 1.1.0
etils: 0.9.0
etuples: 0.3.8
fa2: 0.3.5
fastai: 2.7.10
fastcore: 1.5.27
fastdownload: 0.0.7
fastdtw: 0.3.4
fastjsonschema: 2.16.2
fastprogress: 1.0.3
fastrlock: 0.8.1
feather-format: 0.4.1
filelock: 3.8.2
firebase-admin: 5.3.0
fix-yahoo-finance: 0.0.22
flask: 1.1.4
flatbuffers: 1.12
folium: 0.12.1.post1
frozenlist: 1.3.3
fsspec: 2022.11.0
future: 0.16.0
gast: 0.4.0
gdal: 2.2.2
gdown: 4.4.0
gensim: 3.6.0
geographiclib: 1.52
geopy: 1.17.0
gin-config: 0.5.0
gitdb: 4.0.10
gitpython: 3.1.29
glob2: 0.7
google: 2.0.3
google-api-core: 2.8.2
google-api-python-client: 1.12.11
google-auth: 2.15.0
google-auth-httplib2: 0.0.4
google-auth-oauthlib: 0.4.6
google-cloud-bigquery: 3.3.6
google-cloud-bigquery-storage: 2.16.2
google-cloud-core: 2.3.2
google-cloud-datastore: 2.9.0
google-cloud-firestore: 2.7.2
google-cloud-language: 2.6.1
google-cloud-storage: 2.5.0
google-cloud-translate: 3.8.4
google-colab: 1.0.0
google-crc32c: 1.5.0
google-pasta: 0.2.0
google-resumable-media: 2.4.0
googleapis-common-protos: 1.57.0
googledrivedownloader: 0.4
graphviz: 0.10.1
greenlet: 2.0.1
grpcio: 1.51.1
grpcio-status: 1.48.2
gspread: 3.4.2
gspread-dataframe: 3.0.8
gym: 0.25.2
gym-notices: 0.0.8
h5py: 3.1.0
heapdict: 1.0.1
hijri-converter: 2.2.4
holidays: 0.17.2
holoviews: 1.14.9
html5lib: 1.0.1
httpimport: 0.5.18
httplib2: 0.17.4
httpstan: 4.6.1
humanize: 0.5.1
hyperopt: 0.1.2
idna: 2.10
imageio: 2.9.0
imagesize: 1.4.1
imbalanced-learn: 0.8.1
imblearn: 0.0
imgaug: 0.4.0
importlib-metadata: 5.1.0
importlib-resources: 5.10.1
imutils: 0.5.4
inflect: 2.1.0
intel-openmp: 2022.2.1
intervaltree: 2.1.0
ipykernel: 5.3.4
ipython: 7.9.0
ipython-genutils: 0.2.0
ipython-sql: 0.3.9
ipywidgets: 7.7.1
itsdangerous: 1.1.0
jax: 0.3.25
jaxlib: 0.3.25+cuda11.cudnn805
jieba: 0.42.1
jinja2: 2.11.3
jmespath: 0.9.3
joblib: 1.2.0
jpeg4py: 0.1.4
jsonschema: 4.3.3
jupyter-client: 6.1.12
jupyter-console: 6.1.0
jupyter-core: 5.1.0
jupyterlab-widgets: 3.0.4
kaggle: 1.5.12
kapre: 0.3.7
keras: 2.9.0
keras-preprocessing: 1.1.2
keras-vis: 0.4.1
kiwisolver: 1.4.4
korean-lunar-calendar: 0.3.1
langcodes: 3.3.0
libclang: 14.0.6
librosa: 0.8.1
lightgbm: 2.2.3
lightning-utilities: 0.5.0
llvmlite: 0.39.1
lmdb: 0.99
locket: 1.0.0
logical-unification: 0.4.5
lunarcalendar: 0.0.9
lxml: 4.9.2
markdown: 3.4.1
markupsafe: 2.0.1
marshmallow: 3.19.0
matplotlib: 3.2.2
matplotlib-venn: 0.11.7
minikanren: 1.0.3
missingno: 0.5.1
mistune: 0.8.4
mizani: 0.7.3
mkl: 2019.0
mlxtend: 0.14.0
more-itertools: 9.0.0
moviepy: 0.2.3.5
mpmath: 1.2.1
msgpack: 1.0.4
multidict: 6.0.3
multipledispatch: 0.6.0
multitasking: 0.0.11
murmurhash: 1.0.9
music21: 5.5.0
natsort: 5.5.0
nbconvert: 5.6.1
nbformat: 5.7.0
netcdf4: 1.6.2
networkx: 2.8.8
nibabel: 3.0.2
nltk: 3.7
notebook: 5.7.16
numba: 0.56.4
numexpr: 2.8.4
numpy: 1.21.6
oauth2client: 4.1.3
oauthlib: 3.2.2
okgrade: 0.4.3
olefile: 0.45.1
opencv-contrib-python: 4.6.0.66
opencv-python: 4.6.0.66
opencv-python-headless: 4.6.0.66
openpyxl: 3.0.10
opt-einsum: 3.3.0
osqp: 0.6.2.post0
packaging: 21.3
palettable: 3.3.0
pandas: 1.3.5
pandas-datareader: 0.9.0
pandas-gbq: 0.17.9
pandas-profiling: 1.4.1
pandocfilters: 1.5.0
panel: 0.12.1
param: 1.12.3
parso: 0.8.3
partd: 1.3.0
pastel: 0.2.1
pathlib: 1.0.1
pathtools: 0.1.2
pathy: 0.10.1
patsy: 0.5.3
pep517: 0.13.0
pexpect: 4.8.0
pickleshare: 0.7.5
pillow: 7.1.2
pip: 21.1.3
pip-tools: 6.2.0
platformdirs: 2.6.0
plotly: 5.5.0
plotnine: 0.8.0
pluggy: 0.7.1
pooch: 1.6.0
portpicker: 1.3.9
prefetch-generator: 1.0.3
preshed: 3.0.8
prettytable: 3.5.0
progressbar2: 3.38.0
prometheus-client: 0.15.0
promise: 2.3
prompt-toolkit: 2.0.10
prophet: 1.1.1
proto-plus: 1.22.1
protobuf: 3.19.6
psutil: 5.4.8
psycopg2: 2.9.5
ptyprocess: 0.7.0
py: 1.11.0
pyarrow: 9.0.0
pyasn1: 0.4.8
pyasn1-modules: 0.2.8
pycocotools: 2.0.6
pycparser: 2.21
pyct: 0.4.8
pydantic: 1.10.2
pydata-google-auth: 1.4.0
pydot: 1.3.0
pydot-ng: 2.0.0
pydotplus: 2.0.2
pydrive: 1.3.1
pyemd: 0.5.1
pyerfa: 2.0.0.1
pygments: 2.6.1
pygobject: 3.26.1
pylev: 1.4.0
pymc: 4.1.4
pymeeus: 0.5.12
pymongo: 4.3.3
pymystem3: 0.2.0
pyopengl: 3.1.6
pyopenssl: 22.1.0
pyparsing: 3.0.9
pyrsistent: 0.19.2
pysimdjson: 3.2.0
pysndfile: 1.3.8
pysocks: 1.7.1
pystan: 3.3.0
pytest: 3.6.4
python-apt: 0.0.0
python-dateutil: 2.8.2
python-louvain: 0.16
python-slugify: 7.0.0
python-utils: 3.4.5
pytorch-lightning: 1.8.6
pytz: 2022.6
pyviz-comms: 2.2.1
pywavelets: 1.4.1
pyyaml: 5.4.1
pyzmq: 23.2.1
qdldl: 0.1.5.post2
qudida: 0.0.4
regex: 2022.6.2
requests: 2.23.0
requests-oauthlib: 1.3.1
resampy: 0.4.2
roman: 2.0.0
rpy2: 3.5.5
rsa: 4.7.2
s3fs: 2022.11.0
s3transfer: 0.6.0
scikit-image: 0.18.3
scikit-learn: 1.0.2
scipy: 1.7.3
screen-resolution-extra: 0.0.0
scs: 3.2.2
seaborn: 0.11.2
send2trash: 1.8.0
sentry-sdk: 1.9.0
setproctitle: 1.3.2
setuptools: 57.4.0
setuptools-git: 1.2
shapely: 2.0.0
shortuuid: 1.0.11
six: 1.15.0
sklearn-pandas: 1.8.0
smart-open: 6.3.0
smmap: 5.0.0
snowballstemmer: 2.2.0
sortedcontainers: 2.4.0
soundfile: 0.11.0
spacy: 3.4.4
spacy-legacy: 3.0.10
spacy-loggers: 1.0.4
sphinx: 1.8.6
sphinxcontrib-serializinghtml: 1.1.5
sphinxcontrib-websupport: 1.2.4
sqlalchemy: 1.4.45
sqlparse: 0.4.3
srsly: 2.4.5
statsmodels: 0.12.2
sympy: 1.7.1
tables: 3.7.0
tabulate: 0.8.10
tblib: 1.7.0
tenacity: 8.1.0
tensorboard: 2.9.1
tensorboard-data-server: 0.6.1
tensorboard-plugin-wit: 1.8.1
tensorboardx: 2.5.1
tensorflow: 2.9.2
tensorflow-datasets: 4.6.0
tensorflow-estimator: 2.9.0
tensorflow-gcs-config: 2.9.1
tensorflow-hub: 0.12.0
tensorflow-io-gcs-filesystem: 0.28.0
tensorflow-metadata: 1.12.0
tensorflow-probability: 0.17.0
termcolor: 2.1.1
terminado: 0.13.3
testpath: 0.6.0
text-unidecode: 1.3
textblob: 0.15.3
thinc: 8.1.5
threadpoolctl: 3.1.0
tifffile: 2022.10.10
toml: 0.10.2
tomli: 2.0.1
toolz: 0.12.0
torch: 1.13.0+cu116
torchaudio: 0.13.0+cu116
torchmetrics: 0.11.0
torchsummary: 1.5.1
torchtext: 0.14.0
torchvision: 0.14.0+cu116
tornado: 6.0.4
tqdm: 4.64.1
traitlets: 5.7.1
tweepy: 3.10.0
typeguard: 2.7.1
typer: 0.7.0
typing-extensions: 4.4.0
tzlocal: 1.5.1
uritemplate: 3.0.1
urllib3: 1.25.11
vega-datasets: 0.9.0
wandb: 0.13.7
wasabi: 0.10.1
wcwidth: 0.2.5
webargs: 8.2.0
webencodings: 0.5.1
werkzeug: 1.0.1
wheel: 0.38.4
widgetsnbextension: 3.6.1
wordcloud: 1.8.2.2
wrapt: 1.14.1
xarray: 2022.12.0
xarray-einstats: 0.4.0
xgboost: 0.90
xkit: 0.0.0
xlrd: 1.2.0
xlwt: 1.3.0
yarl: 1.8.2
yellowbrick: 1.5
zict: 2.2.0
zipp: 3.11.0
System:
OS: Linux
architecture:
64bit
processor: x86_64
python: 3.8.16
version: #1 SMP Fri Aug 26 08:44:51 UTC 2022
### More info
What I really want for Christmas this year, all packaged together:
I have a `CSVLogger` that persists to S3.
I have a `WandbLogger` that saves checkpoints to Wandb.
I have an S3 `trainer.default_root_dir` that also saves checkpoints to S3.
cc @awaelchli @morganmcg1 @borisdayma @scottire @parambharat @manangoel99
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
There are some related issues and PRs, but I haven't found one that covers this specifically: https://github.com/Lightning-AI/lightning/pull/14325, https://github.com/Lightning-AI/lightning/issues/5935, https://github.com/Lightning-AI/lightning/issues/11769, https://github.com/Lightning-AI/lightning/issues/15539, https://github.com/Lightning-AI/lightning/issues/2318, https://github.com/Lightning-AI/lightning/issues/2161.