TRI-ML / prismatic-vlms

A flexible and efficient codebase for training visually-conditioned language models (VLMs)
MIT License

Move barrier before checkpoint save to reduce save timeouts #38

Open jensen-gao opened 3 months ago

jensen-gao commented 3 months ago

We discussed this offline previously. After using this change more extensively, I've found that checkpoints still save properly, and barrier timeouts during saving happen much less often (I have not had one in the past ~2 months).
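
For readers without the diff at hand, here is a minimal sketch of the ordering this PR describes. It is not the actual prismatic-vlms code. The function name `save_checkpoint_barrier_first` and its parameters are hypothetical, and it assumes that only rank 0 writes the checkpoint. The barrier and save operations are passed in as callables so the ordering can be checked without a distributed setup. In a real run they could be `torch.distributed.barrier` and a wrapper around `torch.save`.

```python
from typing import Callable


def save_checkpoint_barrier_first(
    rank: int,
    barrier: Callable[[], None],
    save_fn: Callable[[], None],
) -> None:
    """Synchronize all ranks first, then let rank 0 write the checkpoint.

    Hypothetical helper for illustration only.
    Assumption: rank 0 is the only rank that writes to disk.
    """
    # Every rank reaches this point before any disk I/O starts, so the
    # barrier only has to cover differences in arrival time between ranks.
    barrier()

    # The (possibly slow) write happens after the barrier, so no rank is
    # blocked inside the barrier while rank 0 is writing.
    if rank == 0:
        save_fn()
```

With the barrier placed after the save instead, the non-writing ranks would wait inside the barrier for the full duration of the write. That is presumably what let a slow write exceed the collective timeout, although the comment above does not spell out the mechanism.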