foundation-model-stack / fms-fsdp

🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and the SDPA implementation of Flash Attention v2.
https://pytorch.org/docs/stable/fsdp.html
Apache License 2.0

fix dummy dataloader #27

Closed · lchu-ibm closed this 4 months ago

lchu-ibm commented 4 months ago

When using a large global batch size, we hit this error: RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle). Upon checking, this odd error seems to point to an issue with how the dataloader distributes data among workers. Somehow, increasing vsize from 10k to 32k resolved the large-batch-size issue. This also aligns with Llama's 32k vocab size.
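For reference, here is a minimal sketch of what such a dummy dataloader could look like; the names (DummyTokenDataset, vocab_size, seq_len) are illustrative and not the repo's actual code, and vocab_size plays the role of the vsize mentioned above:

```python
# Hypothetical sketch of a dummy dataloader for pretraining benchmarks.
# Names and defaults are illustrative, not the repo's actual API.
import torch
from torch.utils.data import Dataset, DataLoader


class DummyTokenDataset(Dataset):
    """Yields random token IDs, standing in for pre-tokenized pretraining data."""

    def __init__(self, vocab_size: int = 32_000, seq_len: int = 4096, num_samples: int = 100_000):
        self.vocab_size = vocab_size  # should match the model's embedding size
        self.seq_len = seq_len
        self.num_samples = num_samples

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, idx: int):
        # Random IDs in [0, vocab_size); inputs and labels shifted by one token.
        tokens = torch.randint(0, self.vocab_size, (self.seq_len + 1,))
        return tokens[:-1], tokens[1:]


# Usage: each data-parallel worker builds its own loader over the dummy data.
loader = DataLoader(DummyTokenDataset(vocab_size=32_000), batch_size=2)
```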

lchu-ibm commented 4 months ago

@nairbv I have the same feeling that this isn't the root cause.

Re: tokenizer - yes, pretraining runs on fully preprocessed data (the preprocessing step includes tokenization), so there is no tokenizer involved at all.

This actually caused a little confusion when converting the model, as noted in the last "Note" section of https://github.com/foundation-model-stack/fms-fsdp/blob/doc/README.md