Closed lchu-ibm closed 4 months ago
@nairbv I have the same feeling that this isn't the root cause.
Re. tokenizer - yes - pretraining runs on fully preprocessed data (the preprocessing step already performs tokenization), so there is no tokenizer involved at training time at all.
This actually caused a little confusion when converting the model, as stated in the last "Note" section of https://github.com/foundation-model-stack/fms-fsdp/blob/doc/README.md
When using a large global batch size, we hit this error: `RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)`. Upon checking, this weird error seems to indicate some issue with how the dataloader distributes data among workers. Somehow, increasing the vocab size from 10k to 32k solved the large-batch-size issue. This also aligns with Llama's 32k vocab size.
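One common cause of opaque CUDA errors like this is a mismatch between the data's vocabulary and the model's embedding size: token IDs at or above the embedding table's vocab size index out of bounds on the GPU, and the failure can surface as a cryptic CUBLAS/CUDA error rather than a clear index error. A minimal sanity check along these lines (the function and variable names here are hypothetical, not from the repo) can catch the mismatch before training starts:

```python
def check_token_ids(batches, vocab_size):
    """Return the largest token ID seen; raise if any ID is out of range.

    Token IDs must satisfy 0 <= id < vocab_size, since the embedding
    lookup on GPU does not give a friendly error for out-of-range IDs.
    """
    max_id = -1
    for batch in batches:
        for token_id in batch:
            if token_id < 0 or token_id >= vocab_size:
                raise ValueError(
                    f"token ID {token_id} out of range for vocab_size={vocab_size}"
                )
            max_id = max(max_id, token_id)
    return max_id


# Example: data tokenized with a 32k vocab passes against a 32k embedding,
# but would fail against a 10k one.
batches = [[5, 9999, 31006], [12, 31999]]
print(check_token_ids(batches, 32000))  # prints 31999
```

Running the same check with `vocab_size=10000` raises a `ValueError`, which is a much clearer signal than a runtime CUBLAS failure mid-training.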