IndoNLP / indonlu

The first-ever vast natural language processing benchmark for the Indonesian language. We provide multiple downstream tasks, pre-trained IndoBERT models, and starter code! (AACL-IJCNLP 2020)
https://indobenchmark.com
Apache License 2.0
534 stars · 190 forks

Computation Power to Pretrain the Indo4B Dataset From Scratch #41

Closed krisbianprabowo closed 1 year ago

krisbianprabowo commented 1 year ago

Hi, I'm actually using one of your models for text similarity and it works great!

I would like to pretrain the model from scratch using the Indo4B dataset, which is quite large (~24GB). May I know how much RAM and VRAM are needed to train it with the same batch size you stated in your paper, i.e. a batch size of 256 for IndoBERT-BASE? Are 16GB of VRAM and 32GB of RAM enough?

Thank you for such amazing work!

SamuelCahyawijaya commented 1 year ago

Hi @krisbianprabowo, thank you for your interest in our work. For reference, we pre-trained the model using a TPU Pod. If you want to use a batch size of 256 and you don't have enough memory, you can try gradient accumulation to collect the gradients across smaller batches. You can further check Section 4.2 in our IndoNLU paper for more details on the pre-training.

[Screenshot 2022-12-06 at 9:53:50 AM]
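As a rough illustration of the gradient-accumulation idea (this is not our actual TPU pre-training code), the sketch below simulates an effective batch size of 256 on a single GPU with PyTorch and Hugging Face Transformers. The micro-batch size, learning rate, sequence length, and the stand-in corpus are all illustrative assumptions, not values from the paper.

```python
# Minimal sketch, assuming PyTorch + Hugging Face Transformers; NOT the authors'
# TPU pre-training code. It reaches an effective batch size of 256 on one GPU
# by accumulating gradients over smaller micro-batches.
import torch
from torch.utils.data import DataLoader
from transformers import (AutoTokenizer, BertConfig, BertForMaskedLM,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))  # fresh, untrained weights
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Stand-in corpus; in practice these lines would be streamed from Indo4B.
texts = ["Ini adalah contoh kalimat bahasa Indonesia."] * 512
examples = [tokenizer(t, truncation=True, max_length=128) for t in texts]

micro_batch = 8                     # whatever fits in your VRAM (assumption)
accum_steps = 256 // micro_batch    # accumulate up to an effective batch of 256
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
loader = DataLoader(examples, batch_size=micro_batch, shuffle=True, collate_fn=collator)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
optimizer.zero_grad()
for step, batch in enumerate(loader):
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss / accum_steps   # scale so accumulated gradients average out
    loss.backward()                            # gradients sum across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # one optimizer update per 256 examples
        optimizer.zero_grad()
```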

Nevertheless, I would not suggest running pre-training from scratch on a single GPU, because it will take quite some time (probably a week or two for a single run). I would instead suggest running DAPT or TAPT, i.e. a second phase of pre-training on top of the existing pre-trained LM; you can check some of the references as follows:
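As a rough sketch of that second-phase (DAPT/TAPT-style) idea, assuming Hugging Face Transformers and Datasets: continue masked-LM pre-training from the released IndoBERT checkpoint on a smaller in-domain corpus instead of training from scratch on Indo4B. The file name `domain_corpus.txt` and the hyperparameters below are illustrative assumptions, not prescriptions from the paper.

```python
# Minimal DAPT/TAPT-style sketch: continue masked-LM pre-training from the
# released IndoBERT checkpoint on an in-domain corpus, instead of training
# from scratch on Indo4B. File name and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "indobenchmark/indobert-base-p1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)   # start from pre-trained weights

# domain_corpus.txt: one in-domain (or task-domain) sentence per line.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="indobert-dapt",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=16,   # effective batch size of 256, as in the paper
    learning_rate=1e-4,
    num_train_epochs=1,
)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```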

krisbianprabowo commented 1 year ago

Thank you so much for your detailed explanation @SamuelCahyawijaya, really appreciate it!

krisbianprabowo commented 1 year ago

I'm sorry for re-opening this issue. Just to make it clear: you used cloud computing for pre-training from scratch with the kind of configuration shown below, am I right?

[image]