ChunyuanLI / Optimus

Optimus: the first large-scale pre-trained VAE language model

Curious about the Computing Resources for Pre-training Optimus #10

Closed: fakeProgrammer0 closed this issue 4 years ago

fakeProgrammer0 commented 4 years ago

The paper states:

First, our pre-trained language VAE is still under-trained due to limited compute resource, as the training reconstruction loss can still decrease. One may further train the models with higher latent dimension and longer time to fully release the power of pre-trained latent spaces.

So how long did it take (in days or weeks) to pre-train Optimus, with its encoder and decoder initialized from BERT and GPT-2 weights, respectively?

ChunyuanLI commented 4 years ago

It takes 5 days to complete 1 epoch on Wikipedia with 8 V100 GPUs. I believe the controllability of the models can be further increased by (1) increasing the latent dimension, and (2) training longer.

fakeProgrammer0 commented 4 years ago

It takes 5 days to complete 1 epoch on Wikipedia with 8 V100 GPUs. I believe the controllability of the models can be further increased by (1) increasing the latent dimension, and (2) training longer.

Thanks for replying. How many epochs did you pre-train Optimus for? Specifically, what were the batch size and the number of pre-training steps?

ChunyuanLI commented 4 years ago

For the results reported in the paper, I used the model pre-trained for 1 epoch with latent size 32.

Here is one example for the pre-training script: https://github.com/ChunyuanLI/Optimus/blob/master/code/scripts/scripts_philly/train_vae_wikipedia_distributed.yaml

The batch size is 16 × 8 = 128 sentences (16 per GPU across the 8 GPUs). There are nearly 2M sentences in Wikipedia.
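
For a rough sense of scale, the figures quoted in this thread imply a concrete step count and GPU budget. The sketch below just works through that arithmetic; the ~2M-sentence count and 5-day-per-epoch figure come from the comments above, so the outputs are estimates rather than exact numbers from the pre-training run.

```python
# Back-of-the-envelope estimate of the Optimus pre-training budget,
# using only the numbers quoted in this issue thread (rough estimates).

per_gpu_batch = 16          # sentences per GPU, from "16 * 8 = 128" above
num_gpus = 8                # 8 V100 GPUs, as stated above
sentences = 2_000_000       # "nearly 2M sentences in Wikipedia"
days_per_epoch = 5          # reported wall-clock time for 1 epoch

effective_batch = per_gpu_batch * num_gpus            # 128 sentences per step
steps_per_epoch = sentences // effective_batch        # ~15,625 optimizer steps
gpu_hours = days_per_epoch * 24 * num_gpus            # ~960 V100 GPU-hours

print(f"effective batch size : {effective_batch}")
print(f"steps per epoch      : ~{steps_per_epoch:,}")
print(f"GPU-hours per epoch  : ~{gpu_hours:,}")
```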

fakeProgrammer0 commented 4 years ago

Thanks for your reply : )

fakeProgrammer0 commented 4 years ago

#8 is a similar issue.