derlem / kanarya

A deep learning model for classification of 'de/da' clitics in Turkish

Run pretraining step of BERT model on sentence-split Turkish corpus #1

Closed: uskudarli closed this issue 5 years ago

uskudarli commented 5 years ago

Run the BERT pretraining step on the sentence-split Turkish corpus provided by Onur, using pytorch-pretrained-BERT.
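
For context, a minimal sketch of what this step could look like with pytorch-pretrained-BERT; the vocab size, hyperparameters, and the toy batch are illustrative assumptions, not the project's actual setup:

```python
import torch
from pytorch_pretrained_bert import BertConfig, BertForPreTraining, BertAdam

# Hypothetical model/vocab size; the real run would use a WordPiece
# vocab built for Turkish.
VOCAB_SIZE = 32000
config = BertConfig(vocab_size_or_config_json_file=VOCAB_SIZE)
model = BertForPreTraining(config)
optimizer = BertAdam(model.parameters(), lr=1e-4, warmup=0.01, t_total=10000)

# A toy batch standing in for the real data pipeline, which would mask
# tokens and pair sentences from the corpus. Shapes: (batch, seq_len);
# -1 in masked_lm_labels means "not masked, ignore in the loss".
input_ids = torch.randint(0, VOCAB_SIZE, (8, 128))
masked_lm_labels = torch.full((8, 128), -1, dtype=torch.long)
masked_lm_labels[:, 3] = input_ids[:, 3]   # pretend position 3 was masked
next_sentence_label = torch.zeros(8, dtype=torch.long)

# BertForPreTraining returns the combined masked-LM + next-sentence loss
# when both label tensors are supplied.
loss = model(input_ids,
             masked_lm_labels=masked_lm_labels,
             next_sentence_label=next_sentence_label)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```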

uskudarli commented 5 years ago

@ugurcanarikan -- this task is in progress, right? It seems to be sitting in the "to do" list. I am trying to track the work being done and what is still in the list. I think things are not quite being recorded yet, right?

ugurcanarikan commented 5 years ago

#4 has just been completed and #5 is currently in progress. Once #5 is completed and the pretraining data has been created, I will start the pretraining.
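
For reference, a small sketch of writing the pretraining corpus, assuming the same layout convention as the reference BERT implementation (one sentence per line, a blank line between documents); the helper name and inputs here are hypothetical:

```python
def write_pretraining_corpus(documents, path):
    """documents: iterable of sentence lists, one list per document."""
    with open(path, "w", encoding="utf-8") as f:
        for sentences in documents:
            for sentence in sentences:
                f.write(sentence.strip() + "\n")   # one sentence per line
            f.write("\n")                          # blank line ends a document
```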

uskudarli commented 5 years ago

Gotcha. Thanks.

ugurcanarikan commented 5 years ago

A trial pretraining run with batch size = 32 and train steps = 20 has been completed. Pretraining with batch size = 1024 and train steps = 10000 is now in progress.
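
In terms of sequences seen, the two runs differ by roughly four orders of magnitude (quick arithmetic on the numbers above):

```python
# sequences seen = batch_size * train_steps
trial_run = 32 * 20       # 640 sequences -- just a smoke test
full_run = 1024 * 10000   # 10,240,000 sequences
```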

uskudarli commented 5 years ago

@ugurcanarikan

You are working on the case of steps = 2.5 million now, right?

What is the status?

ugurcanarikan commented 5 years ago

In order to pretrain BERT for 10 epochs with a batch size of 56, we had calculated the number of training steps to be 26.5 million. But since it would take around 80 days to complete 26.5 million steps on our RTX 2080, we decided to pretrain BERT for at least 2.65 million steps, which corresponds to 1 epoch, and to continue pretraining later. Currently, pretraining is at step 3.18 million. After flair is trained with GloVe and Turkish fastText embeddings, I will pause pretraining and extract BERT embeddings as well, to use them in training flair.
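
A back-of-the-envelope reconstruction of these numbers; the implied corpus size is an inference, not a figure stated in this thread:

```python
batch_size = 56
steps_for_10_epochs = 26.5e6
steps_per_epoch = steps_for_10_epochs / 10           # ~2.65M steps per epoch
sequences_per_epoch = steps_per_epoch * batch_size   # ~148.4M training sequences
days_per_epoch = 80 / 10                             # ~8 days per epoch on one RTX 2080
```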