dmamur / elembert


Question about the pretraining data #2

Closed whikwon closed 8 months ago

whikwon commented 8 months ago

Thank you for sharing the excellent code.

I have one question: is it correct to use BERT for pretraining and then fine-tune it on a specific dataset not used during pretraining? If so, could you provide information on the dataset used for pretraining?

Thanks.

2shakir commented 8 months ago

> Thank you for sharing the excellent code.
>
> I have one question: is it correct to use BERT for pretraining and then fine-tune it on a specific dataset not used during pretraining? If so, could you provide information on the dataset used for pretraining?
>
> Thanks.

Hi, thank you for your interest and your question. Yes, the common practice is to pretrain a BERT model on a large corpus and then fine-tune it for specific tasks on a distinct dataset, and our model follows the same methodology. In our approach, the model is pretrained on structural data such as crystal structures (CIF), SMILES strings, or any xyz file to obtain element embeddings that reflect structural characteristics; the pretrained model can then be fine-tuned on any dataset. Our next objective is to train on various chemical databases (OQMD, the Crystallography Open Database, the Materials Project, RCSB, ...), and we may refer to the JARVIS project for its comprehensive compilation of structural data. Don't hesitate to ask any questions.
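To make the pretrain-then-fine-tune workflow concrete, here is a minimal, illustrative sketch in PyTorch with HuggingFace Transformers. It is not the repository's actual training code: the tiny element vocabulary, model sizes, and example sequence are placeholder assumptions, and the real model is trained on structure-derived element sequences as described above.

```python
# Illustrative sketch only (not elembert's actual code): pretrain a small BERT-style
# masked-language model on element-token sequences, then reuse its encoder for fine-tuning.
import torch
from transformers import BertConfig, BertForMaskedLM, BertForSequenceClassification

# Hypothetical element vocabulary: element symbol -> token id (plus special tokens).
el2id = {"[PAD]": 0, "[MASK]": 1, "[CLS]": 2, "H": 3, "C": 4, "O": 5, "Si": 6}

config = BertConfig(
    vocab_size=len(el2id),
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
)

# 1) Pretraining: predict masked element tokens in a structure-derived sequence.
mlm_model = BertForMaskedLM(config)
input_ids = torch.tensor([[2, 6, 5, 1, 5]])          # e.g. a SiO2-like sequence with one [MASK]
labels = torch.tensor([[-100, -100, -100, 5, -100]]) # only the masked position contributes to the loss
loss = mlm_model(input_ids=input_ids, labels=labels).loss
loss.backward()  # an optimizer step would follow in a real training loop

# 2) Fine-tuning: transfer the pretrained encoder to a downstream classification head.
clf_model = BertForSequenceClassification(config)
clf_model.bert = mlm_model.bert  # reuse the pretrained encoder weights
```

The fine-tuning step would then train `clf_model` on the task-specific dataset, which need not overlap with the pretraining corpus.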

whikwon commented 8 months ago

Thanks for the quick response. I'm trying to fine-tune without pretraining, following the method you shared in the notebook. I noticed that the repository uses the el2idV0.pkl file for fine-tuning. Could you please let me know which dataset this checkpoint was trained on?

Thanks

2shakir commented 8 months ago

> I noticed that the repository uses the el2idV0.pkl file for fine-tuning. Could you please let me know which dataset this checkpoint was trained on?

This file is not a pretraining checkpoint; it is only a vocabulary file that maps elements to their corresponding IDs (tokens). The pretrained model, along with the details and strategy for fine-tuning, will be published later together with a comprehensive description; additional tests are underway. As a note, our pretraining is performed on the Materials Project (MP) and Crystallography Open Database (COD) databases.
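For reference, a minimal sketch of how such a vocabulary file could be inspected, assuming el2idV0.pkl is a plain pickled dict of element symbols to integer token IDs (the actual structure in the repository may differ):

```python
# Assumption: el2idV0.pkl is a pickled dict mapping element symbols to token ids.
import pickle

with open("el2idV0.pkl", "rb") as f:
    el2id = pickle.load(f)

print(len(el2id))               # vocabulary size
print(list(el2id.items())[:5])  # a few (element, id) pairs
```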

whikwon commented 8 months ago

Ah, I understand. Thank you for the kind response. :)