IBM / regression-transformer

Regression Transformer (2023; Nature Machine Intelligence)
https://www.nature.com/articles/s42256-023-00639-z
MIT License

XLNetTokenizer and BertExpressionTokenizer #10

Closed pjuangph closed 1 year ago

pjuangph commented 1 year ago
  1. I see you have two tokenizers, BERT and XLNet. How are you using the two of them?
  2. Can you give more detail in the README on how you go from the vocab to training?
jannisborn commented 1 year ago

Hi @pjuangph,

Thanks for the interest.

  1. Even though our backbone is XLNet, we use a tokenizer class that inherits from the BERT tokenizer (called ExpressionBertTokenizer). That's not problematic, because we don't finetune XLNet; we train it entirely from scratch on molecules (the original XLNet was only trained on natural text). So, in general we use this ExpressionBertTokenizer for modeling small molecules, proteins, and chemical reactions. However, we have one application on an NLP dataset. For that one dataset we finetune the original XLNet model, and hence wrote the XLNetRTTokenizer, which inherits from the XLNetTokenizer.
  2. I'm not quite sure what you're asking here; could you be more specific? The README describes how to launch a training, so in principle this is covered. Do you need intuition about which parameters to set?
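
To illustrate point 1, here is a minimal, self-contained sketch of the general pattern (this is *not* the repo's actual ExpressionBertTokenizer, and the regex and vocab handling are simplified assumptions): a BERT-style token-to-id vocabulary built from a molecule corpus, with a SMILES-like regex replacing natural-language wordpiece splitting.

```python
import re

# Hypothetical illustration of the idea behind ExpressionBertTokenizer:
# a custom vocab plus a domain-specific pre-tokenizer, instead of the
# natural-language wordpieces a stock BERT tokenizer would use.

# Toy regex splitting a SMILES string into atom-level tokens (simplified).
SMILES_PATTERN = re.compile(r"Cl|Br|[A-Za-z]|\d|[=#()\[\]+\-]")

def build_vocab(corpus):
    """Collect every token seen in the corpus into a token -> id mapping,
    reserving the usual BERT-style special tokens first."""
    vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4}
    for smiles in corpus:
        for tok in SMILES_PATTERN.findall(smiles):
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def encode(smiles, vocab):
    """Tokenize a molecule and map tokens to ids, wrapped in [CLS]/[SEP]."""
    tokens = ["[CLS]"] + SMILES_PATTERN.findall(smiles) + ["[SEP]"]
    return [vocab.get(t, vocab["[UNK]"]) for t in tokens]

corpus = ["CCO", "c1ccccc1", "CC(=O)O"]
vocab = build_vocab(corpus)
ids = encode("CCO", vocab)  # e.g. ethanol -> [CLS] C C O [SEP] as ids
```

Since the model is trained from scratch on these ids, the mismatch between the XLNet backbone and a BERT-derived tokenizer never matters: only the vocab and the splitting rule have to fit the data.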

Hope this helps

jannisborn commented 1 year ago

Closing this, feel free to reopen if applicable