Closed: himanshisyadav closed this issue 1 year ago
Hi, thanks for raising the issue. We need supplementary vocab files for tokenization. The tokenizer we use here is adapted from the pretrained RoBERTa tokenizer, whose vocab.json and merges.txt are not explicitly included in the repo. Since the tokenizer was not trained for our case, some pieces of text that we expect to be single tokens may be tokenized into multiple ones. This is especially important for the descriptors appended to polymer SMILES: for example, a NaN token for Tg, "NAN_Tg", will not be a single token after tokenization. Hence, we have a supplementary vocab file for each dataset that contains those tokens. When using the tokenizer, we add those tokens to the vocab.json of the RoBERTa tokenizer so that the tokenizer "knows" to keep them as single tokens. This step corresponds to lines 231-233 in Downstream.py.
For more information about tokenizers, you can refer to the introductions and tutorials by Hugging Face. Here are two sources that I found:
https://huggingface.co/docs/transformers/tokenizer_summary
https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt
Let me know if you have any other questions.
Thank you for your explanation! This was very helpful!
Sorry, I actually have a question now! How do you create the supplemental vocab file for your own corpus? Do you go over your own corpus and find tokens that are missing from RoBERTa and then add those to the supplemental vocab file?
You can do that, or you can simply add the tokens you know you need (for example, discrete numbers, NAN tokens, etc.) to the sup vocab file directly. There might be some overlap with the original vocab file, but that's not a problem.
Hi
I may be missing something and am not quite sure what the supplementary vocab file is for. Can someone give me a rough idea or resources to look into?
Thanks!