Closed: himanshisyadav closed this issue 1 year ago
Hi, thanks for raising the issue. We need supplementary vocab files for tokenization. The tokenizer we use here is adapted from the pretrained RoBERTa tokenizer, whose vocab.json and merges.txt are not explicitly included in the repo. Since the tokenizer was not trained for our case, some pieces of text that we expect to be single tokens may be tokenized into multiple ones. This is especially important for the descriptors appended to polymer SMILES: for example, a NaN token for Tg, "NAN_Tg", will not be a single token after tokenization. Hence, we have a supplementary vocab file for each dataset that contains those tokens. When using the tokenizer, we add those tokens to the vocab.json of the RoBERTa tokenizer so that the tokenizer "knows" to keep them as single tokens. This step corresponds to lines 231-233 in Downstream.py.
For more information about tokenizers, you can refer to the introductions and tutorials by Hugging Face. Here are two sources that I found:
https://huggingface.co/docs/transformers/tokenizer_summary
https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt
Let me know if you have any other questions.
Thank you for your explanation! This was very helpful!
Sorry, I actually have a question now! How do you create the supplemental vocab file for your own corpus? Do you go over your own corpus and find tokens that are missing from RoBERTa and then add those to the supplemental vocab file?
You can do that, or you can simply add the tokens you know you need (for example, discrete numbers, NAN tokens, etc.) to the sup vocab file directly. There might be some overlap with the original vocab file, but that's not a problem.
Hi
I may be missing something and am not quite sure what the supplementary vocab file is for. Can someone give me a rough idea or resources to look into?
Thanks!