apoorvumang / kgt5

ACL 2022: Sequence-to-Sequence Knowledge Graph Completion and Question Answering (KGT5)
Apache License 2.0

Tokenizer questions #26

Open screemix opened 1 year ago

screemix commented 1 year ago

In the paper you explicitly mentioned that you trained a BPE tokenizer for your experiments:

[screenshot: excerpt from the paper stating that a BPE tokenizer was trained]

However, in dataset.py you use T5TokenizerFast, which is based on a Unigram model rather than BPE.
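For reference, this is the kind of call being referred to; a minimal sketch assuming the standard Hugging Face transformers API, not necessarily the exact line in dataset.py:

```python
from transformers import T5TokenizerFast

# Load the pretrained T5 tokenizer. Its underlying SentencePiece model
# is Unigram-based, not BPE-based.
tokenizer = T5TokenizerFast.from_pretrained("t5-small")

# Example: tokenize a verbalized query (illustrative input only).
print(tokenizer.tokenize("predict tail: barack obama | born in"))
```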

Moreover, you used a pretrained tokenizer in the code:

[screenshot: code loading the pretrained T5 tokenizer]

Could you please clarify which tokenizer configuration was used in your experiments, so that they can be reproduced?

Also, could you please specify the vocabulary sizes for WN18RR, FB15k-237, and YAGO3-10? There is no information about these in the paper.

apoorvumang commented 1 year ago

Hi @screemix , thanks for your interest!

  1. The code in the main branch is old and is not the one used for the final results - please see the code in the apoorv-dump branch for that. There, we used a custom tokenizer trained with the SentencePiece library using BPE (see the sketch after this list).
  2. Unfortunately, we do not have the vocab sizes for those datasets (we did not keep a record, and the servers on which training was done are no longer accessible to me). However, my best guess is that the vocab size for WN18RR and FB15k-237 was around 10k tokens (larger vocabularies triggered some kind of BPE issue), and around 25k-30k for YAGO3-10.
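A minimal sketch of training such a BPE tokenizer with the SentencePiece library; the input file name and the exact vocab size are placeholders, since the original training files and sizes were not recorded:

```python
import sentencepiece as spm

# Train a BPE tokenizer on the verbalized entity/relation text.
# "corpus.txt" and vocab_size=10000 are assumptions for illustration:
# roughly 10k for WN18RR/FB15k-237, 25k-30k for YAGO3-10.
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # one text sequence per line
    model_prefix="kg_bpe",   # writes kg_bpe.model and kg_bpe.vocab
    vocab_size=10000,
    model_type="bpe",
)

# Load the trained tokenizer and encode an example query.
sp = spm.SentencePieceProcessor(model_file="kg_bpe.model")
print(sp.encode("barack obama | born in", out_type=str))
```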