adapter-hub / hgiyt

Research code for the paper "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models"
https://arxiv.org/abs/2012.15613

Version of UD-Treebanks used for Tokenizer Experiments #2

Closed · kabirahuja2431 closed this 3 years ago

kabirahuja2431 commented 3 years ago

Hello,

I am trying to reproduce some of the experiments from the paper (the ones involving tokenizer metrics), and I am getting slightly different values for the fertility and proportion-of-continued-words metrics. I was wondering whether using a different version of the UD treebanks (I am on v2.8) might be causing the discrepancy. It would be great if you could tell me which version was used for the experiments in the paper.

Thanks

xplip commented 3 years ago

Hi Kabir,

We used the UDv2.6 treebanks. I'm not sure how they differ from v2.8, or whether that could explain the discrepancy, but do let me know if switching versions fixes your issue.

kabirahuja2431 commented 3 years ago

Hi Phillip,

Thanks for your quick response. I reran my experiments with the UDv2.6 treebanks, and the discrepancies still exist. Here are the numbers I am getting for the proportion of continued words and the fertility metrics with mBERT:

| Language | Proportion of continued words | Fertility |
|----------|-------------------------------|-----------|
| ar       | 0.11                          | 1.21      |
| en       | 0.10                          | 1.16      |
| fi       | 0.57                          | 2.19      |
| id       | 0.25                          | 1.40      |
| ja       | 0.05                          | 1.06      |
| ko       | 0.67                          | 2.28      |
| ru       | 0.38                          | 1.80      |
| tr       | 0.55                          | 2.05      |
| zh       | 0.46                          | 1.50      |
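
For reference, this is roughly the computation I am running (a minimal sketch of the metrics as I understand them, assuming fertility is the mean number of subwords per word and a word counts as continued when the tokenizer splits it into two or more pieces; the `tokenizer_metrics` helper and the toy word list are just illustrative, since my actual code follows your notebook):

```python
from transformers import AutoTokenizer

def tokenizer_metrics(words, tokenizer):
    """Fertility = mean subwords per word; continued = split into >= 2 pieces."""
    n_subwords = 0
    n_continued = 0
    for word in words:
        pieces = tokenizer.tokenize(word)
        n_subwords += len(pieces)
        if len(pieces) > 1:  # word was broken into multiple subwords
            n_continued += 1
    return n_continued / len(words), n_subwords / len(words)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tokenizer_metrics(["tokenization", "is", "fun"], tokenizer))
```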

This is how I am selecting the UD files for each language:

```python
import glob
import os

data_dir = "data/ud-treebanks-v2.6"
languages = ["ar", "en", "fi", "id", "ja", "ko", "ru", "tr", "zh"]

dev_files = glob.glob(os.path.join(data_dir, "*", "*dev.conllu"))
train_files = glob.glob(os.path.join(data_dir, "*", "*train.conllu"))

language_ud_dict = {}
mBERT_ud_dict = {}
xlmr_ud_dict = {}
for l in languages:
    # find all dev and train files whose basename starts with the language code
    l_files = [f for f in dev_files if os.path.basename(f).startswith(f"{l}_")]
    l_files.extend(f for f in train_files if os.path.basename(f).startswith(f"{l}_"))
    # register the same file list in each per-tokenizer dictionary
    language_ud_dict[l] = {"files": l_files}
    mBERT_ud_dict[l] = {"files": l_files}
    xlmr_ud_dict[l] = {"files": l_files}
```

After this I pretty much follow the explore_tokenizers.ipynb notebook, so I am not sure where the discrepancy is coming from. The only variable between our setups seems to be the UD data. Would it be possible for you to share your UD data directory with me? That would help me a lot in debugging this issue.

Thanks for your valuable time, Kabir

xplip commented 3 years ago

I also suspect you are using a different combination of treebanks than we did (at least for some languages). I uploaded the data here; it should hopefully allow you to reproduce the results :). As a sanity check: when the explore_tokenizers.ipynb notebook prints the word counts during tokenization, you can sum them up for every language and check that they match the numbers in Table 6 (appendix) of our paper.
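
If you'd rather verify those counts independently of the notebook, here is a rough way to count words directly from the treebank files (just a sketch, not the notebook's exact counting logic, assuming a word is any CoNLL-U token line whose ID is a plain integer, i.e. skipping comments, multiword-token ranges like 1-2, and empty nodes like 1.1):

```python
import glob

def count_words(conllu_path):
    """Count basic word lines in a CoNLL-U file."""
    n = 0
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and sentence-level comments
            token_id = line.split("\t", 1)[0]
            if token_id.isdigit():  # excludes ranges ("1-2") and empty nodes ("1.1")
                n += 1
    return n

# e.g. sum over one language's train/dev files and compare against Table 6
files = glob.glob("data/ud-treebanks-v2.6/*/en_*dev.conllu")
files += glob.glob("data/ud-treebanks-v2.6/*/en_*train.conllu")
print(sum(count_words(p) for p in files))
```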

The raw numbers I'm getting are:

| Language | Prop. continued words | Fertility |
|----------|-----------------------|-----------|
| ar       | 0.405564              | 1.758088  |
| en       | 0.125764              | 1.205968  |
| fi       | 0.572893              | 2.193600  |
| id       | 0.253820              | 1.405408  |
| ja       | 0.371535              | 1.463610  |
| ko       | 0.672935              | 2.284977  |
| ru       | 0.385303              | 1.802988  |
| tr       | 0.557433              | 2.047063  |
| zh       | 0.461057              | 1.503509  |

kabirahuja2431 commented 3 years ago

Thanks a lot for your help. With the data you shared I am now able to reproduce the exact numbers in the paper (and the ones you posted above). Really appreciate it. Closing the issue.