Hello,
I was trying to reproduce some of the experiments in the paper (the tokenizer metrics) and I am getting slightly different values for the fertility and continuation metrics. I was wondering whether the version of the UD treebanks I am using (v2.8) might be causing the discrepancy. It would be great if you could tell me which version was used for the experiments in the paper.
Thanks
Hi Kabir,
we used the UD v2.6 treebanks. I'm not sure how they differ from v2.8, or whether that is the reason for the discrepancies, but do let me know if switching versions fixes your issue.
Hi Phillip,
Thanks for the quick response. I re-ran my experiments with the UD v2.6 treebanks and the discrepancies still exist. Here are the numbers I am getting with mBERT for the proportion of continued words and fertility metrics:
| Language | Proportion of continued words | Fertility |
|---|---|---|
| ar | 0.11 | 1.21 |
| en | 0.10 | 1.16 |
| fi | 0.57 | 2.19 |
| id | 0.25 | 1.40 |
| ja | 0.05 | 1.06 |
| ko | 0.67 | 2.28 |
| ru | 0.38 | 1.80 |
| tr | 0.55 | 2.05 |
| zh | 0.46 | 1.50 |
This is how I am selecting the UD files for each language:
```python
import glob
import os

data_dir = "data/ud-treebanks-v2.6"
languages = ["ar", "en", "fi", "id", "ja", "ko", "ru", "tr", "zh"]

dev_files = glob.glob(os.path.join(data_dir, "*", "*dev.conllu"))
train_files = glob.glob(os.path.join(data_dir, "*", "*train.conllu"))

language_ud_dict = {}
mBERT_ud_dict = {}
xlmr_ud_dict = {}

for l in languages:
    # collect all dev and train files whose filename starts with the language code
    l_files = [f for f in dev_files if os.path.basename(f).startswith(f"{l}_")]
    l_files.extend(f for f in train_files if os.path.basename(f).startswith(f"{l}_"))
    # the same file list is reused for both models
    language_ud_dict[l] = {"files": l_files}
    mBERT_ud_dict[l] = {"files": l_files}
    xlmr_ud_dict[l] = {"files": l_files}
```
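As a quick check on which treebanks this picks up per language, I print the matched treebank directories (a small snippet of my own for inspection, not something from the notebook):

```python
# list the treebank directories matched for each language, to see
# exactly which treebanks are included
for l, entry in language_ud_dict.items():
    treebanks = sorted({os.path.basename(os.path.dirname(f)) for f in entry["files"]})
    print(l, treebanks)
```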
After this I am pretty much following the explore_tokenizers.ipynb notebook, so I am not sure where the discrepancy is arising. The only variable between our setups seems to be the UD data. Would it be possible for you to share your UD data directory with me? That would help me a lot in debugging this issue.
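In case it helps with debugging, this is roughly what I compute after selecting the files. It is a minimal sketch assuming the `conllu` package and a Hugging Face tokenizer; `tokenizer_metrics` is my own helper name, not something from the notebook, and I am reading "proportion of continued words" as the fraction of words split into more than one subword piece:

```python
import conllu
from transformers import AutoTokenizer

def tokenizer_metrics(files, tokenizer):
    n_words = 0      # regular UD word tokens seen
    n_subwords = 0   # subword pieces produced by the tokenizer
    n_continued = 0  # words split into more than one piece
    for path in files:
        with open(path, encoding="utf-8") as f:
            for sentence in conllu.parse_incr(f):
                for token in sentence:
                    # skip multiword ranges ("3-4") and empty nodes ("5.1"),
                    # whose ids are not plain integers
                    if not isinstance(token["id"], int):
                        continue
                    pieces = tokenizer.tokenize(token["form"])
                    if not pieces:
                        continue
                    n_words += 1
                    n_subwords += len(pieces)
                    if len(pieces) > 1:
                        n_continued += 1
    return n_continued / n_words, n_subwords / n_words

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
prop_cont, fertility = tokenizer_metrics(mBERT_ud_dict["fi"]["files"], tok)
print(f"fi: prop. continued = {prop_cont:.2f}, fertility = {fertility:.2f}")
```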
Thanks for your valuable time,
Kabir
I also suspect you are using a different combination of treebanks than we did (at least for some languages). I uploaded the data here. It should hopefully allow you to reproduce the results :). As a sanity check, when the explore_tokenizers.ipynb notebook spits out the word counts during tokenization, you can sum them up for every language and check that the totals match the numbers in Table 6 (appendix) of our paper.
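For the check, something along these lines should do (a rough sketch, again assuming the `conllu` package; `count_words` is just an illustrative helper, since the notebook prints these counts itself):

```python
import conllu

def count_words(files):
    # count regular word tokens (integer ids) across a list of .conllu files
    total = 0
    for path in files:
        with open(path, encoding="utf-8") as f:
            for sentence in conllu.parse_incr(f):
                total += sum(1 for tok in sentence if isinstance(tok["id"], int))
    return total

# totals per language, to compare against Table 6 in the appendix
for lang, entry in language_ud_dict.items():
    print(lang, count_words(entry["files"]))
```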
The raw numbers I'm getting are:

| Language | Prop. continued words | Fertility |
|---|---|---|
| ar | 0.405564 | 1.758088 |
| en | 0.125764 | 1.205968 |
| fi | 0.572893 | 2.193600 |
| id | 0.253820 | 1.405408 |
| ja | 0.371535 | 1.463610 |
| ko | 0.672935 | 2.284977 |
| ru | 0.385303 | 1.802988 |
| tr | 0.557433 | 2.047063 |
| zh | 0.461057 | 1.503509 |
Thanks a lot for your help. I can now reproduce the exact numbers in the paper (and the ones you shared). Really appreciate it. Closing the issue.