JunjieHu / xtreme-dev

Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME)
MIT License

Issues with download_udpos #1

Open zphang opened 4 years ago

zphang commented 4 years ago

Hi,

I'm currently running the download script for XTREME. I'm running into some issues with the downloading and preprocessing of the UD data, and wanted to check if some of these are an issue with my setup or an issue with the provided code.

  1. The script uses the third party ud-conversion-tools file $REPO/third_party/ud-conversion-tools/conllu_to_conll.py. However, the script contains the line
    from lib.conll import CoNLLReader
    whereas the lib folder from ud-conversion-tools has not been included in the $REPO/third_party/ud-conversion-tools folder. I was able to get around this by separately cloning https://github.com/coastalcph/ud-conversion-tools and adding it to my PYTHONPATH.
  2. After correcting for the above, a good number of the UD preprocessing commands work, but a small number still run into errors (or warnings). Are these to be expected? (These are just messages I grabbed during my run.)

Case 1.

python /mypath/xtreme/third_party/ud-conversion-tools/conllu_to_conll.py /mypath/xtreme/download//udpos-tmp/ud-treebanks-v2.5/UD_Dutch-Alpino/nl_alpino-ud-train.conllu /mypath/xtreme/download//udpos-tmp/conll//nl//nl_alpino-ud-train.conll --lang nl --replace_subtokens_with_fused_forms --print_fused_forms
Traceback (most recent call last):
  File "/mypath/xtreme/third_party/ud-conversion-tools/conllu_to_conll.py", line 53, in <module>
    main()
  File "/mypath/xtreme/third_party/ud-conversion-tools/conllu_to_conll.py", line 41, in main
    orig_treebank = cio.read_conll_u(args.input)#, args.keep_fused_forms, args.lang, POSRANKPRECEDENCEDICT)
  File "/mypath/xtreme/ud-conversion-tools/lib/conll.py", line 350, in read_conll_u
    token_dict = {key: conv_fn(val) for (key, conv_fn), val in zip(self.CONLL_U_COLUMNS, parts)}
  File "/mypath/xtreme/ud-conversion-tools/lib/conll.py", line 350, in <dictcomp>
    token_dict = {key: conv_fn(val) for (key, conv_fn), val in zip(self.CONLL_U_COLUMNS, parts)}
  File "/mypath/xtreme/ud-conversion-tools/lib/conll.py", line 26, in parse_deps
    return [(int(pair[0]), pair[1]) for pair in dep_pairs]
  File "/mypath/xtreme/ud-conversion-tools/lib/conll.py", line 26, in <listcomp>
    return [(int(pair[0]), pair[1]) for pair in dep_pairs]
ValueError: invalid literal for int() with base 10: '5.1'

Case 2.

Not a tree after fused-form heuristics: غزة 15 - 8 ( اف ب ) - حذرت الجبهة الشعبية لتحرير فلسطين وحزب الخلاص الوطني ، الاسلامي القريب من حركة حماس ، من اية محاولات او اف منه الى وكالة فرانس برس الى " ضرورة الحفاظ على المصداقية في هذا الخصوص والا فان الدولة ستتحول الى ورقة استهلاكية تستخدم في المناسبات " .

Case 3.

Traceback (most recent call last):
  File "/mypath/xtreme/third_party/ud-conversion-tools/conllu_to_conll.py", line 53, in <module>
    main()
  File "/mypath/xtreme/third_party/ud-conversion-tools/conllu_to_conll.py", line 48, in main
    s.filter_sentence_content(args.replace_subtokens_with_fused_forms, args.lang, current_pos_precedence_list,args.remove_node_properties,args.remove_deprel_suffixes,args.remove_arabic_diacritics)
  File "/mypath/xtreme/ud-conversion-tools/lib/conll.py", line 219, in filter_sentence_content
    self._keep_fused_form(posPreferenceDict)
  File "/mypath/xtreme/ud-conversion-tools/lib/conll.py", line 179, in _keep_fused_form
    deprel = self[localhead][ext_dep]["deprel"]
KeyError: 3

Thanks!

JunjieHu commented 4 years ago

Hi @zphang, thanks a lot for pointing out the issues with detailed cases!

  1. The .gitignore file made me miss the lib folder. I've just uploaded my modified conll.py file. As for the particular error in your Case 1, a very small number of words in some files have non-integer indexes (such as '5.1'), so I filter them out here: https://github.com/JunjieHu/xtreme/blob/develop/third_party/ud-conversion-tools/lib/conll.py#L28

  2. That warning appears because the heuristic conversion breaks the single-tree structure of the sentence. Since we are mostly concerned with the POS tagging task, that should be fine. I have also commented out that warning: https://github.com/JunjieHu/xtreme/blob/develop/third_party/ud-conversion-tools/lib/conll.py#L229

  3. If you use my uploaded file, there should not be such errors. I just tested the download script one more time on a fresh machine.
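
For illustration, the filtering described in item 1 might look roughly like the sketch below. This is not the code at the linked line: the function name `parse_deps_skip_non_integer_heads` and the exact handling are assumptions. It parses a CoNLL-U DEPS field (`head:deprel` pairs joined by `|`, or `_` when empty) and drops any pair whose head is not an integer, such as the `5.1` that triggered the ValueError in Case 1.

```python
def parse_deps_skip_non_integer_heads(field):
    """Parse a CoNLL-U DEPS field into (head, deprel) pairs.

    Hypothetical helper: pairs whose head index is not a plain integer
    (for example '5.1') are skipped instead of raising ValueError.
    """
    if field == "_":
        # An underscore marks an empty DEPS field.
        return []

    pairs = []
    for item in field.split("|"):
        # Split on the first colon only, since a relation may itself
        # contain colons (e.g. 'nmod:of').
        head, _, deprel = item.partition(":")
        try:
            head_index = int(head)
        except ValueError:
            # Non-integer head such as '5.1': filter it out.
            continue
        pairs.append((head_index, deprel))
    return pairs


if __name__ == "__main__":
    print(parse_deps_skip_non_integer_heads("5.1:conj|3:nsubj"))
```

Dropping these pairs loses some dependency information, which matches the reasoning in item 2 that the data is mainly used for POS tagging here.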