HIT-SCIR / ELMoForManyLangs

Pre-trained ELMo Representations for Many Languages
MIT License

Document what tokenisation was used for the offered models #47

Open jowagner opened 5 years ago

jowagner commented 5 years ago

Closed issue #45 indicates that UDPipe was used for tokenisation, and __main__.py suggests that you use the expanded form of CoNLL-U multiword tokens, e.g. the two tokens "de le" instead of "du" in French. The README should mention both.

jowagner commented 5 years ago

However, the config.json of a downloaded model suggests that the model was not trained on a CoNLL-U file: "train_path": "/users4/conll18st/raw_text/Czech/cs-20m.raw". Are there historical reasons for this, i.e. was the CoNLL-U input format only added to elmoformanylangs later, and did you use an external conllu-to-raw converter at the time?

Oneplus commented 5 years ago

> was the CoNLL-U input format only added to elmoformanylangs later, and did you use an external conllu-to-raw converter at the time?

cs-20m.raw was produced by an external conllu-to-raw script. The original data can be found at https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989 and was preprocessed with UDPipe.
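
For readers who want to reproduce this kind of input, here is a minimal sketch of what a conllu-to-raw conversion could look like. It is not the script that was actually used for the released models (that script is not shown in this thread). The function name conllu_to_raw and the helper row are hypothetical. The sketch assumes that the expanded form of multiword tokens is wanted, as discussed above, and that empty nodes (IDs such as "3.1") should be skipped; the latter is purely an assumption.

```python
def conllu_to_raw(conllu_text):
    """Return one space-separated sentence per line, built from the
    FORM column of syntactic words (multiword tokens are expanded)."""
    sentences = []
    words = []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if words:
                sentences.append(" ".join(words))
                words = []
            continue
        if line.startswith("#"):
            # Sentence-level comment such as "# text = ...".
            continue
        cols = line.split("\t")
        token_id, form = cols[0], cols[1]
        # Skip multiword token ranges ("3-4"), keeping their expanded
        # words, and skip empty nodes ("3.1") (assumption).
        if "-" in token_id or "." in token_id:
            continue
        words.append(form)
    if words:
        # Handle input that lacks a trailing blank line.
        sentences.append(" ".join(words))
    return "\n".join(sentences)


def row(token_id, form):
    """Hypothetical helper: a 10-column CoNLL-U line with only ID and FORM set."""
    return "\t".join([token_id, form] + ["_"] * 8)


sample = "\n".join([
    "# text = Il vient du marché.",
    row("1", "Il"),
    row("2", "vient"),
    row("3-4", "du"),   # surface token, skipped
    row("3", "de"),     # expanded word
    row("4", "le"),     # expanded word
    row("5", "marché"),
    row("6", "."),
    "",
])

print(conllu_to_raw(sample))
```

With the French sample, the surface token "du" is dropped and its expanded words "de" and "le" are kept, which matches the "de le" form mentioned earlier in the thread. FORM values that themselves contain spaces are not handled specially here.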