allenai / bilm-tf

TensorFlow implementation of contextualized word representations from bi-directional language models
Apache License 2.0

Tokenization #184

Closed. Rusiecki closed this issue 5 years ago.

Rusiecki commented 5 years ago

Is there any sample code for this part, especially for German tokenization?

matt-peters commented 5 years ago

We haven't trained any German models, so you are free to use whatever tokenizer you want when preparing the corpus for pretraining. For English, use the Moses tokenizer.
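For reference, here is a minimal sketch of Moses-style tokenization using the sacremoses package, a Python port of the original Perl Moses scripts (using sacremoses rather than the original scripts is an assumption; the input sentence is just an example):

```python
# A minimal sketch, assuming the sacremoses package (pip install sacremoses),
# a Python port of the Moses tokenizer scripts.
from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang="en")

# bilm-tf expects a pre-tokenized corpus with one sentence per line.
line = "Contextualized representations don't come for free."
tokens = mt.tokenize(line)  # returns a list of token strings
print(" ".join(tokens))
```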

PhilipMay commented 5 years ago

Hi! Here is a preprocessed German text corpus based on Wikipedia: https://github.com/t-systems-on-site-services-gmbh/german-wikipedia-text-corpus

Here is the preprocessing code: https://github.com/PhilipMay/de-wiki-text-corpus-tools/blob/master/process_wiki_files.py

I used it in combination with WikiExtractor. See here: https://eniak.de/it/training_of_german_word_embedding_for_nlp

I tested SoMaJo as a tokenizer on both the word and sentence level, and I love it: https://github.com/tsproisl/SoMaJo
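For anyone looking for a starting point, a minimal sketch of word- and sentence-level tokenization with SoMaJo (this assumes the v2 API; the example paragraph is made up):

```python
# A minimal sketch using SoMaJo's v2 API (pip install SoMaJo).
# "de_CMC" selects the German model; tokenize_text() splits sentences
# and tokenizes them in one pass.
from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC", split_camel_case=True)

paragraphs = ["Das ist ein Beispielsatz. Und hier noch einer."]
for sentence in tokenizer.tokenize_text(paragraphs):
    # Each sentence is a list of Token objects; writing one
    # whitespace-joined sentence per line matches what bilm-tf expects.
    print(" ".join(token.text for token in sentence))
```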