Closed — Rusiecki closed this issue 5 years ago
We haven't trained any German models, so you are free to use whatever tokenizer you want when preparing the corpus for pretraining. For English, use the Moses tokenizer.
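If you want to call the Moses tokenizer from Python, here is a minimal sketch using the sacremoses port (an assumption on my part; the original Moses tokenizer is a Perl script, and any port with an equivalent interface would work just as well):

```python
# Minimal sketch: English tokenization with sacremoses,
# a Python port of the Moses tokenizer.
# pip install sacremoses
from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang="en")

text = "Hello, world! This is a test."
tokens = mt.tokenize(text)  # returns a list of tokens
print(tokens)  # ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']
```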
Hi! Here is a preprocessed German text corpus based on Wikipedia: https://github.com/t-systems-on-site-services-gmbh/german-wikipedia-text-corpus
Here is the preprocessing code: https://github.com/PhilipMay/de-wiki-text-corpus-tools/blob/master/process_wiki_files.py
I did this in combination with WikiExtractor. See here: https://eniak.de/it/training_of_german_word_embedding_for_nlp
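For the WikiExtractor step, a rough sketch of reading the extracted articles back in could look like the following (this is not from my linked repo; the dump file name and output directory are placeholders, and the JSON layout follows WikiExtractor's documented `--json` mode):

```python
# Sketch: iterate over articles produced by WikiExtractor's --json mode.
# Extraction itself is a separate CLI step, roughly:
#   python -m wikiextractor.WikiExtractor dewiki-latest-pages-articles.xml.bz2 --json -o extracted
import json
from pathlib import Path

def iter_articles(extracted_dir):
    """Yield (title, text) pairs from WikiExtractor's wiki_* output files."""
    for path in sorted(Path(extracted_dir).rglob("wiki_*")):
        with open(path, encoding="utf-8") as f:
            for line in f:  # one JSON object per article
                article = json.loads(line)
                yield article["title"], article["text"]

for title, text in iter_articles("extracted"):
    print(title)
    break  # just show the first article
```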
I tested SoMaJo as a tokenizer on the word and sentence level and love it: https://github.com/tsproisl/SoMaJo — a small sketch follows below.
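In case it helps, a minimal sketch of German sentence splitting plus word tokenization with SoMaJo, following its documented API (the sample sentences are just an illustration):

```python
# Minimal sketch: German sentence splitting + word tokenization with SoMaJo.
# pip install SoMaJo
from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC", split_camel_case=True)

paragraphs = ["Der erste Satz. Und hier der zweite Satz!"]
sentences = tokenizer.tokenize_text(paragraphs)  # generator over sentences
for sentence in sentences:
    # each sentence is a list of Token objects
    print([token.text for token in sentence])
```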
Any sample code for this part here? Especially for German tokenization?