Alina-enni / lingdiggers

Project for the Building NLP Applications course

Step 6 #14

Closed Alina-enni closed 2 years ago

Alina-enni commented 2 years ago
  1. If you like, you can use other data, for instance a Wikipedia dump for another language such as Finnish: https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/. First download a Wikipedia XML file and uncompress it with bunzip2. Then convert the XML to plain text with the Perl script xml2txt.pl, which is available for download on the same web page. Use the option -articles to preserve the article tags: `perl xml2txt.pl -articles INPUT_FILE.xml OUTPUT_FILE.txt`.
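
The uncompress and convert steps could be scripted roughly as follows. This is only a sketch: it uses Python's `bz2` module in place of the `bunzip2` command, and it assumes that `perl` is on the PATH and that `xml2txt.pl` has already been downloaded into the working directory. The function names and the file name in the usage note are hypothetical, not part of the course material.

```python
import bz2
import shutil
import subprocess
from pathlib import Path
from typing import List


def decompress_bz2(src: Path, dst: Path) -> Path:
    """Stream-decompress a .bz2 file to dst (stands in for bunzip2)."""
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # copies in chunks, so large dumps fit in memory
    return dst


def xml2txt_command(xml_path: Path, txt_path: Path,
                    script: Path = Path("xml2txt.pl")) -> List[str]:
    """Build the command from the instructions; -articles keeps the article tags."""
    return ["perl", str(script), "-articles", str(xml_path), str(txt_path)]


def convert(archive: Path, script: Path = Path("xml2txt.pl")) -> Path:
    """Uncompress e.g. NAME.xml.bz2 to NAME.xml, then convert it to NAME.txt."""
    xml_path = archive.with_suffix("")       # drop the trailing .bz2
    txt_path = xml_path.with_suffix(".txt")  # NAME.xml -> NAME.txt
    decompress_bz2(archive, xml_path)
    # Assumption: perl is installed and the script path is valid.
    subprocess.run(xml2txt_command(xml_path, txt_path, script), check=True)
    return txt_path
```

With a downloaded archive in the current directory, calling `convert(Path("INPUT_FILE.xml.bz2"))` would run both steps and return the path of the resulting text file.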