dbpedia-spotlight / pignlproc

Apache Pig utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.

Train/Test split #2

Closed: jodaiber closed this issue 11 years ago

jodaiber commented 11 years ago

Create a script that splits the input corpus into training/test before it is processed by pignlproc.

jodaiber commented 11 years ago

Added https://github.com/dbpedia-spotlight/pignlproc/blob/master/utilities/split_train_test.py

This script removes n percent of the annotated paragraphs from the MediaWiki dump and writes them to a test file. The remaining dump, which becomes the training file, is still valid XML.

$ bzcat nlwiki-latest-pages-articles.xml.bz2 | python split_train_test.py 0.05 nl_test.txt > nl_train.xml
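
For readers who want to see the general idea, here is a minimal sketch of such a hold-out split. It is not the code in split_train_test.py. The function names, the `[[...]]` heuristic for what counts as an "annotated paragraph", and the choice to leave every line containing XML markup in the training output are all assumptions made for illustration.

```python
import random
import re

# Assumption: a paragraph counts as "annotated" if it has at least one [[wiki link]].
WIKI_LINK = re.compile(r"\[\[[^\[\]]+\]\]")


def is_annotated_paragraph(line):
    """Hypothetical heuristic: a line without XML markup that contains a wiki link."""
    return "<" not in line and WIKI_LINK.search(line) is not None


def split_train_test(lines, test_fraction, rng):
    """Move roughly `test_fraction` of the annotated paragraphs to the test set.

    Lines with XML markup are never moved, so the training output keeps
    the structure of the original dump.
    """
    train, test = [], []
    for line in lines:
        if is_annotated_paragraph(line) and rng.random() < test_fraction:
            test.append(line)
        else:
            train.append(line)
    return train, test


if __name__ == "__main__":
    sample = [
        "<page>",
        "Amsterdam is de hoofdstad van [[Nederland]].",
        "Een alinea zonder links.",
        "</page>",
    ]
    train, test = split_train_test(sample, 0.5, random.Random(42))
    print("train:", train)
    print("test:", test)
```

Passing a seeded `random.Random` makes the split reproducible, which helps when the same test set has to be regenerated later.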