About Training Data

BLLIP / bllip-parser

BLLIP reranking parser (also known as the Charniak-Johnson parser, Charniak parser, or Brown reranking parser). See http://pypi.python.org/pypi/bllipparser/ for the Python module.

1) One possible point of confusion here is that the DATA directories are parsing models (the numbers are counts, probabilities, etc.), not the actual training data (treebanks). For the included DATA directories, the training data is the Penn Treebank (EN and LM) and the Chinese Treebank (CH). If you have these treebanks, you can add other treebanks to them and then train a combined model.

2) The training script (trainParser) helps construct the parsing model directories (converts the real training trees to the various files inside the model directory). See the READMEs in the first-stage and TRAIN directories for more information. See my answer in #27 for where you can download or license some treebanks.
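
As a rough illustration of the "add other treebanks" step in 1), here is a minimal Python sketch that merges several bracketed treebank files into one file before training. The function names, the file names in the usage note, and the one-tree-per-line output layout are all assumptions made for illustration; check the READMEs in the first-stage and TRAIN directories for the format trainParser actually expects.

```python
from pathlib import Path


def split_trees(text):
    """Split a string of bracketed trees into one string per top-level tree.

    Assumes literal parentheses inside tokens are escaped (for example as
    -LRB- / -RRB-, as in the Penn Treebank), so bracket counting is safe.
    """
    trees, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ')' in treebank text")
            if depth == 0:
                # Collapse internal whitespace so each tree fits on one line.
                trees.append(" ".join(text[start:i + 1].split()))
    if depth != 0:
        raise ValueError("unbalanced '(' in treebank text")
    return trees


def combine_treebanks(input_paths, output_path):
    """Write the trees from several files to one file, one tree per line.

    Returns the number of trees written.
    """
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            text = Path(path).read_text(encoding="utf-8")
            for tree in split_trees(text):
                out.write(tree + "\n")
                count += 1
    return count
```

For example, `combine_treebanks(["ptb-train.mrg", "extra-treebank.mrg"], "combined-train.mrg")` (hypothetical file names) would produce a single file that could then be given to the training script in place of the original training trees.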

Hope this helps -- please let me know if I can clarify anything.

About Training Data #28