datquocnguyen / RDRPOSTagger

A fast and accurate POS and morphological tagging toolkit (EACL 2014)
http://rdrpostagger.sourceforge.net

Creating a lexicon #18

senisioi closed this issue 6 years ago

senisioi commented 6 years ago

Could you please add a few lines on how to use the LexiconCreator.py script? I am not sure which parameters to pass to the createLexicon function. Is corpusFilePath the path to the Universal Dependencies file? What about fullLexicon?

datquocnguyen commented 6 years ago

Hi, some input examples are in the data folder. In particular, each line of the input training corpus (i.e. corpusFilePath) is a sequence of WORD/TAG pairs separated by whitespace characters. The fullLexicon parameter specifies whether to output a full lexicon, which contains all word types, or a smaller lexicon, which excludes word types that appear only once in the input training corpus. E.g.:

createLexicon("../data/goldTrain", 'full')

createLexicon("../data/goldTrain", 'short')
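To make the input format and the full/short distinction concrete, here is a minimal, self-contained sketch. It is not the code in LexiconCreator.py: the function name build_lexicon, the sample sentences, splitting each pair on its last "/", and mapping each word to its most frequent tag are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_lexicon(lines, full=True):
    """Hypothetical helper, not part of RDRPOSTagger.

    Each line is a sequence of WORD/TAG pairs separated by whitespace.
    Returns a dict mapping each word type to its most frequent tag
    (an assumption about what a lexicon entry holds).
    """
    tag_counts = defaultdict(Counter)
    for line in lines:
        for pair in line.split():
            # Assumption: split on the LAST "/" so that words which
            # themselves contain "/" (e.g. "1/2/CD") keep their slash.
            word, sep, tag = pair.rpartition("/")
            if not sep or not word or not tag:
                continue  # skip tokens that are not WORD/TAG pairs
            tag_counts[word][tag] += 1

    lexicon = {}
    for word, counts in tag_counts.items():
        # The short lexicon drops word types seen only once in the corpus.
        if not full and sum(counts.values()) == 1:
            continue
        lexicon[word] = counts.most_common(1)[0][0]
    return lexicon


# Made-up two-sentence corpus in the WORD/TAG format described above.
corpus = [
    "The/DT dog/NN barks/VBZ",
    "The/DT cat/NN sleeps/VBZ",
]

full_lexicon = build_lexicon(corpus, full=True)
short_lexicon = build_lexicon(corpus, full=False)
```

With this toy corpus, the full lexicon keeps all five word types, while the short one keeps only "The", the single word type that occurs more than once.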