emorynlp / nlp4j

NLP framework for JVM languages.
http://emorynlp.github.io/nlp4j/

More questions about training data #31

Open komelianchuk opened 6 years ago

komelianchuk commented 6 years ago

Hi. Thank you for your amazing project. I'm trying to retrain the NER model and would like to clarify a few points that are not clear to me:

  1. I'm curious about the size of the named entity gazetteers and whether this data can be expanded. In the paper you mention that the named entity gazetteers were collected from DBpedia. Could you describe how you collected this data, and how large it is?
  2. Am I right that you use only OntoNotes for training NER (apart from the lexica, of course)?
  3. Here, you use files like "known_corporations.txt", "known_countries.txt", "known_currencies.txt", etc. Could you point me to where this data comes from?

Sorry if GitHub is not the best place for my questions, but I hope your answers will help others as well.

OmarSRA commented 5 years ago

@komelianchuk Hi! We are trying to retrain NLP4J and were wondering whether you were able to obtain this data from DBpedia. Were you able to get similar results using the OntoNotes dataset?

Thanks!

Omar