dbpedia-spotlight / pignlproc

Apache Pig utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.

Split names_and_entities.pig into two scripts #3

Closed jodaiber closed 11 years ago

jodaiber commented 11 years ago

Because we write out the surface forms, which are then put into the distributed cache to collect ngrams, names_and_entities.pig has to use either EXEC or -no_multiquery. Running without multiquery execution makes the script slower. The script should be split into two scripts: (1) collect and store the surface forms; (2) perform the rest of the tasks.

maxjakob commented 11 years ago

Just curious: did you also try a replicated join, which loads one bag into memory and streams the other one through? It should be roughly equivalent to using the RestrictedNGramGenerator UDF.
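
To illustrate the idea of a replicated join outside of Pig, here is a minimal Python sketch. It is not pignlproc code: the function name `replicated_join` and the sample records are hypothetical, and it assumes the smaller relation (here, the surface forms) fits in memory. In Pig Latin itself, this kind of join is requested with `JOIN ... USING 'replicated'`.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple


def replicated_join(
    small: Iterable[Tuple[Hashable, object]],
    large: Iterable[Tuple[Hashable, object]],
) -> Iterator[Tuple[Hashable, object, object]]:
    """Join two (key, value) relations by holding `small` in memory.

    Hypothetical helper for illustration only: `small` is loaded into a
    hash index once, then `large` is streamed through it record by record.
    """
    # Build the in-memory index from the small relation.
    index: Dict[Hashable, List[object]] = defaultdict(list)
    for key, value in small:
        index[key].append(value)

    # Stream the large relation; emit one row per matching small-side value.
    for key, value in large:
        for match in index.get(key, ()):
            yield key, value, match


# Hypothetical sample data: surface forms (small) and ngram counts (large).
surface_forms = [("berlin", "Berlin"), ("paris", "Paris")]
ngrams = [("berlin", 1), ("is", 1), ("paris", 2)]

joined = list(replicated_join(surface_forms, ngrams))
```

Only ngrams whose key also appears among the surface forms survive the join, which is the sense in which it resembles restricting ngram generation to known surface forms.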

jodaiber commented 11 years ago

This is not necessary anymore, as it works rather well with the EXEC version.