Closed jodaiber closed 11 years ago
Just curious: did you also try a replicated join, which loads one bag into memory and streams the other one through? It should be roughly equivalent to using the RestrictedNGramGenerator UDF.
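For anyone unfamiliar with the idea, here is a minimal sketch of what a replicated (map-side) join does, written in plain Python rather than Pig. In Pig itself this is requested with `USING 'replicated'` on the JOIN. The function name `replicated_join` and the sample data are hypothetical, chosen only for illustration. They are not taken from names_and_entities.pig.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Iterator, Tuple


def replicated_join(
    small: Iterable[Tuple[Hashable, object]],
    large: Iterable[Tuple[Hashable, object]],
) -> Iterator[Tuple[Hashable, object, object]]:
    """Inner join that holds only the small input in memory."""
    # Build an in-memory index of the small input, keyed by the join key.
    index = defaultdict(list)
    for key, value in small:
        index[key].append(value)
    # Stream the large input one record at a time; it is never materialized.
    for key, value in large:
        for small_value in index.get(key, ()):
            yield key, value, small_value


# Hypothetical data: surface forms with entities (small side),
# n-grams with counts (large, streamed side).
surface_forms = [("berlin", "Berlin"), ("paris", "Paris"), ("paris", "Paris_Hilton")]
ngrams = iter([("berlin", 3), ("the", 120), ("paris", 7)])

joined = list(replicated_join(surface_forms, ngrams))
```

Only n-grams that match a known surface form survive the join, which is why it should behave much like a filter restricted to the surface forms, provided the small side fits in memory.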
This is not necessary anymore, as it works rather well with the EXEC version.
Because we write out the surface forms, which are then put into the distributed cache to collect ngrams, names_and_entities.pig has to use either EXEC or -no_multiquery. Running without multiquery execution makes the script slower. The script should be split into two scripts: 1. collect and store the surface forms; 2. perform the rest of the tasks.