Closed maxjakob closed 11 years ago
@jodaiber and @chrishokamp, please have a quick look and comment on which other scripts could benefit from this or a similar rearrangement. There are a lot of scripts in indexing/ whose purpose I don't know. Maybe we can also put some more comments at the top of the scripts.
Otherwise looks great. The statistical backend only uses:
I don't know about the other scripts. Maybe we could move them to an "experimental" folder.
A test fails now:
OpenNLP's SimpleTokenizer does separate "That's" into 3 tokens, LanguageIndependentStringTokenizer does not.
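To make the discrepancy concrete, here is a minimal Python sketch of the two behaviours. It is not the OpenNLP or Spotlight code: `char_class_tokenize` is a simplified, hypothetical stand-in that splits wherever the character class changes (roughly what SimpleTokenizer does), and `whitespace_tokenize` is a hypothetical stand-in for a tokenizer that leaves "That's" intact.

```python
import re


def char_class_tokenize(text):
    """Split where the character class changes (letters, digits, other).

    Simplified stand-in, not the OpenNLP implementation: every
    punctuation character becomes its own token, whitespace is dropped.
    """
    return re.findall(r"[^\W\d_]+|\d+|[^\w\s]|_", text)


def whitespace_tokenize(text):
    """Hypothetical tokenizer that only splits on whitespace."""
    return text.split()


if __name__ == "__main__":
    print(char_class_tokenize("That's"))   # three tokens
    print(whitespace_tokenize("That's"))   # one token
```

A test that hard-codes the expected token count for "That's" will pass with one of these and fail with the other, which is the failure seen here.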
Hm, I don't know how much sense these tests make now, given that the tokenizers are more or less exchangeable assuming they are based on model.Tokenizer (as opposed to always being SimpleTokenizer). This is probably rather something that should be tested in spotlight core.
Maybe we should write some tests in the Spotlight core then. This is definitely an issue with texts like "Controversy plagues Obama's administration." if we don't have "Obama's" in the dictionary.
True. The supervised OpenNLP tokenizer will handle that correctly, so in the English version at run-time it should be tokenized as "Obama 's". That's the reason we have to ensure we use the same tokenizers at run-time and in the pig scripts for determining c(sf) and c(annotated | sf); the first French model I created was barely useable because of that. Good that you found the bug with the OpenNLP models.
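The effect of a tokenizer mismatch on the counts can be sketched in a few lines of Python. Everything here is illustrative: `split_clitics` is a hypothetical stand-in for a tokenizer that separates the possessive, `whitespace_only` stands in for a different tokenizer used at indexing time, and `count_tokens` is a toy replacement for the Pig counting jobs.

```python
import re
from collections import Counter


def split_clitics(text):
    """Hypothetical run-time tokenizer: "Obama's" -> "Obama", "'s"."""
    return re.findall(r"'s\b|[^\W\d_]+|\d+|[^\w\s]", text)


def whitespace_only(text):
    """Hypothetical indexing-time tokenizer that only splits on whitespace."""
    return text.split()


def count_tokens(texts, tokenize):
    """Toy stand-in for the counting step: token -> occurrence count."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


corpus = ["Controversy plagues Obama's administration."]

# Counts built with a different tokenizer than the one used at run time.
mismatched = count_tokens(corpus, whitespace_only)
# Counts built with the same tokenizer that is used at run time.
matched = count_tokens(corpus, split_clitics)

runtime_tokens = split_clitics("Obama's administration")

if __name__ == "__main__":
    for token in runtime_tokens:
        print(token, mismatched[token], matched[token])
```

With the mismatched counts, the run-time token "Obama" is never found because the index only knows "Obama's", so any statistic derived from those counts is off for that surface form.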
Fixed the test in here by using the English tokenizer model that was already in resources/.
There is a lot of duplicated Pig Latin code in nerd-stats.pig, names_and_entities.pig, names_and_entities_low_memory.pig and some more. This PR introduces macros to de-duplicate, along with a UDF to normalize URIs.
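As an illustration of what URI normalization could involve, here is a minimal Python sketch. It is an assumption, not the UDF from this PR: the function name `normalize_uri`, the default prefix, and the chosen rules (percent-decoding, spaces to underscores, upper-casing the first character, as in Wikipedia-style page names) are all hypothetical.

```python
from urllib.parse import unquote


def normalize_uri(uri, prefix="http://dbpedia.org/resource/"):
    """Hypothetical normalization; the PR's actual UDF may differ.

    Strips the resource prefix if present, decodes percent-escapes,
    replaces spaces with underscores and upper-cases the first character.
    """
    name = uri[len(prefix):] if uri.startswith(prefix) else uri
    name = unquote(name).strip().replace(" ", "_")
    if name:
        name = name[0].upper() + name[1:]
    return name


if __name__ == "__main__":
    print(normalize_uri("http://dbpedia.org/resource/Barack%20Obama"))
    print(normalize_uri("barack obama"))
```

The point of doing this in one shared function is that every script joins on the same form of the URI, instead of each script carrying its own slightly different clean-up code.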
names_and_entities_low_memory.pig is deleted for now. Line 35 or 36 in names_and_entities.pig can be commented out/in. We could bring the separate script back if needed. Also updates maven-assembly-plugin.
Fixes #5.