dbpedia-spotlight / pignlproc

Apache Pig utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.

De-duplication using macros #8

Closed · maxjakob closed this 11 years ago

maxjakob commented 11 years ago

There is a lot of duplicated Pig Latin code in nerd-stats.pig, names_and_entities.pig, names_and_entities_low_memory.pig and a few other scripts.

This PR introduces macros to de-duplicate, along with a UDF to normalize URIs.
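To illustrate the general shape (not necessarily the exact code in this PR), such a URI-normalizing UDF can be a plain Pig EvalFunc. The class name and the normalization rules (percent-decoding, spaces to underscores) in this sketch are assumptions:

```java
import java.io.IOException;
import java.net.URLDecoder;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical URI-normalizing UDF; the name and the rules are assumptions,
// not necessarily what this PR implements.
public class NormalizeUri extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String uri = input.get(0).toString();
        // Decode percent-encoded titles and use underscores, following the
        // Wikipedia/DBpedia page-name convention.
        // UnsupportedEncodingException is an IOException, so it can propagate.
        return URLDecoder.decode(uri, "UTF-8").trim().replace(' ', '_');
    }
}
```

In the Pig scripts it would be registered with REGISTER/DEFINE and called like any other EvalFunc, so the normalization logic is not copied between scripts.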

names_and_entities_low_memory.pig is deleted for now; line 35 or 36 in names_and_entities.pig can be commented out or in instead. We could bring the separate script back if needed.

Also updates maven-assembly-plugin.

Fixes #5.

maxjakob commented 11 years ago

@jodaiber and @chrishokamp, please have a quick look and comment on which other scripts could benefit from this or similar re-arranging. There are a lot of scripts in indexing/ whose purpose I don't know. Maybe we can also add some more comments at the top of the scripts.

jodaiber commented 11 years ago

Otherwise looks great. The statistical backend only uses:

I don't know about the other scripts. Maybe we could move them to an "experimental" folder.

maxjakob commented 11 years ago

A test fails now: OpenNLP's SimpleTokenizer separates "That's" into three tokens, while LanguageIndependentStringTokenizer does not.
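For reference, the SimpleTokenizer half of the discrepancy is easy to reproduce with plain OpenNLP (the LanguageIndependentStringTokenizer side lives in Spotlight and is not shown here):

```java
import java.util.Arrays;

import opennlp.tools.tokenize.SimpleTokenizer;

public class SimpleTokenizerCheck {
    public static void main(String[] args) {
        // SimpleTokenizer splits on character-class boundaries,
        // so the apostrophe becomes its own token.
        String[] tokens = SimpleTokenizer.INSTANCE.tokenize("That's");
        // Prints [That, ', s] -- three tokens.
        System.out.println(Arrays.toString(tokens));
    }
}
```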

jodaiber commented 11 years ago

Hm, I don't know how much sense these tests make now, given that the tokenizers are more or less interchangeable as long as they are based on model.Tokenizer (as opposed to always being SimpleTokenizer). This is probably something that should rather be tested in Spotlight core.

maxjakob commented 11 years ago

Maybe we should write some tests in the Spotlight core then. This is definitely an issue with texts like

Controversy plagues Obama's administration.

if we don't have "Obama's" in the dictionary.

jodaiber commented 11 years ago

True. The supervised OpenNLP tokenizer will handle that correctly, so in the English version at run-time it should be tokenized as "Obama 's". That's the reason we have to ensure we use the same tokenizers at run-time and in the Pig scripts for determining c(sf) and c(annotated | sf); the first French model I created was barely usable because of that. Good that you found the bug with the OpenNLP models.
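A minimal sketch of that run-time behaviour, assuming an English token model is available (the path and the exact splits depend on the model used):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class MaxentTokenizerCheck {
    public static void main(String[] args) throws Exception {
        // The model path is an assumption; point it at whatever English token model is used.
        try (InputStream in = new FileInputStream("en-token.bin")) {
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(in));
            String[] tokens = tokenizer.tokenize("Controversy plagues Obama's administration.");
            // With an English maxent model this is expected to split the possessive,
            // e.g. [Controversy, plagues, Obama, 's, administration, .]
            System.out.println(Arrays.toString(tokens));
        }
    }
}
```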

maxjakob commented 11 years ago

Fixed the test here by using the English tokenizer model that was already in resources/.