dbpedia-spotlight / pignlproc

Apache Pig utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.

Clean and restructure indexing/ directory #10

Closed chrishokamp closed 11 years ago

chrishokamp commented 11 years ago

This pull request cleans out examples/indexing/ by moving non-core scripts to examples/indexing/experimental.

maxjakob commented 11 years ago

Did you accidentally lose the redirect information for it in the later commit? By the way, this change would probably have made a good separate commit :wink:

maxjakob commented 11 years ago

Do you think we still need *-macros.pig now that #8 is merged?

chrishokamp commented 11 years ago

> Did you accidentally lose the redirect information for it in the later commit?

yes, the redirect info got lost somehow. i'm honestly not sure how that happened. Sorry about the messiness of this request -- going forward, I'll be much more careful about how I commit.

> Do you think we still need *-macros.pig after #8 is merged now?

no, i think #8 fixes it properly -- loader-macros.pig was a stub anyway -- should I fix and submit a new request?

maxjakob commented 11 years ago

No problem. Please either fix the commits in this branch and force push or commit fixes to this branch.

chrishokamp commented 11 years ago

things should be fixed now. any comments?

maxjakob commented 11 years ago

Don't want to be pedantic, but the *-macros.pig files are still there. And since these are too many commits now, it would be good to squash some of them, commit them to the same branch and --force the push to the GitHub remote when done.

chrishokamp commented 11 years ago

Ok, sure -- just to clarify - I should use git rebase -i ... to squash the commits into a more coherent flow, then --force the push to the same remote branch (update).

Sorry for the noob questions.


maxjakob commented 11 years ago

Yes, you got it exactly.
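For readers following along, the squash-and-force-push workflow confirmed above can be sketched as follows. This is only an illustration in a throwaway repository, not the commands actually run on this pull request: the branch name `cleanup-indexing`, the file `notes.txt`, and the commit messages are all made up. It also swaps the interactive editor for `GIT_SEQUENCE_EDITOR` so the script runs unattended; in normal use you would simply run `git rebase -i` and change `pick` to `squash` by hand.

```shell
#!/bin/sh
set -eu

# Throwaway "remote" and working clone (hypothetical paths, created in a temp dir).
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git init -q "$work/clone"
cd "$work/clone"
git config user.name "Example"
git config user.email "example@example.invalid"
git remote add origin "$work/remote.git"

# One base commit, then a topic branch with three small commits.
echo "base" > notes.txt
git add notes.txt
git commit -q -m "base commit"
git checkout -q -b cleanup-indexing
for n in 1 2 3; do
    echo "change $n" >> notes.txt
    git add notes.txt
    git commit -q -m "messy commit $n"
done
git push -q origin cleanup-indexing

# Squash the last three commits into one. The sequence editor rewrites
# lines 2-3 of the rebase todo list from "pick" to "squash"; GIT_EDITOR=true
# accepts the combined commit message without opening an editor.
GIT_SEQUENCE_EDITOR="sed -i.bak -e '2,3s/^pick/squash/'" \
GIT_EDITOR=true \
git rebase -i HEAD~3

# History was rewritten, so the same remote branch must be force-pushed.
git push -q --force origin cleanup-indexing

commit_count=$(git rev-list --count HEAD)
local_head=$(git rev-parse HEAD)
remote_head=$(git --git-dir="$work/remote.git" rev-parse cleanup-indexing)
echo "commits on branch: $commit_count"
```

Because the pull request tracks the branch rather than individual commits, force-pushing the rewritten branch updates the existing request instead of requiring a new one.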

areggiori commented 11 years ago

Hello, I am trying to run the latest pignlproc to generate the Spotlight models and when running:

    pig -m examples/indexing/token_counts.pig.params examples/indexing/token_counts.pig

I get the following error:

     2013-06-21 13:02:14,903 [main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Undefined parameter : MACROS_DIR

I have seen there have been changes to the token_counts.pig script recently including the macros stuff - but the various index_db.sh scripts out there do not use/set MACROS_DIR in any way.

Which version of pignlproc is stable enough to use nowadays?

Thank you.

chrishokamp commented 11 years ago

For now, just edit your *.params files, token_counts.pig.params and names_and_entities.pig.params, adding this line:

    MACROS_DIR=/path/to/pignlproc/examples/macros/

That should fix the problem. We are in the midst of removing duplicate code using some macros.

jodaiber commented 11 years ago

Hey areggiori, Chris,

[1] should be the only 'current' index_db.sh script. If the macros break the indexing process, can you please ensure that you add the fix to [1] (do you have permissions to edit directly, Chris)?

Jo

[1] - https://github.com/dbpedia-spotlight/dbpedia-spotlight/blob/master/bin/index_db.sh

chrishokamp commented 11 years ago

hey Jo, I can't edit directly. Should I submit a pull request with a fix?

areggiori commented 11 years ago

Thank you both.

Temporary fix in my local index_db.sh is the following:

    # Add macros
    echo "MACROS_DIR=examples/macros/" >> examples/indexing/token_counts.pig.params
    echo "MACROS_DIR=examples/macros/" >> examples/indexing/names_and_entities.pig.params

    # Missing things
    echo "MIN_SURFACE_FORM_LENGTH=2" >> examples/indexing/token_counts.pig.params
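One caveat with appending via `>>` is that every run of the script adds the same line again. A minimal sketch of an idempotent variant is below; `ensure_param` is a hypothetical helper (it is not part of pignlproc or index_db.sh), and the demo writes to a temporary file rather than the real params files.

```shell
#!/bin/sh
set -eu

# ensure_param FILE KEY VALUE
# Hypothetical helper: append KEY=VALUE to FILE only if no line in FILE
# already starts with "KEY=". Creates FILE if it does not exist.
ensure_param() {
    file=$1
    key=$2
    value=$3
    if ! grep -q "^${key}=" "$file" 2>/dev/null; then
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Demo against a temporary stand-in for token_counts.pig.params.
demo_dir=$(mktemp -d)
params="$demo_dir/token_counts.pig.params"

ensure_param "$params" MACROS_DIR examples/macros/
ensure_param "$params" MACROS_DIR examples/macros/   # second call is a no-op
ensure_param "$params" MIN_SURFACE_FORM_LENGTH 2

macros_lines=$(grep -c '^MACROS_DIR=' "$params")
total_lines=$(wc -l < "$params" | tr -d ' ')
cat "$params"
```

The sketch assumes the params file ends with a newline; if it does not, the appended entry would be glued onto the last existing line.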

However, now when I run examples/indexing/token_counts.pig I get the following error:

Pig Stack Trace

    ERROR 1000: Error during parsing. Encountered " "20 "" at line 12, column 2. Was expecting one of:
        "cat" ...
        "clear" ...
        "fs" ...
        "sh" ...
        "cd" ...
        "cp" ...
        "copyFromLocal" ...
        ....

It seems my Pig 0.11.1 (r1459641) does not like the underscores (i.e. _id) at:

    -- Get articles (IDs and pairs are not used (and not produced))
    _ids, articles, _pairs = read('$INPUT', '$LANG', $MIN_SURFACE_FORM_LENGTH);

I am not a Pig expert, and I am not sure what's a quick fix to be able to run the index_db.sh workflow easily.

Any hint appreciated.

Cheers

Alberto

maxjakob commented 11 years ago

Thanks for the hint! Leading underscores are now deleted from alias names.
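A quick way to spot any remaining aliases of this kind is to grep the scripts for identifiers that begin with an underscore. The sketch below is only an illustration: the regular expression is an approximation and not a real Pig parser, and it runs against a temporary file containing the two lines quoted earlier in this thread rather than the actual repository.

```shell
#!/bin/sh
set -eu

# Temporary stand-in for a Pig script, using the lines quoted in this thread.
script=$(mktemp)
cat > "$script" <<'EOF'
-- Get articles (IDs and pairs are not used (and not produced))
_ids, articles, _pairs = read('$INPUT', '$LANG', $MIN_SURFACE_FORM_LENGTH);
EOF

# Match an underscore that starts an identifier, i.e. one that is at the
# start of the line or preceded by a character that cannot be part of a name.
# Underscores inside names such as $MIN_SURFACE_FORM_LENGTH are not matched.
pattern='(^|[^A-Za-z0-9_$])_[A-Za-z0-9_]+'

# Show offending lines with line numbers (no match is not an error here).
grep -nE "$pattern" "$script" || true

# Count individual matches rather than matching lines.
underscore_aliases=$(grep -oE "$pattern" "$script" | wc -l | tr -d ' ')
echo "identifiers with a leading underscore: $underscore_aliases"
```

To check a real checkout, the same pattern could be pointed at the scripts under examples/ instead of the temporary file.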

index_db.sh still needs to be fixed. @chrishokamp, pull request sounds great! :smiley:

chrishokamp commented 11 years ago

closing this pull request as it's too big. will add changes to another branch and submit smaller requests.