dbpedia / fact-extractor

Fact Extraction from Wikipedia Text

Extract the raw text corpus from your mother tongue Wikipedia chapter #1

Closed marfox closed 9 years ago

marfox commented 9 years ago

Run step 1.i of the workflow, as per the README. You should review the related script (especially these lines), polish it and update it accordingly.
Heads-up! Make the script more robust by letting the user provide the language of the Wikipedia dump as a command-line argument (for example, "it" or "italian" for Italian).
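A minimal sketch of what such an argument could look like. The function names (normalize_lang, dump_url) and the small language table are hypothetical, and the URL pattern assumes the usual layout of the "latest" dumps on dumps.wikimedia.org; none of this is taken from the actual script.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the language table are illustrative assumptions.

# Map a language name or code to a Wikipedia language code.
normalize_lang() {
    local lang
    # Lowercase the input so "Italian" and "IT" are accepted too.
    lang=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$lang" in
        it|italian) echo "it" ;;
        en|english) echo "en" ;;
        de|german)  echo "de" ;;
        *) echo "Unsupported language: $1" >&2; return 1 ;;
    esac
}

# Build the URL of the latest articles dump for a language code
# (assumed URL layout, see the note above).
dump_url() {
    local code="$1"
    echo "https://dumps.wikimedia.org/${code}wiki/latest/${code}wiki-latest-pages-articles.xml.bz2"
}

# When a language is given on the command line, print the matching dump URL.
if [ -n "${1:-}" ]; then
    code=$(normalize_lang "$1") || exit 1
    dump_url "$code"
fi
```

Saved under a hypothetical name such as get_dump_url.sh, it could be called as "bash get_dump_url.sh italian" and the printed URL passed on to wget or curl.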

kkasunperera commented 9 years ago

@marfox I have opened a pull request for this issue, please check it: https://github.com/dbpedia/fact-extractor/pull/4

marfox commented 9 years ago

Given the language, the script should pick both the right Wikipedia dump and the right TreeTagger instance.
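One way to keep the two choices in sync is a single lookup keyed on the language code. The sketch below is an assumption about how that could be arranged, not the project's code: treetagger_cmd and select_resources are hypothetical names, the tagger names follow the wrapper scripts shipped with TreeTagger (tree-tagger-italian and so on), and the dump URL layout is assumed.

```shell
#!/usr/bin/env bash
# Sketch only: the mapping below is an assumption, not the project's configuration.

# Print the TreeTagger wrapper script name for a Wikipedia language code.
treetagger_cmd() {
    case "$1" in
        it) echo "tree-tagger-italian" ;;
        en) echo "tree-tagger-english" ;;
        de) echo "tree-tagger-german" ;;
        *) echo "No TreeTagger instance configured for: $1" >&2; return 1 ;;
    esac
}

# Print "<dump URL><TAB><tagger command>" for a language code.
# Fails without printing anything if the language has no tagger.
select_resources() {
    local code="$1" tagger
    tagger=$(treetagger_cmd "$code") || return 1
    printf '%s\t%s\n' \
        "https://dumps.wikimedia.org/${code}wiki/latest/${code}wiki-latest-pages-articles.xml.bz2" \
        "$tagger"
}
```

Failing early on an unknown language avoids downloading a dump that no tagger can process afterwards.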

kkasunperera commented 9 years ago

@marfox I get the following error when running extract_verbs.sh, at this line: "cat extracted// | csplit --suppress-matched -z -f 'corpus/doc' - '//' {*}" (https://github.com/dbpedia/fact-extractor/blob/master/extract_verbs.sh#L11)

Error: csplit: unrecognized option '--suppress-matched'
Try `csplit --help' for more information.

I couldn't find a "--suppress-matched" option in the documentation. Can you give me a hint on how to resolve this?

marfox commented 9 years ago

You need csplit from GNU coreutils 8.23 or later for that option.
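For systems where upgrading coreutils is not practical, a rough sketch of a feature check plus an awk fallback follows. Both function names are hypothetical, and the fallback only approximates "csplit --suppress-matched -z": it splits standard input at lines matching a regular expression, drops those lines, and skips empty chunks.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# Succeed if the installed csplit lists --suppress-matched in its help text.
# Non-GNU csplit has no --help, so the check simply fails there.
csplit_has_suppress_matched() {
    csplit --help 2>/dev/null | grep -q -- '--suppress-matched'
}

# Fallback: split stdin into <prefix>00, <prefix>01, ... at lines matching
# the regex in $1, dropping the matching lines. $2 is the output prefix.
# Note: awk processes backslash escapes in -v values, so keep the regex simple.
split_dropping_matches() {
    awk -v re="$1" -v prefix="$2" '
        BEGIN { n = 0; wrote = 0 }
        # Delimiter line: finish the current chunk if it has content, then skip the line.
        $0 ~ re { if (wrote) { close(out); n++; wrote = 0 } next }
        # Regular line: append it to the current chunk file.
        { out = sprintf("%s%02d", prefix, n); print > out; wrote = 1 }
    '
}
```

A script could call csplit_has_suppress_matched first and use split_dropping_matches only when the check fails, so that machines with a recent coreutils keep the original code path.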
