fhamborg / Giveme5W1H

Extraction of the journalistic five W and one H questions (5W1H) from news articles: who did what, when, where, why, and how?
Apache License 2.0
512 stars 89 forks

What preprocessing is needed before feeding the text to the library? #56

Open MarwaEssam opened 4 years ago

MarwaEssam commented 4 years ago

Here is a link to a news article I am processing: https://www.washingtonpost.com/news/animalia/wp/2017/07/10/teen-camper-wakes-up-to-crunching-noise-and-discovers-his-head-is-inside-bears-mouth/

And this is the result I got from the library after giving it the paragraph text (without any preprocessing), using the following code:

    text = "Teen camper wak........."
    title = "Teen camper......."
    lead = " Asleep in the mountains....."
    date_publish = '2017-07-10 16:17:00'
    doc = Document(title, lead, text, date_publish)
    doc = extractor.parse(doc)

Here are the results I got for the top answers:

Who --> Teen camper , 1.0 (Dylan 0.9077324478178369)
What --> wakes up to ‘ crunching noise ’ , 1.0
When --> A day later , 0.8240795304744271
Where --> Boulder , Colo. , 0.6813391706278147
Why --> Teen camper , 0.5860000000000001
How --> Asleep in the mountains northwest of Boulder , Colo. , , 1.0

Could you please guide me on how to improve this result? What preprocessing is needed? Are there any parameters I can tune? And what about the enhancer? I tried to use it as in the example, but there is no enhancer package in the code.

Versions: the latest
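To make the preprocessing question concrete, here is a minimal sketch of the kind of cleanup one might try before building the Document. It is an assumption, not something the library is known to require: the helper name clean_article is hypothetical, and it assumes the first scraped paragraph is the headline and the second is the lead. Whether this changes the extraction quality is untested.

```python
import re
import unicodedata

# Map typographic quotes to plain ASCII quotes (an assumed, optional step).
QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def clean_article(raw_paragraphs):
    """Return (title, lead, text) from a list of scraped paragraph strings.

    Hypothetical helper: assumes paragraph 1 is the headline and
    paragraph 2 is the lead; everything else becomes the body text.
    """
    cleaned = []
    for paragraph in raw_paragraphs:
        # NFKC folds compatibility characters, e.g. non-breaking spaces.
        paragraph = unicodedata.normalize("NFKC", paragraph)
        paragraph = paragraph.translate(QUOTES)
        # Collapse runs of whitespace (tabs, newlines, double spaces).
        paragraph = re.sub(r"\s+", " ", paragraph).strip()
        if paragraph:  # drop empty paragraphs left over from scraping
            cleaned.append(paragraph)

    if len(cleaned) < 2:
        raise ValueError("need at least a title and a lead paragraph")

    title, lead = cleaned[0], cleaned[1]
    text = " ".join(cleaned[2:])
    return title, lead, text
```

The three returned strings could then be passed as the title, lead, and text arguments in the code above.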

I am trying to match documents based on the events they mention (event-based linking)
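As an illustration of what event-based linking could look like once the top answers are available, here is a rough sketch that compares two documents by the token overlap of their 5W1H answers. Everything in it is an assumption for illustration: the function names are hypothetical, the answers are plain dictionaries rather than objects from the library, and Jaccard overlap is only one possible similarity measure.

```python
import re

QUESTIONS = ("who", "what", "when", "where", "why", "how")


def tokens(answer):
    """Lowercase an answer string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", answer.lower()))


def jaccard(a, b):
    """Jaccard overlap of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def event_similarity(answers_a, answers_b, questions=QUESTIONS):
    """Average per-question overlap between two answer dictionaries.

    Questions that neither document answers are skipped, so missing
    answers on both sides do not lower the score.
    """
    scores = []
    for question in questions:
        a = tokens(answers_a.get(question, ""))
        b = tokens(answers_b.get(question, ""))
        if a or b:
            scores.append(jaccard(a, b))
    return sum(scores) / len(scores) if scores else 0.0


# Example input: top answers copied by hand into plain dictionaries.
doc_a = {"who": "Teen camper", "where": "Boulder , Colo."}
doc_b = {"who": "teen camper Dylan", "where": "Boulder"}
similarity = event_similarity(doc_a, doc_b)
```

Two documents could then be linked when the similarity exceeds a chosen threshold; the threshold itself would have to be tuned on real data.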

MarwaEssam commented 4 years ago

What enhancer are you referring to and where is this example?

The one in this file: parse_documents_with_enhancer.py (see the code in the library).