Exaphis / HackQ-Trivia

Yet another HQ Trivia bot. Automatically scrapes HQ Trivia questions without OCR and answers them.
MIT License
89 stars, 54 forks

NN parse error #109

Closed: talentoscope closed this issue 6 years ago

talentoscope commented 6 years ago

The find_nouns function in search.py appears to be slightly broken. NN and NNP tags are found, but not all of the tagged words end up in the result, and I'm not sure why.

Example question:

Barney Stinson is a character from which TV show?
['Arrow', 'Community', 'How I Met Your Mother']

Searching
['Arrow', 'Community', 'How I Met Your Mother']
['barney', 'stinson', 'character', 'tv', 'show']
['https://en.wikipedia.org/wiki/Barney_Stinson', 'https://en.wikipedia.org/wiki/How_I_Met_Your_Mother', 'https://coreydemoss.wordpress.com/2011/03/24/tvs-top-10-characters-no-5-barney-stinson/', 'https://www.quora.com/How-I-Met-Your-Mother-TV-series-Is-there-a-character-like-Barney-Stinson-for-real', 'https://www.imdb.com/title/tt0460649/']
Running method 1
{'arrow': 0, 'community': 5, 'how i met your mother': 0}
community

[('Barney', 'NNP'), ('Stinson', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('character', 'NN'), ('from', 'IN'), ('which', 'WDT'), ('TV', 'NN'), ('show', 'NN'), ('?', '.')]
Question nouns: ['character', 'barney stinson']
Running method 3
Search processed
URLs fetched

Arrow: {'character': 46, 'barney stinson': 0}
Community: {'character': 14, 'barney stinson': 0}
How I Met Your Mother: {'character': 7, 'barney stinson': 14}

Keyword scores: {'Arrow': 302, 'Community': 166, 'How I Met Your Mother': 247}
Noun scores: {'Arrow': 46, 'Community': 14, 'How I Met Your Mother': 21}
Arrow
Search took 6.013584613800049 seconds

Here, "TV show" should be another entry in the question_key_nouns list, since "TV" and "show" are two consecutive NN tags, but it is missing. NNS tags should probably be included too, or really every NN* tag, since there are four POS tags for nouns (NN, NNS, NNP, NNPS), unless this elif:

elif "NN" in tag_type or "NNP" in tag_type:

includes them already.
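To make the expected behavior concrete, here is a minimal sketch of grouping consecutive noun tags into phrases. It is not the repository's find_nouns. group_consecutive_nouns is a hypothetical helper, and the tagged list is copied from the log above.

```python
from itertools import groupby

# POS-tagged question, copied from the log output above.
tagged = [('Barney', 'NNP'), ('Stinson', 'NNP'), ('is', 'VBZ'), ('a', 'DT'),
          ('character', 'NN'), ('from', 'IN'), ('which', 'WDT'),
          ('TV', 'NN'), ('show', 'NN'), ('?', '.')]


def group_consecutive_nouns(tagged_words):
    """Join each run of adjacent NN* tokens into one lowercase phrase.

    Hypothetical helper for illustration; not taken from search.py.
    """
    phrases = []
    # groupby splits the sequence into runs of noun / non-noun tokens.
    for is_noun, run in groupby(tagged_words, key=lambda pair: pair[1].startswith("NN")):
        if is_noun:
            phrases.append(" ".join(word.lower() for word, _ in run))
    return phrases


print(group_consecutive_nouns(tagged))
# ['barney stinson', 'character', 'tv show']
```

With this grouping, "tv show" shows up next to "barney stinson" and "character". Whether it should then be searched for is a separate decision.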

Exaphis commented 6 years ago

It didn't include "TV show" because find_nouns only takes a certain number of nouns from the opposite side of the question word. I don't know why method 1 failed to detect "how i met your mother", because it worked on my machine. For method 3, you would have to determine yourself that "barney stinson" is the more important noun.
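The method 3 point can be checked against the counts in the log. The noun scores printed there equal the plain sum of the per-noun counts, so the generic noun "character" outweighs "barney stinson". The sketch below assumes that summing is how the scores are combined. The log is consistent with that, but the repository code is not shown here. noun_scores_for is a hypothetical name.

```python
# Per-answer noun counts, copied from the log output above.
counts = {
    "Arrow": {"character": 46, "barney stinson": 0},
    "Community": {"character": 14, "barney stinson": 0},
    "How I Met Your Mother": {"character": 7, "barney stinson": 14},
}


def noun_scores_for(counts, nouns=None):
    """Sum the counts per answer, optionally restricted to selected nouns.

    Hypothetical helper; assumes scores are plain sums of counts.
    """
    return {
        answer: sum(n for noun, n in per_noun.items() if nouns is None or noun in nouns)
        for answer, per_noun in counts.items()
    }


all_nouns = noun_scores_for(counts)
only_name = noun_scores_for(counts, nouns={"barney stinson"})

print(all_nouns, max(all_nouns, key=all_nouns.get))
print(only_name, max(only_name, key=only_name.get))
```

Summing over both nouns reproduces the logged noun scores and picks "Arrow". Restricting the sum to "barney stinson" picks "How I Met Your Mother".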

You might not have the latest version of the repository, because that line of code no longer exists. "NN" in tag_type matches all noun tags, since NN, NNS, NNP, and NNPS all contain the substring "NN".
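A quick way to confirm the substring behavior is to run the check over the Penn Treebank word-level tag set. This is only an illustration. is_noun_tag is a hypothetical name, and the tag list is written out by hand rather than read from NLTK.

```python
# The 36 Penn Treebank word-level POS tags (punctuation tags omitted).
PENN_TAGS = [
    "CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD",
    "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR",
    "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP",
    "VBZ", "WDT", "WP", "WP$", "WRB",
]


def is_noun_tag(tag_type):
    """Same test as the condition discussed above: a substring check."""
    return "NN" in tag_type


matched = [tag for tag in PENN_TAGS if is_noun_tag(tag)]
print(matched)
# ['NN', 'NNS', 'NNP', 'NNPS']
```

Only the four noun tags match, so the extra "NNP" in tag_type test in the older elif was redundant.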