Improve question word extraction

ProjetPP / PPP-QuestionParsing-Grammatical

Question Parsing module for the PPP using a grammatical approach

GNU Affero General Public License v3.0


Improve question word extraction #19

Closed by yhamoudi 10 years ago

yhamoudi commented 10 years ago

The algorithm used to identify the question words (1 or 2 words) is in questionIdentify.py.

The current algorithm is too strict about its input. If you write "WHo are you" or "who are you" instead of "Who are you", it fails to identify the question word.

Improvements:

not be case-sensitive
not be space-sensitive (I don't know whether the Stanford parser always removes all spaces)

(and normalize questionWord so that it contains only lowercase words, e.g. replace Who with who)

Must be fixed for midterm (it's not difficult and it's very useful)
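
A minimal sketch of the two improvements listed above, assuming the question words are stored in lowercase. The names QUESTION_WORDS and identify_question_word are hypothetical and the word list is only an illustrative subset; this is not the actual content of questionIdentify.py.

```python
# Hypothetical, illustrative subset of question words, all lowercase.
QUESTION_WORDS = {"who", "what", "where", "when", "how many", "how much", "is", "are"}


def identify_question_word(sentence):
    """Return the leading question word (1 or 2 words) in lowercase, or None."""
    # split() with no argument discards leading, trailing and repeated whitespace.
    words = sentence.lower().split()
    # Try the two-word form first so that "how many" wins over a shorter match.
    for length in (2, 1):
        candidate = " ".join(words[:length])
        if len(words) >= length and candidate in QUESTION_WORDS:
            return candidate
    return None
```

Because the input is lowercased before the lookup, the returned question word is always lowercase, which also covers the normalization of questionWord requested above.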

progval commented 10 years ago

This might help you: https://github.com/ProjetPP/PPP-NLP-classical/commit/aabb77e31da7f94b36c6d4c01d511e86d50712fd

yhamoudi commented 10 years ago

Just a question about something strange (it's on branch non_question_word_questions). For all questions starting with Is or What, the algorithm (in questionIdentify.py) fails to identify the question word.

Question words are defined in the closeQuestionWord and openQuestionWord maps; Is and What are the first items of these maps. If you move them to another position in the map, it works! In fact, only the first word of each map is not recognized. How is that possible? (The problem doesn't appear on branch triple_standardize, which uses only one map for all question words.)

progval commented 10 years ago

Because the first item of the list is:

 """
        Taken from: http://www.interopia.com/education/all-question-words-in-english/
        Rarely used: Wherefore, Whatever, Wherewith, Whither, Whence, However
    What"""

(what you intended as a docstring was actually concatenated with the first item)

Docstrings are not denoted by the """ characters (those just say “that's a multiline string”) but by their position in the code.
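
To illustrate the point, here is a small self-contained sketch (the list contents and the function name are made up for the example, not taken from questionIdentify.py). Python joins adjacent string literals at compile time, so a triple-quoted string placed before the first element of a list is merged into that element, whereas a string that is the first statement of a function body becomes its docstring.

```python
# Adjacent string literals are concatenated by the parser, so the
# triple-quoted "comment" silently becomes part of the first element.
words = [
    """
    Source note that was meant to be a docstring.
    """
    "What",
    "Who",
]


def documented():
    """This one is a real docstring: first statement of the function body."""
    return None


# The list has two elements, and neither of them is exactly "What".
print(len(words))
print("What" in words)
print(documented.__doc__)
```

A regular # comment inside the list avoids the problem, since comments are never part of any string.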

yhamoudi commented 10 years ago

OK, so no docstring inside a map, I imagine?

Ezibenroc commented 10 years ago

For case-insensitivity: couldn't we simply write all the question words in the list in lowercase, and then use the .lower() method in the function identifyQuestionWord? Defining a new class just for this seems a bit "heavy"...

progval commented 10 years ago

That's what the class does, actually.
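
The class from the linked commit is not reproduced in this thread. As an illustration of the general idea only, a hypothetical wrapper (CaseInsensitiveDict is an invented name, and the sample values are made up) could lowercase keys both when storing and when looking up, so that callers do not have to remember to call .lower() themselves:

```python
class CaseInsensitiveDict(dict):
    """Hypothetical dict whose string keys are compared case-insensitively."""

    def __init__(self, items=()):
        super().__init__()
        # Go through __setitem__ so that every initial key is lowercased too.
        for key, value in dict(items).items():
            self[key] = value

    def __setitem__(self, key, value):
        super().__setitem__(key.lower(), value)

    def __getitem__(self, key):
        return super().__getitem__(key.lower())

    def __contains__(self, key):
        return super().__contains__(key.lower())

    def get(self, key, default=None):
        # dict.get does not call __getitem__, so it needs its own override.
        return super().get(key.lower(), default)


question_words = CaseInsensitiveDict({"Who": "person", "WHERE": "place"})
```

Under these assumptions the behaviour is the same as a lowercase list plus .lower() at each call site; the class only centralizes the normalization in one place.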