Closed: Tpt closed this issue 9 years ago.
It would be very nice to create some kind of automated tests
Already done:
I think the issue comes from #106, since there is an `nn` dependency between `canal` (`undef` tag) and `Panama` (`LOCATION` tag).
Very nice tests, but I was thinking more about something at the Platypus level (that would just check that we don't get a "no answer" result).
But as you now have integration tests, the priority is really lower.
I don't know if it is linked, but questions like "Who are the daughters of Louis XIV?" don't work anymore.
Yes, same thing: `nn` dependency, and `Louis` is tagged `LOCATION` (wtf?!) whereas `XIV` is tagged `undef`.
The big problem is that these questions are exactly the same as "Who is the France president?".
In both cases, there is the dependency `X -nn-> Y` where `X` is tagged `undef` and `Y` is not.
Thus, with a grammatical approach, we will merge "Louis XIV" if and only if we merge "France president".
This is again our problem of "named entity recognition" (NER).
Who is Louis XIV? → Who is “Louis XIV”? and Who is the France president? → Who is the “France” “president”?
Is “Who is the France president?” valid English?
Good question, I don't know. Asked on StackExchange.
Who are the daughters of Louis XIV?
If you use the latest version of the Stanford parser, both `Louis` and `XIV` are tagged `LOCATION`, and so we obtain the right triple.
Where is the Panama Canal?
It works if you put an uppercase letter for `Canal`.
Where is the Panama canal?
Not so bad, we obtain: `((Panama,canal,?),location,?)`. If `canal` were tagged correctly (i.e. `LOCATION`), it would be fine (and this is what happens when you write `Canal`).
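A minimal sketch of the idea discussed here (assuming a simple list of token/NER-tag pairs, not the module's actual data structures): merging adjacent tokens that carry the same non-`undef` tag yields the entity the triple needs, which is exactly why the tagging of `canal` matters.

```python
# Hypothetical sketch: merge consecutive tokens that share a non-"undef"
# NER tag, so that "Panama Canal" becomes a single entity. The token/tag
# pairs below mirror the tags discussed in this thread.

def merge_by_ner(tagged_tokens):
    """Group consecutive tokens carrying the same non-'undef' NER tag."""
    groups = []
    for word, tag in tagged_tokens:
        if groups and tag != "undef" and groups[-1][1] == tag:
            # Same tag as the previous group: extend it.
            groups[-1] = (groups[-1][0] + " " + word, tag)
        else:
            groups.append((word, tag))
    return groups

# "Where is the Panama Canal?" with Canal correctly tagged LOCATION:
print(merge_by_ner([("Panama", "LOCATION"), ("Canal", "LOCATION")]))
# → [('Panama Canal', 'LOCATION')]

# "Where is the Panama canal?" with canal tagged undef: no merge happens.
print(merge_by_ner([("Panama", "LOCATION"), ("canal", "undef")]))
# → [('Panama', 'LOCATION'), ('canal', 'undef')]
```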
Who is the France president? / Where is the Panama Canal?
All these questions are equivalent:
Actually we do not merge (Panama <> canal, US <> president, United States <> president).
Maybe we could improve it ourselves (can it be trained?).
yes
Otherwise we could do some "wikidata NER" in preprocessing: scanning the sentence, and when we see a group of words which represent a wikidata item or alias we put them into quotation marks. For instance, we would do Who is Louis XIV? → Who is “Louis XIV”? and Who is the France president? → Who is the “France” “president”?.
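A sketch of what that preprocessing could look like, assuming a lookup set of known labels (`KNOWN_LABELS` is a stand-in for a real Wikidata label/alias index): greedily match the longest span of words that is a known label and wrap it in quotation marks.

```python
# Hypothetical sketch of the proposed "wikidata NER" preprocessing.
# KNOWN_LABELS stands in for a real Wikidata label/alias lookup.

KNOWN_LABELS = {"Louis XIV", "France", "president"}

def quote_entities(sentence):
    """Wrap the longest known label spans in quotation marks."""
    words = sentence.rstrip("?").split()
    out, i = [], 0
    while i < len(words):
        # Try the longest span first so "Louis XIV" beats "Louis".
        for j in range(len(words), i, -1):
            span = " ".join(words[i:j])
            if span in KNOWN_LABELS:
                out.append('“%s”' % span)
                i = j
                break
        else:
            out.append(words[i])
            i += 1
    return " ".join(out) + "?"

print(quote_entities("Who is Louis XIV?"))
# → Who is “Louis XIV”?
print(quote_entities("Who is the France president?"))
# → Who is the “France” “president”?
```

Greedy longest-match is the simplest policy; a real implementation would have to handle overlapping aliases and case normalization.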
Yes, see https://github.com/ProjetPP/PPP-QuestionParsing-Grammatical/issues/64 and https://github.com/ProjetPP/PPP-QuestionParsing-Grammatical/issues/85 (and I propose to close this issue since two other ones are open on the same topic).
How to update your version of the Stanford parser: see Scripts, then start the server with `CORENLP="stanford-corenlp-full-2015-01-30" CORENLP_OPTIONS="-parse.flags \" -makeCopulaHead\"" python3 -m corenlp`.
If you use the latest version of the Stanford parser, both Louis and XIV are tagged LOCATION, and so we obtain the right triple.
I have the latest version of the Stanford Parser.
In the question "Who is Louis XIV?", "Louis" and "XIV" are tagged `PERSON`.
In the question "Who is the daughter of Louis XIV?", "Louis" is tagged `LOCATION` and "XIV" is not tagged. And even if "XIV" were tagged `LOCATION`, we would not want to use this, since the tag is wrong.
According to StackExchange, the only correct form without using "of" is "Who is France's president?".
In the question "Who is the daughter of Louis XIV?", "Louis" is tagged LOCATION and "XIV" is not tagged.
It's strange. I added it to deep_tests; let's let Travis decide: https://travis-ci.org/ProjetPP/PPP-QuestionParsing-Grammatical/builds/51077581
Travis is with me :) Are you sure that you ran the latest version: `CORENLP="stanford-corenlp-full-2015-01-30" CORENLP_OPTIONS="-parse.flags \" -makeCopulaHead\"" python3 -m corenlp` instead of `CORENLP="stanford-corenlp-full-2014-08-27" CORENLP_OPTIONS="-parse.flags \" -makeCopulaHead\"" python3 -m corenlp`?
My bad, I had the two installations in conflict...
According to StackExchange, "Who is the France president" is incorrect.
I think we should use our previous heuristic for the `nn` dependency: always merge.
I reopen the issue since it is no longer a problem of NER.
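A minimal sketch of the "always merge" heuristic, assuming a simplified representation where dependencies are `(head_index, relation, dependent_index)` triples over the word list (not the module's actual parse structures): every `nn` dependent is collapsed into its head, regardless of NER tags.

```python
# Hypothetical sketch of the "always merge" heuristic for nn dependencies:
# whenever the parse contains X -nn-> Y, collapse the dependent into its
# head, with no NER condition at all.

def merge_nn(words, dependencies):
    """Merge every nn dependent into its head, keeping word order."""
    merged = dict(enumerate(words))
    for head, rel, dep in dependencies:
        if rel == "nn" and dep in merged and head in merged:
            a, b = sorted((head, dep))  # keep left-to-right surface order
            merged[a] = merged[a] + " " + merged.pop(b)
    return [merged[i] for i in sorted(merged)]

# "Who are the daughters of Louis XIV?": nn(XIV, Louis), i.e. head is
# word 6 ("XIV") and dependent is word 5 ("Louis").
print(merge_nn(["Who", "are", "the", "daughters", "of", "Louis", "XIV"],
               [(6, "nn", 5)]))
# → ['Who', 'are', 'the', 'daughters', 'of', 'Louis XIV']
```

The trade-off debated below follows directly: with no NER condition, "France president" would be merged by the same rule.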
What about the following ones:
We need to be sure that these questions are incorrect and will not be used in practice by users.
seems not used except for US ...
(not an `nn` relation) > lemmatization is not able to convert `french` into `france`
Same thing, you just replaced "France" by "US" and "United States"...
We do not have to handle incorrect sentences (for the same reason, we do not have any spell-checker within our module: we suppose the input sentence to be correct).
Moreover, these sentences seem very odd to native speakers, so they should not be asked very often.
Who is the French president? (not an nn relation) > lemmatization is not able to convert french into france
This is not the subject of this issue...
Now we produce (Panama canal, location, ?)
Original post: "Where is the Panama canal" is broken. Link: http://askplatyp.us/?lang=en&q=Where+is+the+Panama+canal%3F
It would be very nice to create some kind of automated tests (maybe using log data) in order to avoid such regressions. @Ezibenroc Could you do it?
EDIT (by Ezibenroc)
The `nn` dependency heuristic does not work well on simple questions.