NN dependency - Githubissues

Tpt commented 9 years ago

Original post: "Where is the Panama canal" is broken Link: http://askplatyp.us/?lang=en&q=Where+is+the+Panama+canal%3F

It would be very nice to create some kind of automatized tests (maybe using log data) in order to avoid such regressions. @Ezibenroc Could you do it?

EDIT (by Ezibenroc)

The nn dependency heuristic does not work well on simple questions:

"Where is the Panama canal?" ("Panama" is tagged LOCATION)
"Who are the daughters of Louis XIV?" ("Louis" is tagged LOCATION)

Ezibenroc commented 9 years ago

It would be very nice to create some kind of automatized tests

Already done:

Ezibenroc commented 9 years ago

I think the issue come from #106, since there is a nn dependency between canal (undef tag) and Panama (LOCATION tag).

Tpt commented 9 years ago

Very nice tests, but I was thinking more about something at the Platypus level (that would just check that we don't get a "no answer" result).

But as you have now integration tests, the priority is really lower.

Tpt commented 9 years ago

I don't know if it is linked but questions like "Who are the daughters of Louis XIV?" don't work anymore

Ezibenroc commented 9 years ago

Yes, same thing: nn dependency, and Louis is tagged LOCATION (wtf?!) whereas XIV is tagged undef.

Ezibenroc commented 9 years ago

The big problem is that these questions are exactly the same as "Who is the France president?". In both cases, there is the dependency X -nn-> Y where X is tagged undef and Y is not.

Thus, with a grammatical approach, we will merge "Louis XIV" if and only if we merge "France president".

This is again our problem of "named entity recognition" (NER).

We could hope that the Stanford NER will become better.
Maybe we could improve it ourself (can it be trained?).
Otherwise we could do some "wikidata NER" in preprocessing: scanning the sentence, and when we see a group of words which represent a wikidata item or alias we put them into quotation marks. For instance, we would do Who is Louis XIV? → Who is “Louis XIV”? and Who is the France president? → Who is the “France” “president”?.

progval commented 9 years ago

Is “Who is the France president?” valid English?

Ezibenroc commented 9 years ago

Good question, I don't know. Asked on StackExchange.

yhamoudi commented 9 years ago

Who are the daughters of Louis XIV?

If you use the latest version of the Stanford parser, both Louis and XIV are tagged LOCATION, and so we obtain the right triple.

Where is the Panama Canal?

It works if you put an uppercase letter for Canal.

Where is the Panama canal?

Not so bad, we obtain: ((Panama,canal,?),location,?). If canal was tagged correctly (ie LOCATION), it will be fine (and this is what happens when you write Canal).

Who is the France president? / Where is the Panama Canal?

All these questions are equivalent:

Where is the Panama canal?
Who is the US president?
Who is the United States president?

Actually we do not merge (Panama <> canal, US <> president, United States <> president).

Maybe we could improve it ourself (can it be trained?).

yes

Otherwise we could do some "wikidata NER" in preprocessing: scanning the sentence, and when we see a group of words which represent a wikidata item or alias we put them into quotation marks. For instance, we would do Who is Louis XIV? → Who is “Louis XIV”? and Who is the France president? → Who is the “France” “president”?.

yes, see https://github.com/ProjetPP/PPP-QuestionParsing-Grammatical/issues/64 and https://github.com/ProjetPP/PPP-QuestionParsing-Grammatical/issues/85 (and i propose to close this issue since 2 other ones are opened on the same topic)

yhamoudi commented 9 years ago

How to update its version of the stanford parser:

install java 8
run https://github.com/ProjetPP/Scripts/blob/master/bootstrap_corenlp.sh (git pull before)
in Scripts, start the servor with CORENLP="stanford-corenlp-full-2015-01-30" CORENLP_OPTIONS="-parse.flags \" -makeCopulaHead\"" python3 -m corenlp

yhamoudi commented 9 years ago

See https://github.com/ProjetPP/PPP-QuestionParsing-Grammatical/issues/123

Ezibenroc commented 9 years ago

If you use the latest version of the Stanford parser, both Louis and XIV are tagged LOCATION, and so we obtain the right triple.

I have the latest version of the Stanford Parser. In the question "Who is Louis XIV?", "Louis" and "XIV" are tagged PERSON. In the question "Who is the daughter of Louis XIV?", "Louis" is tagged LOCATION and "XIV" is not tagged. And even is "XIV" was tagged LOCATION, we do not want to use this, since the tag is wrong.

According to StackExchange, the only correct form without using "of" is "Who is France's president?".

yhamoudi commented 9 years ago

In the question "Who is the daughter of Louis XIV?", "Louis" is tagged LOCATION and "XIV" is not tagged.

It's strange. I add it to deep_tests, let's travis decides: https://travis-ci.org/ProjetPP/PPP-QuestionParsing-Grammatical/builds/51077581

yhamoudi commented 9 years ago

Travis is with me :) Are you sure that you run the latest version: CORENLP="stanford-corenlp-full-2015-01-30" CORENLP_OPTIONS="-parse.flags \" -makeCopulaHead\"" python3 -m corenlp instead of CORENLP="stanford-corenlp-full-2014-08-27" CORENLP_OPTIONS="-parse.flags \" -makeCopulaHead\"" python3 -m corenlp?

Ezibenroc commented 9 years ago

My bad, I had the two installations in conflict...

Ezibenroc commented 9 years ago

According to StackExchange, "Who is the France president" is incorrect.

I think we whould use our previous heuristic for nn dependency: always merge. I reopen the issue since it is no more a problem of NER.

yhamoudi commented 9 years ago

What about the following ones:

Who is the US president?
Who is the United States president?

We need to be sure that these questions are incorrect and will not be use in practice by the users

yhamoudi commented 9 years ago

seems not used except for US ...

yhamoudi commented 9 years ago

Who is the French president? (not an nn relation) > lemmatization is not able to convert french into france

Ezibenroc commented 9 years ago

Same thing, you juste replaced "France" by "US" and "United States"...

We do not have to handle incorrect sentences (for the same reason, we do not have any spell-checker within our module: we suppose the input sentence to be correct).

Moreover, this sentences seems to be very odd to the native speakers, so they should not be asked very often.

Ezibenroc commented 9 years ago

Who is the French president? (not an nn relation) > lemmatization is not able to convert french into france

This is not the subject of this issue...

yhamoudi commented 9 years ago

Fixed https://github.com/ProjetPP/PPP-QuestionParsing-Grammatical/commit/3189e903cccfe888da64881d40df084a2fad8c9c

Now we produce (Panama canal, location, ?)

ProjetPP / PPP-QuestionParsing-Grammatical

NN dependency #122