ProjetPP / PPP-QuestionParsing-Grammatical

Question Parsing module for the PPP using a grammatical approach
GNU Affero General Public License v3.0

Quotations considered harmful #51

Closed Ezibenroc closed 9 years ago

Ezibenroc commented 9 years ago

The sentence Who is the author of "Let It Be" and "Lucy in the Sky with Diamonds"? produces the following triple, which is wrong:

{
    "object": {
        "type": "missing"
    }, 
    "subject": {
        "list": [
            {
                "type": "resource", 
                "value": "Let It Be"
            }, 
            {
                "type": "resource", 
                "value": "Lucy in the Sky with Diamonds"
            }
        ], 
        "type": "intersection"
    }, 
    "type": "triple", 
    "predicate": {
        "type": "resource", 
        "value": "author"
    }
}

Whereas the sentence Who is the author of "Foundation" and "Robot"? produces the following triple, which is ok:

{
    "list": [
        {
            "subject": {
                "value": "Foundation", 
                "type": "resource"
            }, 
            "type": "triple", 
            "object": {
                "type": "missing"
            }, 
            "predicate": {
                "value": "author", 
                "type": "resource"
            }
        }, 
        {
            "subject": {
                "value": "Robot", 
                "type": "resource"
            }, 
            "type": "triple", 
            "object": {
                "type": "missing"
            }, 
            "predicate": {
                "value": "author", 
                "type": "resource"
            }
        }
    ], 
    "type": "intersection"
}

These sentences have exactly the same grammatical structure, which shows the need to handle quotations differently. The idea @yhamoudi suggested is to replace each quotation with a unique word in the sentence given to Stanford CoreNLP, then replace these unique words with the original quotations at the end. Replacing the first quotation with foo37 and the second with foo42 produces the expected triple: it seems that non-existent words do not disturb the Stanford library.
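A minimal sketch of that placeholder idea, in Python. The function names `mask_quotations` and `restore_quotations` are hypothetical and not taken from the module, and the call to the parser itself is left out. The placeholders here are numbered from `foo0` rather than using foo37 and foo42, and the restore step assumes the parsed output is a nested structure of dicts, lists and strings like the triples above.

```python
import re

# Matches a double-quoted span; nested or escaped quotes are not handled.
QUOTE_RE = re.compile(r'"([^"]*)"')


def mask_quotations(sentence, prefix="foo"):
    """Replace each quotation with a unique placeholder word.

    Returns the masked sentence and a dict placeholder -> original quotation.
    `prefix` is assumed to be alphanumeric.
    """
    mapping = {}

    def substitute(match):
        index = len(mapping)
        placeholder = f"{prefix}{index}"
        # Skip any placeholder that already occurs in the sentence or was used.
        while placeholder in sentence or placeholder in mapping:
            index += 1
            placeholder = f"{prefix}{index}"
        mapping[placeholder] = match.group(1)
        return placeholder

    return QUOTE_RE.sub(substitute, sentence), mapping


def restore_quotations(tree, mapping):
    """Put the original quotations back into a parsed dict/list/str structure."""
    if not mapping:
        return tree
    # Longest placeholders first, so that foo1 never matches inside foo10.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, names)) + r")\b")

    def walk(node):
        if isinstance(node, dict):
            return {key: walk(value) for key, value in node.items()}
        if isinstance(node, list):
            return [walk(value) for value in node]
        if isinstance(node, str):
            return pattern.sub(lambda m: mapping[m.group(0)], node)
        return node

    return walk(tree)


masked, mapping = mask_quotations(
    'Who is the author of "Let It Be" and "Lucy in the Sky with Diamonds"?'
)
# masked == 'Who is the author of foo0 and foo1?'
```

The masked sentence is what would be sent to the parser, and `restore_quotations` would be applied to the resulting triple afterwards. Whether the parser treats such placeholders like ordinary nouns in every case is exactly the open question raised in this thread.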

yhamoudi commented 9 years ago

If we are sure that strange words don't disturb the Stanford parser (we should test with really long strange words), then the quotation merging can be done very easily:

Thus, we avoid temporary words such as foo37 and we don't have to store extra information.

Ezibenroc commented 9 years ago

This prevents users from using that special character in their quotations, which is unfortunate.

yhamoudi commented 9 years ago

I'm sure we can find a string the user is not supposed to use, @*$ for instance (no results on Google).

Ezibenroc commented 9 years ago

We can generate this symbol/string randomly and check that it does not appear in the sentence (otherwise, we regenerate it).
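A rough sketch of that generate-and-check loop, in Python. The name `fresh_separator`, the default length, the retry count and the choice of punctuation as the alphabet are all assumptions made for illustration. The candidate gets longer after a few failed draws, so the loop always ends: a string longer than the sentence cannot occur in it.

```python
import random
import string

# Assumed alphabet: ASCII punctuation without the double quote,
# since the double quote delimits the quotations themselves.
ALPHABET = string.punctuation.replace('"', "")


def fresh_separator(sentence, length=3, attempts_per_length=10):
    """Draw random strings until one does not appear in the sentence."""
    while True:
        for _ in range(attempts_per_length):
            candidate = "".join(random.choice(ALPHABET) for _ in range(length))
            if candidate not in sentence:
                return candidate
        # Every draw collided: use longer candidates for the next round.
        length += 1


sentence = 'Who is the author of "Let It Be" and "@*$"?'
separator = fresh_separator(sentence)
assert separator not in sentence
```

Because the separator is drawn per sentence, it has to be kept alongside the sentence until the quotations are split again. How the parser tokenizes a run of punctuation is not checked here.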

yhamoudi commented 9 years ago

You will have to store, for each quotation, which string was used to merge it (otherwise, you don't know how to split it). And someone can still use every possible string of our random set.

Do you really want to handle the only user in the world who will write a quotation containing @*$? :)

Ezibenroc commented 9 years ago

> Do you really want to handle the only user in the world who will write a quotation containing @*$? :)

Since our code is readable by everyone, the probability that someone uses @*$ in a quotation to make our module fail (e.g. at the public demo, as a "little joke") is non-zero :wink:

Ezibenroc commented 9 years ago

Fixed in #52