Quotations considered harmful

Ezibenroc commented 9 years ago

The sentence Who is the author of "Let It Be" and "Lucy in the Sky with Diamonds"? produces the following triple, which is wrong:

{
    "object": {
        "type": "missing"
    }, 
    "subject": {
        "list": [
            {
                "type": "resource", 
                "value": "Let It Be"
            }, 
            {
                "type": "resource", 
                "value": "Lucy in the Sky with Diamonds"
            }
        ], 
        "type": "intersection"
    }, 
    "type": "triple", 
    "predicate": {
        "type": "resource", 
        "value": "author"
    }
}

Whereas the sentence Who is the author of "Foundation" and "Robot"? produces the following triple, which is ok:

{
    "list": [
        {
            "subject": {
                "value": "Foundation", 
                "type": "resource"
            }, 
            "type": "triple", 
            "object": {
                "type": "missing"
            }, 
            "predicate": {
                "value": "author", 
                "type": "resource"
            }
        }, 
        {
            "subject": {
                "value": "Robot", 
                "type": "resource"
            }, 
            "type": "triple", 
            "object": {
                "type": "missing"
            }, 
            "predicate": {
                "value": "author", 
                "type": "resource"
            }
        }
    ], 
    "type": "intersection"
}

These sentences have exactly the same grammatical structure, which shows that quotations need to be handled differently. The idea @yhamoudi suggested is to replace each quotation with a unique word in the sentence given to Stanford CoreNLP, then replace these unique words with the original quotations at the end. Replacing the first quotation with foo37 and the second with foo42 produces the expected triple: it seems that non-existent words do not disturb the Stanford library.
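A minimal Python sketch of this placeholder idea, for illustration only. The function names are hypothetical, the placeholder numbering is arbitrary (foo0, foo1, ... rather than foo37, foo42), and it assumes the whole quoted span, quote marks included, is replaced before the sentence is handed to the parser:

```python
import re

QUOTE_RE = re.compile(r'"([^"]*)"')          # a "..." span without nested quotes
PLACEHOLDER_RE = re.compile(r'\bfoo\d+\b')   # the placeholder words generated below


def replace_quotations(sentence):
    """Replace each quotation with a unique placeholder word.

    Returns the rewritten sentence and a dict placeholder -> quoted text.
    """
    mapping = {}

    def substitute(match):
        index = len(mapping)
        placeholder = "foo%d" % index
        # Skip any candidate that already occurs in the sentence or was used before.
        while placeholder in sentence or placeholder in mapping:
            index += 1
            placeholder = "foo%d" % index
        mapping[placeholder] = match.group(1)
        return placeholder

    return QUOTE_RE.sub(substitute, sentence), mapping


def restore_quotations(value, mapping):
    """Put the original quoted text back into a value taken from the result."""
    return PLACEHOLDER_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), value)
```

With the first sentence above, `replace_quotations` would hand `Who is the author of foo0 and foo1?` to the parser, and `restore_quotations` would be applied to each resource value of the resulting triple. The mapping has to be kept around until then.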

yhamoudi commented 9 years ago

If we are sure that strange words don't disturb the Stanford parser (we should test with really long strange words), then the quotation merging can be done very easily:

take the sentence just before parsing it: Where is the "city of light"?
merge all quotations using a special character (perhaps _ can work): Where is the "city_of_light"?
parse the sentence using the Stanford parser
go through the tree and, for each QUOTATION node, expand the word it contains: city_of_light -> city of light

Thus, we avoid temporary words such as foo37 and we don't have to store extra information.
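
A rough Python sketch of the merge and expand steps, under stated assumptions: the function names are hypothetical, `_` is used as the special character, and how QUOTATION nodes are located in the parse tree is not shown (only the word-level expansion is).

```python
import re

SEPARATOR = "_"  # assumption: the special character suggested above


def merge_quotations(sentence, separator=SEPARATOR):
    """Join the words of every "..." span with the separator, keeping the quote marks."""
    def merge(match):
        words = match.group(1).split()
        return '"' + separator.join(words) + '"'

    return re.sub(r'"([^"]*)"', merge, sentence)


def expand_word(word, separator=SEPARATOR):
    """Undo the merge for the word held by a QUOTATION node."""
    return word.replace(separator, " ")
```

Here `merge_quotations` would be called just before parsing and `expand_word` on each QUOTATION node afterwards. Note that runs of whitespace inside a quotation collapse to a single space in this sketch.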

Ezibenroc commented 9 years ago

This forbids users from using this special character in their quotations, which is sad.

yhamoudi commented 9 years ago

I'm sure we can find a string the user is not supposed to use, @*$ for instance (no results on Google).

Ezibenroc commented 9 years ago

We can generate this symbol/string randomly and check that it does not appear in the sentence (otherwise, we regenerate it).
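
Something like the following Python sketch, where the function name, the default length, the alphabet and the retry limit are all arbitrary assumptions made for illustration:

```python
import random
import string


def pick_separator(sentence, length=3,
                   alphabet=string.ascii_letters + string.digits,
                   max_tries=1000):
    """Draw random strings until one does not appear in the sentence."""
    for _ in range(max_tries):
        candidate = "".join(random.choice(alphabet) for _ in range(length))
        if candidate not in sentence:
            return candidate
    # Give up instead of looping forever if every draw collides.
    raise ValueError("no unused separator found in %d tries" % max_tries)
```

Absence from the sentence is the only check made here. A separator such as `aa` placed after a word ending in `a` could still be split at the wrong position, so a round-trip check (merge, expand, compare) would be a safer test before using the result.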

yhamoudi commented 9 years ago

You will have to store, for each quotation, which string was used to merge it (otherwise, you don't know how to split it). And someone can still use all the possible strings of our random set.

Do you really want to take care of the only user in the world who will make a quotation with @*$ in it? :)

Ezibenroc commented 9 years ago

> Do you really want to take care of the only user in the world who will make a quotation with @*$ in it? :)

Since our code is readable by everyone, the probability that someone uses @*$ in a quotation to make our module fail (e.g. at the public demo, to make a "little joke") is non-zero :wink:

Ezibenroc commented 9 years ago

Fixed in #52

ProjetPP / PPP-QuestionParsing-Grammatical

Quotations considered harmful #51