Closed Ezibenroc closed 9 years ago
If we are sure that strange words don't disturb the Stanford parser (we should test with really long strange words), then you can perform the quotation merging very easily:
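Where is the "city of light"?
Replace the spaces inside the quotation with a special character (_ can work): Where is the "city_of_light"?
After parsing, split the merged word back: city_of_light -> city of light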
Thus, we avoid temporary words such as foo37 and we don't have to store extra information.
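A minimal sketch of this merging, assuming quotations are delimited by plain double quotes. The names `merge_quotations` and `split_quotation` are hypothetical and only illustrate the idea:

```python
import re


def merge_quotations(sentence, sep="_"):
    """Join the words of every double-quoted span with `sep`."""
    # Assumption: the separator must be absent from the sentence,
    # otherwise the merged word cannot be split back unambiguously.
    if sep in sentence:
        raise ValueError("separator already appears in the sentence")
    return re.sub(
        r'"([^"]*)"',
        lambda m: '"' + m.group(1).replace(" ", sep) + '"',
        sentence,
    )


def split_quotation(word, sep="_"):
    """Turn a merged word back into the original quotation text."""
    return word.replace(sep, " ")


print(merge_quotations('Where is the "city of light"?'))  # Where is the "city_of_light"?
print(split_quotation("city_of_light"))  # city of light
```

Nothing has to be stored between the two steps: the separator alone is enough to undo the merge.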
It forbids the user from using this special character in their quotations, which is sad.
I'm sure we can find a string the user is not supposed to use, @*$ for instance (no result on Google).
We can randomly generate this symbol/string and check that it does not appear in the sentence (otherwise, we regenerate it).
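As a rough sketch of that regeneration loop (the function name `pick_separator`, the 3-character length and the punctuation alphabet are assumptions, and whether the parser tolerates such strings would still have to be tested):

```python
import random
import string


def pick_separator(sentence, length=3, rng=random):
    """Draw random punctuation strings until one is absent from `sentence`."""
    # Assumption: the double quote is excluded so that the separator
    # cannot be confused with a quotation delimiter.
    alphabet = string.punctuation.replace('"', "")
    while True:
        candidate = "".join(rng.choice(alphabet) for _ in range(length))
        if candidate not in sentence:
            return candidate


sentence = 'Where is the "city of light"?'
separator = pick_separator(sentence)
print(repr(separator))
```

The loop only fails to terminate if the sentence contains every possible candidate, which would require an extremely long input for this alphabet and length.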
You will have to store, for each quotation, which string was used to merge it (otherwise, you don't know how to split it). And someone can still use all the possible strings of our random set.
You really want to take care of the only user in the world who will make a quotation with @*$ in it? :)
Since our code is readable by everyone, the probability that someone uses @*$ in a quotation to make our module fail (e.g. at the public demo, to make a "little joke") is non-zero :wink:
Fixed in #52
The sentence
Who is the author of "Let It Be" and "Lucy in the Sky with Diamonds"?
produces the following triple, which is wrong:
Whereas the sentence
Who is the author of "Foundation" and "Robot"?
produces the following triple, which is ok:
These sentences have exactly the same grammatical structure. This shows the need to handle quotations differently. The idea @yhamoudi suggested is to replace each quotation with a unique word in the sentence given to Stanford CoreNLP, and to replace these unique words with the original quotations at the end. Replacing the first quotation with foo37 and the second with foo42 produces the expected triple: it seems that non-existent words do not disturb the Stanford library.
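The placeholder idea could look roughly like the sketch below. The helper names and the foo0, foo1, ... numbering are hypothetical, and the call to Stanford CoreNLP between the two steps is left out:

```python
import itertools
import re


def replace_quotations(sentence, prefix="foo"):
    """Swap each double-quoted span for a unique placeholder word."""
    mapping = {}
    counter = itertools.count()

    def substitute(match):
        # Skip any candidate that already occurs in the sentence.
        placeholder = next(
            p for p in (f"{prefix}{i}" for i in counter) if p not in sentence
        )
        mapping[placeholder] = match.group(1)
        return placeholder

    return re.sub(r'"([^"]*)"', substitute, sentence), mapping


def restore_quotations(text, mapping):
    """Put the original quotations back in place of the placeholders."""
    if not mapping:
        return text
    # Longest placeholders first, so foo1 never matches inside foo10.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


question = 'Who is the author of "Let It Be" and "Lucy in the Sky with Diamonds"?'
masked, mapping = replace_quotations(question)
print(masked)  # Who is the author of foo0 and foo1?
print(restore_quotations("foo1", mapping))  # Lucy in the Sky with Diamonds
```

Unlike the separator approach, this one needs the mapping to be kept until the parser output has been processed.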