ThiagoCF05 / webnlg

The enriched version of the WebNLG described at INLG 2018

Nice dataset! Question regarding: Sentence segmentation in the alignment between "sortedtripleset" and original text #16

Open oplatek opened 1 year ago

oplatek commented 1 year ago

Thank you for making the WebNLG dataset with the alignment available!

We would like to align the sentences in the original text with the triples in sortedtripleset.

Is there a function/procedure which replicates the segmentation perfectly?

Here is the example from the README to ground what I mean by the original text and sortedtripleset.

...
<lex comment="good" lid="Id1">
        <!-- ordered tripleset segmented in sentences -->
        <sortedtripleset>
            <sentence ID="1">
                <striple>11th_Mississippi_Infantry_Monument | location | Adams_County,_Pennsylvania</striple>
            </sentence>
            <sentence ID="2">
                <striple>11th_Mississippi_Infantry_Monument | established | 2000</striple>
                <striple>11th_Mississippi_Infantry_Monument | category | Contributing_property</striple>
            </sentence>
        </sortedtripleset>
        <!-- extracted referring expressions -->
        <references>
            <reference entity="11th_Mississippi_Infantry_Monument" number="1" tag="AGENT-1" type="description">The 11th Mississippi Infantry Monument</reference>
            <reference entity="Adams_County,_Pennsylvania" number="2" tag="PATIENT-1" type="name">Adams County , Pennsylvania</reference>
            <reference entity="11th_Mississippi_Infantry_Monument" number="3" tag="AGENT-1" type="pronoun">It</reference>
            <reference entity="2000" number="4" tag="PATIENT-2" type="name">2000</reference>
            <reference entity="Contributing_property" number="5" tag="PATIENT-3" type="name">contributing property</reference>
        </references>
        <!-- original text -->
        <text>
            The 11th Mississippi Infantry Monument which is located in Adams County, Pennsylvania. It was established in 2000 and falls under the category of contributing property.
        </text>
...
oplatek commented 1 year ago

At the moment, we use the number of "sortedtripleset" sentences, i.e. the highest sentence ID, as a checksum. We were able to segment ~98% of the original texts so that the number of sentences matches the number of sentences referenced in "sortedtripleset". However, this is only a heuristic: matching counts do not guarantee that the sentence boundaries match.
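
For concreteness, here is a minimal Python sketch of the count check described above. It is not the dataset authors' segmentation procedure. The regex splitter, the helper names `split_sentences` and `align`, and the trimmed `SAMPLE` entry (modelled on the README excerpt) are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the README example; the <references> block is omitted.
SAMPLE = """<lex comment="good" lid="Id1">
  <sortedtripleset>
    <sentence ID="1">
      <striple>11th_Mississippi_Infantry_Monument | location | Adams_County,_Pennsylvania</striple>
    </sentence>
    <sentence ID="2">
      <striple>11th_Mississippi_Infantry_Monument | established | 2000</striple>
      <striple>11th_Mississippi_Infantry_Monument | category | Contributing_property</striple>
    </sentence>
  </sortedtripleset>
  <text>The 11th Mississippi Infantry Monument which is located in Adams County, Pennsylvania. It was established in 2000 and falls under the category of contributing property.</text>
</lex>"""


def split_sentences(text):
    """Naive splitter on sentence-final punctuation (an assumption,
    not the segmenter used to build the dataset)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def align(lex):
    """Pair each text sentence with the triples of the <sentence> that has
    the same ID. Returns None when the count check fails."""
    groups = {}
    for sent in lex.find("sortedtripleset").findall("sentence"):
        triples = [(s.text or "").strip() for s in sent.findall("striple")]
        groups[int(sent.get("ID"))] = triples

    # Checksum: the highest sentence ID must equal the number of
    # sentences produced by the splitter.
    expected = max(groups, default=0)
    sentences = split_sentences(lex.findtext("text", default=""))
    if len(sentences) != expected:
        return None
    return [(sentences[i - 1], groups.get(i, [])) for i in range(1, expected + 1)]


pairs = align(ET.fromstring(SAMPLE))
```

Entries where `align` returns `None` are the ones that would need manual inspection or a better splitter. A passing check only confirms that the counts agree.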