dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

TigerXmlReader produces wrong range when a target is noncontiguous #875

Open maxxkia opened 8 years ago

maxxkia commented 8 years ago

TigerXmlReader produces wrong begin and end index for target (SemPred) of a semantic frame when the target is noncontiguous.

For instance, in the sentence w1 w2 w3 w4 w5 w6 w7, if a target consists of w2 and w5, then the begin and end indexes of the target are wrongly set to:

target.begin = w2.begin;
target.end = w5.end;
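
To illustrate why this range is wrong, here is a minimal sketch. It assumes character offsets in which each token is two characters long and followed by one space; the Tok class is a hypothetical stand-in, not the DKPro Core Token type. It lists the tokens that fall inside the range from w2.begin to w5.end:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token annotation (not the DKPro Core Token type).
class Tok {
    final String text;
    final int begin;
    final int end;

    Tok(String text, int begin, int end) {
        this.text = text;
        this.begin = begin;
        this.end = end;
    }
}

class EnclosingRangeDemo {
    // Builds "w1 w2 w3 w4 w5 w6 w7": two characters per token plus one space.
    static List<Tok> sentence() {
        List<Tok> tokens = new ArrayList<>();
        for (int i = 0; i < 7; i++) {
            tokens.add(new Tok("w" + (i + 1), i * 3, i * 3 + 2));
        }
        return tokens;
    }

    // Returns the texts of all tokens lying completely inside [begin, end].
    static List<String> coveredBy(List<Tok> tokens, int begin, int end) {
        List<String> covered = new ArrayList<>();
        for (Tok t : tokens) {
            if (t.begin >= begin && t.end <= end) {
                covered.add(t.text);
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        List<Tok> tokens = sentence();
        Tok w2 = tokens.get(1);
        Tok w5 = tokens.get(4);
        // Range as described above: begin of w2, end of w5.
        System.out.println(coveredBy(tokens, w2.begin, w5.end)); // prints [w2, w3, w4, w5]
    }
}
```

The range also covers w3 and w4, which are not part of the target.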

To fix this issue:

reckart commented 8 years ago

Please also add a note to the NOTICE.txt file about where you got the new Tiger sample from.

maxxkia commented 8 years ago

How should we deal with neighbouring tokens? For example, in w1 w2 w3 w4 w5 w6 w7, if a target is made up of w2 and w3, which of the following is the correct assignment of begin and end for the first element?

a)

begin = w2.begin
end = w3.end

b)

begin = w2.begin
end = w2.end

@reckart any suggestions?

reckart commented 8 years ago

I think for continuous spans, we can just extend the offsets. I think the only problem is non-continuous spans, because we presently do not have a concept to represent these in DKPro Core.

maxxkia commented 8 years ago

@reckart The problem that I imagined to be discontinuous frame arguments (#895) turned out to be another issue.

Having the following example:

<frame name="SubjectiveExpression" id="s6_f2">
    <target>
        <fenode idref="s6_3"/>
        <fenode idref="s6_2"/>
    </target>
    <fe name="Source" id="s6_f2_e1">
        <flag name="Sprecher">
        </flag>
    </fe>
    <fe name="Target" id="s6_f2_e2">
        <fenode idref="s6_4"/>
        <fenode idref="s6_503"/>
        <fenode idref="s6_5"/>
    </fe>
</frame>

In this example, when the reader processes the frame element named Target (id="s6_f2_e2"), it creates three instances of SemArgLink with the role set to Target, each linking to an instance of SemArg representing the annotation covered by one of the fenodes (i.e. s6_4, s6_503 and s6_5).

These SemArgLinks are accessible as arguments of a SemPred:

FSArray arguments = element.getArguments();

However, since the SemArgLink instances belonging to a single argument are not stored in their own collection, one has to iterate over all of them to identify each SemArgLink group. One solution would be to group them by the name of their frame element (i.e. Target in this case), but I'm not sure that value is distinct (can there be two FE in a TigerXml file having the same name but different ids?). Also note that the frame element id (i.e. s6_f2_e2), which could be used to uniquely identify arguments, is dropped in TigerXmlReader.
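
A rough sketch of the two grouping options, using a hypothetical ArgLink class instead of the real SemArgLink type. It assumes the frame element id were kept on each link, which TigerXmlReader currently does not do; the second id, s6_f2_e3, is made up for illustration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a SemArgLink that also remembers its FE id.
class ArgLink {
    final String role;   // FE name, e.g. "Target"
    final String feId;   // FE id, e.g. "s6_f2_e2" (assumed to be retained)
    final String nodeId; // referenced node, e.g. "s6_4"

    ArgLink(String role, String feId, String nodeId) {
        this.role = role;
        this.feId = feId;
        this.nodeId = nodeId;
    }
}

class LinkGrouping {
    // Groups links by role name; ambiguous if two FEs share a name.
    static Map<String, List<ArgLink>> byRole(List<ArgLink> links) {
        Map<String, List<ArgLink>> groups = new LinkedHashMap<>();
        for (ArgLink link : links) {
            groups.computeIfAbsent(link.role, k -> new ArrayList<>()).add(link);
        }
        return groups;
    }

    // Groups links by FE id; unambiguous as long as ids are unique.
    static Map<String, List<ArgLink>> byFeId(List<ArgLink> links) {
        Map<String, List<ArgLink>> groups = new LinkedHashMap<>();
        for (ArgLink link : links) {
            groups.computeIfAbsent(link.feId, k -> new ArrayList<>()).add(link);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<ArgLink> links = new ArrayList<>();
        links.add(new ArgLink("Target", "s6_f2_e2", "s6_4"));
        links.add(new ArgLink("Target", "s6_f2_e2", "s6_503"));
        links.add(new ArgLink("Target", "s6_f2_e2", "s6_5"));
        // Hypothetical second FE with the same role name but another id.
        links.add(new ArgLink("Target", "s6_f2_e3", "s6_7"));

        System.out.println(byRole(links).size());  // one group: both FEs collapse
        System.out.println(byFeId(links).size());  // two groups: one per FE
    }
}
```

Grouping by role name is only safe if no frame has two FEs with the same name; grouping by id would avoid that question but needs the id to survive the import.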

maxxkia commented 8 years ago

This problem came up when I tried to identify the boundaries of sources and targets for subjective expressions.

reckart commented 8 years ago

can there be two fe in a TigerXml file having the same name but different ids?

In principle, yes. There could be two arguments with the same role name.

    <fe name="Target" id="s6_f2_e2">
        <fenode idref="s6_4"/>
        <fenode idref="s6_503"/>
        <fenode idref="s6_5"/>
    </fe>

I think that if these three are adjacent tokens, they should be merged into a single SemArg span. If they are not adjacent tokens, then we have a discontinuous SemArg. Does that make sense to you?
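
The merge rule could be sketched roughly as follows. This is only an illustration, not the reader's implementation: it assumes every fenode has already been resolved to 0-based token positions within the sentence, and it works on positions rather than character offsets so that whitespace between tokens does not matter:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SpanMerger {
    // Merges token positions into runs of consecutive positions.
    // Each run is returned as {firstPosition, lastPosition}.
    static List<int[]> mergeAdjacent(List<Integer> tokenPositions) {
        // Sort first: fenodes are not necessarily listed in text order.
        List<Integer> sorted = new ArrayList<>(tokenPositions);
        Collections.sort(sorted);

        List<int[]> runs = new ArrayList<>();
        for (int pos : sorted) {
            int[] last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
            if (last != null && pos <= last[1] + 1) {
                last[1] = Math.max(last[1], pos); // extend the current run
            } else {
                runs.add(new int[] {pos, pos});   // gap found: start a new run
            }
        }
        return runs;
    }

    // More than one run means the argument cannot be a single span.
    static boolean isDiscontinuous(List<Integer> tokenPositions) {
        return mergeAdjacent(tokenPositions).size() > 1;
    }

    public static void main(String[] args) {
        // Adjacent tokens given out of order, as in the <target> above.
        System.out.println(isDiscontinuous(Arrays.asList(3, 2))); // prints false
        // w2 and w5 from the original report (positions 1 and 4).
        System.out.println(isDiscontinuous(Arrays.asList(1, 4))); // prints true
    }
}
```

A single run can become one SemArg span from the first token's begin to the last token's end; several runs are the discontinuous case that still needs a representation.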

reckart commented 8 years ago

Only one SemArgLink & SemArg should IMHO be per FE.

maxxkia commented 8 years ago

Actually, in this example and in many more examples that I checked manually, the constituents of an FE are adjacent and can be merged. I should write a piece of code to see whether any discontinuous FE exists in my dataset.
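
A possible first step for such a check, sketched with the JDK's DOM parser: list every fe element that references more than one fenode. The class name is made up, and this only finds candidates. Deciding whether a candidate is really discontinuous would still require resolving each idref (including non-terminal nodes such as s6_503) to its tokens.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MultiNodeFeFinder {
    // Maps the id of each <fe> with more than one <fenode> to its idrefs.
    static Map<String, List<String>> find(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Map<String, List<String>> result = new LinkedHashMap<>();
            NodeList fes = doc.getElementsByTagName("fe");
            for (int i = 0; i < fes.getLength(); i++) {
                Element fe = (Element) fes.item(i);
                NodeList nodes = fe.getElementsByTagName("fenode");
                if (nodes.getLength() > 1) {
                    List<String> refs = new ArrayList<>();
                    for (int j = 0; j < nodes.getLength(); j++) {
                        refs.add(((Element) nodes.item(j)).getAttribute("idref"));
                    }
                    result.put(fe.getAttribute("id"), refs);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<frame name=\"SubjectiveExpression\" id=\"s6_f2\">"
                + "<target><fenode idref=\"s6_3\"/><fenode idref=\"s6_2\"/></target>"
                + "<fe name=\"Source\" id=\"s6_f2_e1\"><flag name=\"Sprecher\"></flag></fe>"
                + "<fe name=\"Target\" id=\"s6_f2_e2\"><fenode idref=\"s6_4\"/>"
                + "<fenode idref=\"s6_503\"/><fenode idref=\"s6_5\"/></fe>"
                + "</frame>";
        System.out.println(find(xml)); // prints {s6_f2_e2=[s6_4, s6_503, s6_5]}
    }
}
```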

Only one SemArgLink & SemArg should IMHO be per FE.

I agree, since I haven't yet seen any example violating this condition.

maxxkia commented 8 years ago

@reckart I was able to find discontinuous FE instances (see #895).