dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

delete function for AlignedString does not work properly #1482

Open topn0tch opened 4 years ago

topn0tch commented 4 years ago

With the following unit test (added to AlignedStringTest.java):

    @Test
    public void testHtmlDelete() {
        ArrayList<ImmutableInterval> list = new ArrayList<>();
        list.add(new ImmutableInterval(0, 3));
        list.add(new ImmutableInterval(8, 11));
        list.add(new ImmutableInterval(16, 20));
        list.add(new ImmutableInterval(20, 24));
        Collections.reverse(list);
        String short_html = "<p>Hello<p>World</p></p>";
        AlignedString base = new AlignedString(short_html);
        for (ImmutableInterval i : list) {
            base.delete(i.getStart(), i.getEnd());
        }
        System.out.println("Base      : " + base.get() + " - " + base.dataSegmentsToString());
        Collections.reverse(list);
        assertEquals(new ImmutableInterval(0, 0), base.inverseResolve(list.get(0)));
        assertEquals(new ImmutableInterval(5, 5), base.inverseResolve(list.get(1)));
        assertEquals(new ImmutableInterval(10, 10), base.inverseResolve(list.get(2)));
        assertEquals(new ImmutableInterval(10, 10), base.inverseResolve(list.get(3)));
    }

We get the following output:

Base      : HelloWorld - >>[][Hello][World][][]<< (A:0)(O:0[]0)(O:0[Hello]5)(O:5[World]10)(O:10[]10)(O:10[]10)(A:10)

java.lang.AssertionError: 
Expected :[5-5]
Actual   :[10-10]

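Assuming that the empty brackets represent the deleted HTML tags, we noticed that the segment for the "p" tag between Hello and World is missing from the output: there is one empty segment before [Hello], none between [Hello] and [World], and two after [World]. This matches the failed assertion, where inverseResolve for the interval (8, 11) returns [10-10] instead of the expected [5-5].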
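
For reference, the expected values in the assertions can be derived without AlignedString. The sketch below is plain Java and does not use any DKPro Core API; the class ExpectedOffsets and its methods are hypothetical helpers written only for illustration. It assumes the intervals are sorted, non-overlapping, and half-open [start, end), and computes where each deleted interval should collapse to in the shortened string.

```java
import java.util.Arrays;

// Hypothetical helper class, not part of DKPro Core.
class ExpectedOffsets {

    // For each deleted interval, return the position it collapses to in the
    // shortened string: its start minus the number of characters deleted
    // before it. Assumes sorted, non-overlapping [start, end) intervals.
    static int[] collapsedPositions(int[][] intervals) {
        int[] positions = new int[intervals.length];
        int removed = 0; // characters deleted before the current interval
        for (int i = 0; i < intervals.length; i++) {
            positions[i] = intervals[i][0] - removed;
            removed += intervals[i][1] - intervals[i][0];
        }
        return positions;
    }

    // Delete all intervals from the text, back to front, so that the
    // offsets of the earlier intervals stay valid (same order as the test).
    static String deleteAll(String text, int[][] intervals) {
        StringBuilder sb = new StringBuilder(text);
        for (int i = intervals.length - 1; i >= 0; i--) {
            sb.delete(intervals[i][0], intervals[i][1]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[][] tags = {{0, 3}, {8, 11}, {16, 20}, {20, 24}};
        String html = "<p>Hello<p>World</p></p>";
        System.out.println(deleteAll(html, tags));                      // HelloWorld
        System.out.println(Arrays.toString(collapsedPositions(tags))); // [0, 5, 10, 10]
    }
}
```

The positions 0, 5, 10, 10 are the same values the four assertions in the unit test expect, so the second tag at (8, 11) should map to [5-5].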

reckart commented 3 years ago

Thanks for reporting.