BCDH / standOffConverter4DARIAH-Campus

This is a work in progress. Once complete, this course will be published on DARIAH-Campus
Creative Commons Zero v1.0 Universal

annotating choice elements #1

Open sinairusinek opened 4 months ago

sinairusinek commented 4 months ago

Using the SOC Python notebook, spaCy did a good job annotating the following phrase with entities: a letter from the King of <placeName type="gpe">Jerusalem</placeName>, i.e. <persName>John de Brienne</persName>

However, when we have a choice element that looks like this: a letter from the King of Jerusalem, i.e. John de <choice><sic>Brinn</sic><corr>Brienne</corr></choice>

(see tei-c choice)

When stripping the elements and printing the plain text, SOC printed "John de BrinBrienne". When it exported the XML at the end, it restored the proper choice structure. However, the entity John de Brienne was not annotated.

Any idea why?

I would expect the following result:

a letter from the King of <placeName type="gpe">Jerusalem</placeName>, i.e. <persName>John de <choice><sic>Brinn</sic><corr>Brienne</corr></choice></persName>

millawell commented 4 months ago

There are two parts to the answer:

  1. The standoff converter can exclude parts of the text based on the surrounding tags. There is an exclude_inside method (https://standoffconverter.readthedocs.io/en/latest/api.html#standoffconverter.View.exclude_inside) that excludes all text inside any of the specified tags. In your case, you probably do not want the <sic> part to appear in the output plain text:
    view = View(so).exclude_inside("{http://www.tei-c.org/ns/1.0}sic").shrink_whitespace()

Afterwards, the plain text looks better and spaCy also recognizes the entity as PERSON.
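To make the effect of excluding <sic> concrete, here is a minimal sketch that does not use the standoff converter at all. It only assumes Python's standard xml.etree.ElementTree and a namespace-free copy of the snippet from the question; the helper name plain_text is made up for this illustration and is not part of any library.

```python
import xml.etree.ElementTree as ET

# Namespace-free copy of the phrase from the question (an assumption for brevity).
SNIPPET = (
    "<p>a letter from the King of Jerusalem, i.e. John de "
    "<choice><sic>Brinn</sic><corr>Brienne</corr></choice></p>"
)


def plain_text(elem, excluded=()):
    """Concatenate text nodes, skipping everything inside excluded tags."""
    if elem.tag in excluded:
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(plain_text(child, excluded))
        # The tail follows the child's closing tag, so it is part of the
        # parent's text flow and is kept even when the child is excluded.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(SNIPPET)
with_sic = plain_text(root)
without_sic = plain_text(root, excluded={"sic"})

print(with_sic)     # <sic> and <corr> run together into one token
print(without_sic)  # only the corrected reading remains
```

With both readings concatenated, an NER model sees a single unknown token after "John de", which is a plausible reason the name was missed; with <sic> excluded, the corrected name is visible as ordinary text.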

  2. However, with depth=None, the call
    so.add_inline(
                begin=start_ind,
                end=end_ind,
                tag=tags_dict[label]['tag'],
                depth=None,
                attrib=tags_dict[label]['attr']
            )

    does not find a unique context (that's what your error message will print there). That is because from the original TEI we see:

    John de <choice><sic>brinn</sic><corr>Brienne</corr></choice>

    the first part ("John de ") is at a certain depth:

    [<Element {http://www.tei-c.org/ns/1.0}text at 0x39fc80940>,
    <Element {http://www.tei-c.org/ns/1.0}front at 0x16ca1b940>,
    <Element {http://www.tei-c.org/ns/1.0}div at 0x16bf22100>,
    <Element {http://www.tei-c.org/ns/1.0}p at 0x39fc1eec0>]

and "Brienne" is two levels deeper:

[<Element {http://www.tei-c.org/ns/1.0}text at 0x39fc80940>,
 <Element {http://www.tei-c.org/ns/1.0}front at 0x16ca1b940>,
 <Element {http://www.tei-c.org/ns/1.0}div at 0x16bf22100>,
 <Element {http://www.tei-c.org/ns/1.0}p at 0x39fc1eec0>,
 <Element {http://www.tei-c.org/ns/1.0}choice at 0x39faffbc0>,
 <Element {http://www.tei-c.org/ns/1.0}corr at 0x39faffd40>]

With depth=None, the standoff converter tries to add the tag at the deepest position, which fails because the result would no longer be a well-formed XML tree:

<persName>John de <choice><sic>brinn</sic><corr>Brienne</persName></corr></choice>

So in this particular case, it would be possible to add it explicitly at depth 4:

so.add_inline( ..., depth=4, ... )

But as a more general approach, we could do an add_span here.
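The depth argument above can be checked with a small standalone sketch. It uses only xml.etree.ElementTree on a namespace-free reduction of the TEI structure; the helpers text_chunks and common_depth are hypothetical names for this illustration, and whether the standoff converter counts depth in exactly this way is an assumption, although the numbers agree with the ancestor lists shown above.

```python
import xml.etree.ElementTree as ET

# Reduced, namespace-free version of the structure from the ancestor lists.
TEI = (
    "<text><front><div><p>John de "
    "<choice><sic>Brinn</sic><corr>Brienne</corr></choice>"
    "</p></div></front></text>"
)


def text_chunks(elem, ancestors=()):
    """Yield (text, ancestor element path) for each text node in document order."""
    path = ancestors + (elem,)
    if elem.text:
        yield elem.text, path
    for child in elem:
        yield from text_chunks(child, path)
        if child.tail:
            yield child.tail, path  # tail text sits at the parent's depth


def common_depth(paths):
    """Length of the longest ancestor prefix shared by all paths."""
    depth = 0
    for level in zip(*paths):
        if len({id(e) for e in level}) != 1:
            break
        depth += 1
    return depth


chunks = dict(text_chunks(ET.fromstring(TEI)))
first, last = chunks["John de "], chunks["Brienne"]

print(len(first), [e.tag for e in first])
print(len(last), [e.tag for e in last])
print(common_depth([first, last]))
```

The shared prefix ends at <p>, so a wrapping element can only be inserted as a child of <p>, around both the text and the whole <choice>. Inserting it any deeper would have to start outside <corr> and end inside it.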

ElectricFrogy commented 4 months ago

Code has been fixed according to your explanation.

    view = View(so).shrink_whitespace()

was replaced with

    view = View(so).exclude_inside("{http://www.tei-c.org/ns/1.0}sic").shrink_whitespace()

And this loop is now being used for the XML annotation:

    # Annotate the named entities in the XML content
    for i, ent in enumerate(doc.ents):
        # Map character offsets in the plain text back to standoff table positions
        start_ind = view.get_table_pos(ent.start_char)
        end_ind = view.get_table_pos(ent.end_char)
        label = ent.label_

        print(f'{i} {start_ind=}\t{end_ind=}\t{label=}')

        if label not in tags_dict:
            print(label, '- not in dictionary -> IGNORED')
            continue

        try:
            # Use the specified depth to avoid breaking the XML structure
            so.add_inline(
                begin=start_ind,
                end=end_ind,
                tag=tags_dict[label]['tag'],
                depth=4,  # Explicitly setting the depth to 4 as suggested
                attrib=tags_dict[label]['attr']
            )
        except Exception as e:
            print(e)
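For readers who have not seen the notebook, tags_dict is not shown in this thread. The sketch below is a guess at its shape, based only on the tags_dict[label]['tag'] and tags_dict[label]['attr'] lookups in the loop and on the <persName> and <placeName type="gpe"> output quoted in the question. The concrete entries, the namespaced tag names, and the lookup helper are assumptions for illustration.

```python
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical mapping from spaCy entity labels to TEI tags and attributes.
tags_dict = {
    "PERSON": {"tag": f"{TEI_NS}persName", "attr": {}},
    "GPE": {"tag": f"{TEI_NS}placeName", "attr": {"type": "gpe"}},
}


def lookup(label):
    """Return (tag, attrib) for a known label, or None if it should be ignored."""
    entry = tags_dict.get(label)
    if entry is None:
        return None
    return entry["tag"], entry["attr"]


print(lookup("GPE"))
print(lookup("DATE"))  # labels without an entry are skipped by the loop
```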