dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0

XMI output encoding is ASCII instead of UTF-8 #163

Closed pottieral closed 3 years ago

pottieral commented 3 years ago

Thank you for providing these tools to the community.

When using the to_xmi() function, the output is encoded in ASCII (which seems to be the default encoding of the underlying lxml module). From reading the docs and the xmi.py code in the repo, I understood that the output should be UTF-8 encoded. This would not necessarily be a huge problem, but I think the underlying offset functions rely on that assumption. As a workaround for now, I edited xmi.py by hand, adding an encoding option to the doc.write() call on line 320.
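
For readers who want to see the effect of the serializer's encoding setting in isolation, here is a minimal sketch. It uses Python's standard-library xml.etree.ElementTree as a stand-in, not lxml and not the actual cassis code; the `Sofa` element is a hypothetical miniature of the one in the report. The standard-library serializer also defaults to US-ASCII and accepts `encoding` and `xml_declaration` arguments.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the Sofa element from the report.
sofa = ET.Element("Sofa", {"sofaString": "tester à Paris."})

# Default encoding is US-ASCII: "à" is written as the character reference &#224;.
ascii_bytes = ET.tostring(sofa)

# Explicit UTF-8 plus a declaration: "à" is written as its UTF-8 bytes.
utf8_bytes = ET.tostring(sofa, encoding="UTF-8", xml_declaration=True)

print(ascii_bytes)
print(utf8_bytes)
```

Both byte strings describe the same attribute value; only the on-disk representation of the non-ASCII character and the declared encoding differ.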

Here is the code I used to generate a minimal example.

from cassis import Cas

# Assumes `typesystem` was loaded beforehand (see the attached TypeSystem.xml) and
# that create_nlp_pipeline and modele_choisi are defined elsewhere in the notebook.
cas = Cas(typesystem=typesystem)
NERType = cas.typesystem.get_type(
        "de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity")
sentencetype = cas.typesystem.get_type("de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence")
clean_list = ["Voici une phrase d'exemple un 1 800 700 euros.","Et une autre phrase pour tester à Paris."]
nlp = create_nlp_pipeline(modele_choisi)
current_i = 0    # offset of the current sentence within the full text
total_text = ""
for sent in clean_list:
    # One Sentence annotation per input sentence
    sent_annotation = sentencetype(begin=current_i+1,
                                 end=current_i+len(sent))
    cas.add_annotation(sent_annotation)
    doc = nlp(sent)
    for token in doc:
        # Entity offsets are relative to the sentence, so shift them by current_i
        ner_annotation = NERType(begin=current_i+token["start"],
                                 end=current_i+token["end"],
                                 value=token["entity_group"])
        cas.add_annotation(ner_annotation)
    current_i += len(sent)
    total_text += sent
print(total_text)
cas.sofa_string = total_text
cas.sofa_mime = "text"
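
As a side note on the offset arithmetic in the loop above: UIMA annotations use the offset of the first character as begin and the offset just past the last character as end. The following is a minimal sketch of that bookkeeping in plain Python, assuming the sentences are concatenated without any separator; it does not use cassis, and the `spans` list is purely illustrative.

```python
sentences = [
    "Voici une phrase d'exemple un 1 800 700 euros.",
    "Et une autre phrase pour tester à Paris.",
]

spans = []   # (begin, end) per sentence, with an exclusive end
offset = 0   # position of the current sentence in the concatenated text
for sent in sentences:
    spans.append((offset, offset + len(sent)))
    offset += len(sent)

text = "".join(sentences)

# Each span slices out exactly its own sentence.
for (begin, end), sent in zip(spans, sentences):
    assert text[begin:end] == sent
```

With this convention, the end of one sentence equals the begin of the next when no separator is inserted between them.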

Result of the cas.to_xmi() command: '<?xml version=\'1.0\' encoding=\'ASCII\'?>\n<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:cas="http:///uima/cas.ecore" xmlns:type="http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore" xmlns:type0="http:///de/tudarmstadt/ukp/dkpro/core/api/ner/type.ecore" xmi:version="2.0"><cas:NULL xmi:id="0"/><type:Sentence xmi:id="2" begin="1" end="46" sofa="1"/><type0:NamedEntity xmi:id="3" value="MONEY" begin="30" end="45" sofa="1"/><type:Sentence xmi:id="4" begin="47" end="86" sofa="1"/><type0:NamedEntity xmi:id="5" value="GPE" begin="80" end="85" sofa="1"/><cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="Voici une phrase d\'exemple un 1 800 700 euros.Et une autre phrase pour tester &#224; Paris."/><cas:View sofa="1" members="2 3 4 5"/></xmi:XMI>'

Expected behavior of the to_xmi() command: '<?xml version=\'1.0\' encoding=\'UTF-8\'?> (etc etc)

Attached file TypeSystem.xml as a .txt : TypeSystem-txt.txt

Currently using a Jupyter notebook on Dataiku for my testing script. Please excuse any errors or GitHub etiquette mistakes.

reckart commented 3 years ago

On my local dev version, I had no problem importing the XMI.

[Screenshot 2021-04-28 at 19 12 03]

I slightly adjusted the XML document you posted above, replacing escaped characters with their regular counterparts:

<?xml version='1.0' encoding='ASCII'?>
<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:cas="http:///uima/cas.ecore" xmlns:type="http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore" xmlns:type0="http:///de/tudarmstadt/ukp/dkpro/core/api/ner/type.ecore" xmi:version="2.0"><cas:NULL xmi:id="0"/><type:Sentence xmi:id="2" begin="1" end="46" sofa="1"/><type0:NamedEntity xmi:id="3" value="MONEY" begin="30" end="45" sofa="1"/><type:Sentence xmi:id="4" begin="47" end="86" sofa="1"/><type0:NamedEntity xmi:id="5" value="GPE" begin="80" end="85" sofa="1"/><cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="Voici une phrase d'exemple un 1 800 700 euros.Et une autre phrase pour tester &#224; Paris."/><cas:View sofa="1" members="2 3 4 5"/></xmi:XMI>
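
That the import succeeds is consistent with how XML treats numeric character references: once parsed, &#224; and a literal à are the same character, so the declared encoding should not shift any offsets. A minimal sketch of this, using Python's standard-library parser rather than the cassis loader and a hypothetical one-element document:

```python
import xml.etree.ElementTree as ET

# The same attribute value, once with a character reference and once literal.
escaped = '<Sofa sofaString="tester &#224; Paris."/>'
literal = '<Sofa sofaString="tester à Paris."/>'

a = ET.fromstring(escaped).get("sofaString")
b = ET.fromstring(literal).get("sofaString")

# Identical text after parsing, hence identical character offsets.
assert a == b
assert a.index("Paris") == b.index("Paris")
print(a)
```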

jcklie commented 3 years ago

I will change it so that cassis always writes UTF-8 in the header.

jcklie commented 3 years ago

I can reproduce that it sets the encoding to ASCII, but I cannot reproduce the escaping of the apostrophe in d'exemple.

[image]

Left is mine when running cassis locally, right is the one you posted above.

I do not know where that comes from. I will change cassis, though, so that it always writes a UTF-8 header.

pottieral commented 3 years ago

Thank you for your answers! Do not worry: the escaping only appears in the printed output of the to_xmi() command in the Jupyter notebook. It is not present when the file is exported to disk (example below).

<?xml version='1.0' encoding='UTF-8'?>
<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:cas="http:///uima/cas.ecore" xmlns:type="http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore" xmlns:custom="http:///webanno/custom.ecore" xmi:version="2.0">
  <cas:NULL xmi:id="0"/>
  <type:Sentence xmi:id="2" begin="0" end="36" sofa="1"/>
  <type:Sentence xmi:id="3" begin="36" end="43" sofa="1"/>
  <type:Sentence xmi:id="4" begin="43" end="79" sofa="1"/>
  <custom:NERMagbert xmi:id="5" entity="DATE" begin="71" end="74" sofa="1"/>
  <custom:NERMagbert xmi:id="6" entity="DATE" begin="74" end="76" sofa="1"/>
  <custom:NERMagbert xmi:id="7" entity="EVENT" begin="76" end="77" sofa="1"/>
(etc etc)
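
The backslash escaping discussed above matches how Python displays a string's repr, which is what a notebook echoes for a bare expression. When a string contains both single and double quotes, the repr is wrapped in single quotes and every inner single quote is shown as \'. A minimal sketch with a hypothetical two-line document standing in for the real XMI:

```python
# Hypothetical stand-in for the string returned by to_xmi().
xmi = "<?xml version='1.0' encoding='UTF-8'?>\n<doc text=\"d'exemple\"/>"

shown = repr(xmi)   # what a notebook echoes for a bare expression
print(shown)        # single quotes appear as \' and the newline as \n
print(xmi)          # the actual characters, with no backslashes
```

Calling print() on the string, or writing it to a file, gives the unescaped text.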