pottieral closed this 3 years ago
On my local dev version, I had no problem importing the XMI.
I slightly adjusted the XML document you posted above, replacing escaped characters with their regular counterparts:
<?xml version='1.0' encoding='ASCII'?>
<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:cas="http:///uima/cas.ecore" xmlns:type="http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore" xmlns:type0="http:///de/tudarmstadt/ukp/dkpro/core/api/ner/type.ecore" xmi:version="2.0"><cas:NULL xmi:id="0"/><type:Sentence xmi:id="2" begin="1" end="46" sofa="1"/><type0:NamedEntity xmi:id="3" value="MONEY" begin="30" end="45" sofa="1"/><type:Sentence xmi:id="4" begin="47" end="86" sofa="1"/><type0:NamedEntity xmi:id="5" value="GPE" begin="80" end="85" sofa="1"/><cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="Voici une phrase d'exemple un 1 800 700 euros.Et une autre phrase pour tester à Paris."/><cas:View sofa="1" members="2 3 4 5"/></xmi:XMI>
I will change it so that cassis always writes UTF-8 in the header.
I can reproduce that it sets the encoding to ASCII, but I cannot reproduce the escaping of the apostrophe in d'exemple.
Left is mine when running cassis locally, right is the one you posted above.
I do not know where that comes from. I will change cassis, though, so that it always writes a UTF-8 header.
Thank you for your answers! Do not worry, the escaping only appears in the printed output of the to_xmi() command in the Jupyter notebook; when the XMI is exported to disk, it is not present (example below).
<?xml version='1.0' encoding='UTF-8'?>
<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:cas="http:///uima/cas.ecore" xmlns:type="http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore" xmlns:custom="http:///webanno/custom.ecore" xmi:version="2.0">
<cas:NULL xmi:id="0"/>
<type:Sentence xmi:id="2" begin="0" end="36" sofa="1"/>
<type:Sentence xmi:id="3" begin="36" end="43" sofa="1"/>
<type:Sentence xmi:id="4" begin="43" end="79" sofa="1"/>
<custom:NERMagbert xmi:id="5" entity="DATE" begin="71" end="74" sofa="1"/>
<custom:NERMagbert xmi:id="6" entity="DATE" begin="74" end="76" sofa="1"/>
<custom:NERMagbert xmi:id="7" entity="EVENT" begin="76" end="77" sofa="1"/>
(etc etc)
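The backslash-escaped apostrophe seen in the notebook is consistent with how Python displays the `repr` of a string when a cell ends with a bare expression; the string itself contains no backslash. A minimal sketch, assuming the notebook cell simply evaluated `cas.to_xmi()` (the `xmi` string below is a shortened, hypothetical stand-in for its return value):

```python
# Shortened, hypothetical stand-in for the string returned by cas.to_xmi().
xmi = "<?xml version='1.0' encoding='UTF-8'?>\n<cas:Sofa sofaString=\"d'exemple\"/>"

# What a notebook shows for a bare expression: the repr, with \' and \n escapes.
displayed = repr(xmi)

# print() and file writes emit the raw characters instead.
print(displayed)
print(xmi)

# The backslashes exist only in the repr, not in the data.
has_backslash_in_repr = "\\'" in displayed
has_backslash_in_data = "\\" in xmi
```

Wrapping the call in `print(...)` in the notebook should therefore show the same text that ends up on disk.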
Thank you for providing these tools to the community.
When using the to_xmi() function, the output is encoded in ASCII (which seems to be the default encoding of the underlying lxml module). As I understood from reading the docs and the xmi.py code in the repo, the output should be UTF-8 encoded. This would not necessarily be a huge problem, but I think the underlying offset functions rely on that assumption. As a hack, I changed it by hand for now in the xmi.py script by adding an encoding option to the doc.write() call on line 320.
Here is the code I used to generate a minimal example.
Result of the cas.to_xmi() command:
'<?xml version=\'1.0\' encoding=\'ASCII\'?>\n<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" xmlns:cas="http:///uima/cas.ecore" xmlns:type="http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore" xmlns:type0="http:///de/tudarmstadt/ukp/dkpro/core/api/ner/type.ecore" xmi:version="2.0"><cas:NULL xmi:id="0"/><type:Sentence xmi:id="2" begin="1" end="46" sofa="1"/><type0:NamedEntity xmi:id="3" value="MONEY" begin="30" end="45" sofa="1"/><type:Sentence xmi:id="4" begin="47" end="86" sofa="1"/><type0:NamedEntity xmi:id="5" value="GPE" begin="80" end="85" sofa="1"/><cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="Voici une phrase d\'exemple un 1 800 700 euros.Et une autre phrase pour tester à Paris."/><cas:View sofa="1" members="2 3 4 5"/></xmi:XMI>'
Expected behavior of the to_xmi() command:
'<?xml version=\'1.0\' encoding=\'UTF-8\'?> (etc etc)
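The effect of passing an explicit encoding to the serializer, as in the hack described above, can be sketched with the standard library's `xml.etree.ElementTree`, used here only as a stand-in for lxml (an assumption: lxml's API is similar, but the exact spelling of the declared encoding may differ, and this is not the cassis code itself). The `xml_declaration` argument of `tostring` requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Tiny element with a non-ASCII attribute value, loosely modelled on the Sofa above.
sofa = ET.Element("Sofa", {"sofaString": "pour tester à Paris."})

# No encoding given: the serializer falls back to an ASCII-family encoding
# and writes non-ASCII characters as numeric character references.
default_bytes = ET.tostring(sofa, xml_declaration=True)

# Explicit encoding: the declaration names UTF-8 and "à" is written as UTF-8 bytes.
utf8_bytes = ET.tostring(sofa, encoding="UTF-8", xml_declaration=True)

print(default_bytes.decode("ascii"))
print(utf8_bytes.decode("utf-8"))
```

Both outputs are valid XML that parses back to the same text; the difference lies in the declared encoding and in how non-ASCII characters are represented in the bytes.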
Attached file: TypeSystem.xml, uploaded as a .txt file (TypeSystem-txt.txt).
I am currently using a Jupyter notebook on Dataiku for my testing script. Please excuse any errors or GitHub etiquette mistakes.