dkpro / dkpro-cassis

UIMA CAS processing library written in Python
https://pypi.org/project/dkpro-cassis/
Apache License 2.0

Error in cassis.xmi.load_cas_from_xmi when deserializing large XMI files #153

Closed ArneDefauw closed 3 years ago

ArneDefauw commented 3 years ago

Describe the bug: Cannot deserialize a large XMI file created using UIMA.

To Reproduce: Steps to reproduce the behavior:

Try to deserialize the following XMI file:
https://drive.google.com/file/d/1WZS3Ep67O7BluLBd4NANrQXKanmNakYk/view?usp=sharing
with this type system:
https://drive.google.com/file/d/1hJVC9wepQAoYhMteEaXMPFnhQX2OZU0I/view?usp=sharing

via:

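from cassis.typesystem import load_typesystem
from cassis.xmi import load_cas_from_xmi

# Load the type system that describes the annotations in the XMI file.
with open("typesystem.xml", "rb") as f:
    TYPESYSTEM = load_typesystem(f)

# Deserialize the large XMI file; this is the call that raises the error.
with open("large_file.xmi", "rb") as f:
    cas = load_cas_from_xmi(f, typesystem=TYPESYSTEM)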

I get the following error message:

File "/miniconda/lib/python3.7/site-packages/cassis/xmi.py", line 42, in load_cas_from_xmi
    return deserializer.deserialize(source, typesystem=typesystem, lenient=lenient)

File "/miniconda/lib/python3.7/site-packages/cassis/xmi.py", line 75, in deserialize
    for event, elem in context:

File "src/lxml/iterparse.pxi", line 209, in lxml.etree.iterparse.__next__

File "src/lxml/iterparse.pxi", line 194, in lxml.etree.iterparse.__next__

File "src/lxml/iterparse.pxi", line 229, in lxml.etree.iterparse._read_more_events

File "src/lxml/parser.pxi", line 1384, in lxml.etree._FeedParser.feed

File "src/lxml/parser.pxi", line 606, in lxml.etree._ParserContext._handleParseResult

File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc

File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult

File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError

File ".../large.xmi", line 85622 <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="<div id="text" class="panel-body"> <div id="textTabContent"> <div id=&q

...

XMLSyntaxError: internal error: Huge input lookup, line 85622, column 5

The error is probably caused by lxml's default protection against huge inputs, as discussed in https://stackoverflow.com/questions/48984325/lxml-etree-xmlsyntaxerror-internal-error-huge-input-lookup and https://stackoverflow.com/questions/11850345/using-python-lxml-etree-for-huge-xml-files

This affects line 69 in https://github.com/dkpro/dkpro-cassis/blob/master/cassis/xmi.py:

context = etree.iterparse(source, events=("start", "end"))

jcklie commented 3 years ago

Thanks for reporting! I added an option to load_cas_from_xmi that you can use, see https://github.com/dkpro/dkpro-cassis#large-xmi-files . It is in master and will be in the next release.
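
For readers who hit the same lxml error outside cassis, here is a minimal sketch of the general workaround discussed in the linked Stack Overflow threads: passing `huge_tree=True` to `etree.iterparse`. This is not the cassis implementation. The helper `iter_xmi_events` and its `trusted` parameter are hypothetical names used only for illustration, and the tiny inline document stands in for a real XMI file.

```python
import io

from lxml import etree


def iter_xmi_events(source, trusted=False):
    """Yield (event, tag) pairs from an XML stream.

    `trusted` is a hypothetical flag for this sketch. When True, lxml's
    `huge_tree` option is enabled, which disables the parser's security
    restrictions on very long text content and very deep trees.
    Only enable it for input you trust.
    """
    context = etree.iterparse(source, events=("start", "end"), huge_tree=trusted)
    for event, elem in context:
        yield event, elem.tag


# A small stand-in document; a real XMI file would be opened in binary mode.
xml = b'<root><Sofa sofaString="some text"/></root>'
events = list(iter_xmi_events(io.BytesIO(xml), trusted=True))
print(events)
```

Keeping the flag off by default preserves lxml's protection against maliciously large documents. Callers opt in only for files they produced themselves, such as XMI exported from their own UIMA pipeline.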

jcklie commented 3 years ago

I released a new version with the fix.