computerline1z / okapi

Automatically exported from code.google.com/p/okapi

o:gfxdata tag results in loop taking too long to process (char by char) #351

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
From email:

> I have a DOCX file which has an o:gfxdata tag. The OpenXMLContentFilter is caught in an infinite loop in the run method of the thread created in the combineRepeatedFormat method. The read keeps returning the same cbuf value.
> 
>   public void run()
>         {
>           try
>           {
>             while((n=br.read(cbuf,0,512))!=-1)
>             {
>               for(i=0;i<n;i++)
>               {
>                   handleOneChar(cbuf[i]);
>               }
>             }
> 
> The document I am processing cannot be released to the public, so I am trying to create an example for this group.

Original issue reported on code.google.com by mrh...@gmail.com on 11 Jul 2013 at 6:42

GoogleCodeExporter commented 9 years ago
I tried this with 0.22-SNAPSHOT

Original comment by mrh...@gmail.com on 11 Jul 2013 at 6:42

GoogleCodeExporter commented 9 years ago
The issue is not an endless loop. In fact, this document takes 17 min to 
process because all the data in the image gets processed one character at a 
time.
My original document had multiple documents. I left my original document 
running all night and it still had not finished extracting.

Perhaps some work needs to be put into the handling of o:gfxdata tags, so that 
the data portion is skipped rather than processed one character at a time. 
This would speed up extraction.

Original comment by mrh...@gmail.com on 11 Jul 2013 at 6:45
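The skipping idea suggested above can be sketched as follows. This is only an illustration, not code from Okapi: it assumes the bulk data sits inside a quoted attribute value and that the text is available as a `String`, whereas the real filter reads from a stream through `handleOneChar` and may be structured quite differently. The class `AttributeSkipper` and the method `skipQuotedValue` are hypothetical names, and the sample tag is made up.

```java
// Hypothetical helper, not part of Okapi: jumps over a quoted value in one
// indexOf call instead of visiting every character of the data.
class AttributeSkipper {

    /**
     * Given the index of an opening quote character in text, returns the index
     * just past the matching closing quote, or text.length() if the value is
     * never closed.
     */
    static int skipQuotedValue(String text, int openQuoteIndex) {
        char quote = text.charAt(openQuoteIndex);             // ' or "
        int close = text.indexOf(quote, openQuoteIndex + 1);  // single scan
        return close < 0 ? text.length() : close + 1;
    }

    public static void main(String[] args) {
        // Made-up sample; a real o:gfxdata value would be far longer.
        String tag = "<v:shape o:gfxdata=\"UEsDBBQABgAI\" id=\"s1\">";
        int open = tag.indexOf('"');
        int next = skipQuotedValue(tag, open);
        System.out.println(tag.substring(next)); // prints: id="s1"> (with a leading space)
    }
}
```

Whether such a shortcut is safe depends on how the filter later reconstructs the skipped data for merging, which this sketch does not address.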

GoogleCodeExporter commented 9 years ago
Correction: "My original document had multiple documents" should read "My 
original document had multiple images (30+)".

Original comment by mrh...@gmail.com on 11 Jul 2013 at 6:55

GoogleCodeExporter commented 9 years ago

Original comment by yves.sav...@gmail.com on 17 Aug 2013 at 1:17

GoogleCodeExporter commented 9 years ago
Running in tikal (tikal.sh -fc okf_openxml -x neverending.docx) only takes me 
36s, but 90+% of the time is spent in the method mrhcon identified.  At least 
half of that is OpenXMLContentFilter line 383:
>                       curtag = curtag + c;

That's a simple "use a StringBuilder instead" problem.  I will work up a patch.

Original comment by tingley on 14 Nov 2013 at 12:39
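To illustrate why the `curtag = curtag + c` pattern is costly, here is a minimal sketch comparing it with a `StringBuilder`. It is not the actual patch: the class `TagAccumulation`, both method names, and the 50,000-character input are assumptions made for the example. Each `+` on a `String` copies everything accumulated so far, so building an n-character string that way does work proportional to n squared, while `StringBuilder.append` is amortized constant time per character.

```java
import java.util.Arrays;

// Hypothetical demo class, not taken from OpenXMLContentFilter.
class TagAccumulation {

    // Quadratic: every iteration allocates a new String and copies the old one.
    static String accumulateWithConcat(char[] input) {
        String curtag = "";
        for (char c : input) {
            curtag = curtag + c;
        }
        return curtag;
    }

    // Linear: the builder grows its internal buffer only occasionally.
    static String accumulateWithBuilder(char[] input) {
        StringBuilder curtag = new StringBuilder(input.length);
        for (char c : input) {
            curtag.append(c);
        }
        return curtag.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a long run of embedded image data (size is arbitrary).
        char[] data = new char[50_000];
        Arrays.fill(data, 'A');

        long t0 = System.nanoTime();
        String a = accumulateWithConcat(data);
        long t1 = System.nanoTime();
        String b = accumulateWithBuilder(data);
        long t2 = System.nanoTime();

        System.out.println("same result: " + a.equals(b));
        System.out.printf("concat: %d ms, builder: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

Both methods return the same string; only the time taken differs, and the gap widens as the input grows.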

GoogleCodeExporter commented 9 years ago
Fixed on dev, commit 11cb1ffdaf4bc2eb2fb383feb24af4c467658c16.

A roundtrip of this file (filter + merge) went from about 50 seconds to under 
2 seconds on my machine.

Original comment by tingley on 14 Nov 2013 at 4:37

GoogleCodeExporter commented 9 years ago
Great! Thanks.

Original comment by yves.sav...@gmail.com on 14 Nov 2013 at 5:01