No text extraction for powerpoint slides when 'squishable' is not set.

computerline1z / okapi

Automatically exported from code.google.com/p/okapi

No text extraction for powerpoint slides when 'squishable' is not set. #319

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

1. Create a new PPTX file.
2. Use the OpenXMLFilter to display all the text units.

With okapi-lib 0.19, the only text units displayed are those contained in the 
slide master, not those from the regular slides.
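
To confirm independently of Okapi that the regular slides really contain text, the slide parts can be read straight from the PPTX zip. The following is a minimal sketch using only the JDK; the class name `PptxTextDump` and the method `textRuns` are made up for illustration, and it assumes slide text sits in DrawingML `<a:t>` elements under `ppt/slides/`.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Okapi: dumps the text runs of each slide part.
final class PptxTextDump {
    // DrawingML main namespace; slide text runs are stored in <a:t> elements.
    static final String DRAWINGML_NS =
            "http://schemas.openxmlformats.org/drawingml/2006/main";

    /** Returns the text content of every <a:t> element in one slide part. */
    static List<String> textRuns(InputStream slideXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        NodeList runs = factory.newDocumentBuilder().parse(slideXml)
                .getElementsByTagNameNS(DRAWINGML_NS, "t");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < runs.getLength(); i++) {
            out.add(runs.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the PPTX file to inspect.
        try (ZipFile zip = new ZipFile(args[0])) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                // Slide parts only; the .rels files under _rels/ do not end in .xml.
                if (name.startsWith("ppt/slides/") && name.endsWith(".xml")) {
                    try (InputStream in = zip.getInputStream(entry)) {
                        System.out.println(name + " -> " + textRuns(in));
                    }
                }
            }
        }
    }
}
```

If this prints text for the files under `ppt/slides/` while the filter produces no text units for them, the content is there and the issue lies in how the parts are classified.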

While investigating, I saw that slide parts with the content type "slide+xml" are read as 
Document_Part and not as subdocuments. In the source file 
net/sf/okapi/filters/openxml/OpenXMLFilter.java, I think around line 649, parts of type 
"notesSlide+xml" and "slideMaster+xml" are translated but not "slide+xml".

Original issue reported on code.google.com by aurelien...@gmail.com on 22 Mar 2013 at 10:53

GoogleCodeExporter commented 9 years ago

I can't reproduce the problem.
I've tried with a PPTX file with 'normal slides' and they get extracted.
The "slide+xml" type for those files seems to be handled in line 640.
If you could post an example file where the problem occurs, it would help.
Thanks,
-yves

Original comment by yves.sav...@gmail.com on 22 Mar 2013 at 12:00

Added labels: Component-OOXMLFilter
Removed labels: Component-

GoogleCodeExporter commented 9 years ago

I tried with this PPTX file found on the internet.
The OpenXMLFilter opens the zip file and correctly reads 
[Content_Types].xml and the file /ppt/slideMasters/slideMaster1.xml, but all 
the files in /ppt/slides are treated as "Document part" and are not read.
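
To see which content type each part declares, the `Override` entries of `[Content_Types].xml` can be listed directly. This is a rough sketch with JDK classes only; `ContentTypesDump` and `overrides` are hypothetical names, not Okapi API, and the slide content type constant is the standard PresentationML one.

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Okapi: lists part names and their content types.
final class ContentTypesDump {
    // Namespace of the OPC content-types stream.
    static final String CT_NS =
            "http://schemas.openxmlformats.org/package/2006/content-types";
    // Content type of a regular slide part (the "slide+xml" discussed here).
    static final String SLIDE_TYPE =
            "application/vnd.openxmlformats-officedocument.presentationml.slide+xml";

    /** Maps each Override PartName to its ContentType, in document order. */
    static Map<String, String> overrides(InputStream contentTypesXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        NodeList nodes = factory.newDocumentBuilder().parse(contentTypesXml)
                .getElementsByTagNameNS(CT_NS, "Override");
        Map<String, String> out = new LinkedHashMap<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            out.put(e.getAttribute("PartName"), e.getAttribute("ContentType"));
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the PPTX file to inspect.
        try (ZipFile zip = new ZipFile(args[0])) {
            ZipEntry entry = zip.getEntry("[Content_Types].xml");
            if (entry == null) {
                System.err.println("No [Content_Types].xml found");
                return;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                for (Map.Entry<String, String> e : overrides(in).entrySet()) {
                    // Flag regular slides so they are easy to spot in the listing.
                    String flag = SLIDE_TYPE.equals(e.getValue()) ? "  <-- slide" : "";
                    System.out.println(e.getKey() + " : " + e.getValue() + flag);
                }
            }
        }
    }
}
```

Note that a case-sensitive `endsWith("slide+xml")` check would not match "notesSlide+xml" or "slideMaster+xml", since both have an uppercase letter in that position, so the three types are distinct strings.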

PS: I tried with okapi-lib v0.19.
http://code.google.com/p/okapi/source/browse/okapi/filters/openxml/src/main/java/net/sf/okapi/filters/openxml/OpenXMLFilter.java?name=m19
In this version of the file, I can't see the handler for the "slide+xml" type.

Original comment by aurelien...@gmail.com on 22 Mar 2013 at 1:05

Attachments:

resume-cover-letter-preparation-2011.pptx

GoogleCodeExporter commented 9 years ago

> PS: I tried with okapi-lib v0.19.
> In this version of the file, I can't see the handler for the "slide+xml" type.

Line 606 in that file.

Thanks for the example file. I'll try it.

Original comment by yves.sav...@gmail.com on 22 Mar 2013 at 1:38

GoogleCodeExporter commented 9 years ago

Maybe we fixed something since M19, but M21-snapshot seems to be extracting 
that file properly (see pseudo-translated output).
I haven't tried with M20 (which is the current release).

Original comment by yves.sav...@gmail.com on 22 Mar 2013 at 1:45

Attachments:

test319_1.out.pptx

GoogleCodeExporter commented 9 years ago

Thanks!
In fact, if I set the boolean "bSquishable" to true, line 606 is 
reached, but if I set it to false, line 606 is never 
reached. So I don't know whether this counts as a bug in this version...
Thanks for the help
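
The behaviour described here can be modelled with a toy classifier. This is a hypothetical sketch, not the actual OpenXMLFilter code: the class, enum, and method names are invented, and the real condition around line 606 may look quite different. It only shows how tying the slide branch to an unrelated flag yields the reported symptom.

```java
// Toy model of the reported symptom. Hypothetical names throughout;
// this is NOT the real OpenXMLFilter source.
final class SquishableGuardSketch {
    enum PartKind { SUBDOCUMENT, DOCUMENT_PART }

    static final String PREFIX =
            "application/vnd.openxmlformats-officedocument.presentationml.";
    static final String SLIDE = PREFIX + "slide+xml";
    static final String SLIDE_MASTER = PREFIX + "slideMaster+xml";
    static final String NOTES_SLIDE = PREFIX + "notesSlide+xml";

    /**
     * Classifies a part by content type. The slide branch is (by assumption)
     * guarded by the squishable flag, so it is unreachable when the flag is false.
     */
    static PartKind classify(String contentType, boolean squishable) {
        if (contentType.equals(SLIDE_MASTER) || contentType.equals(NOTES_SLIDE)) {
            return PartKind.SUBDOCUMENT;
        }
        if (squishable && contentType.equals(SLIDE)) {
            return PartKind.SUBDOCUMENT;
        }
        // Anything else is passed through without text extraction.
        return PartKind.DOCUMENT_PART;
    }

    public static void main(String[] args) {
        System.out.println("slide, squishable=true:  " + classify(SLIDE, true));
        System.out.println("slide, squishable=false: " + classify(SLIDE, false));
        System.out.println("master, squishable=false: " + classify(SLIDE_MASTER, false));
    }
}
```

In this model the slide master is extracted regardless of the flag, while regular slides fall through to the document-part case as soon as the flag is false, which matches what was observed with okapi-lib 0.19.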

Original comment by aurelien...@gmail.com on 22 Mar 2013 at 1:48

GoogleCodeExporter commented 9 years ago

Mmm... I'm not sure why an option about optimizing the text runs is tested there.
That variable also seems to be set to true everywhere.
It looks like there is something fishy about this part of the code.
I'll keep the issue open for now.
Thanks for the input/feedback.
-ys

Original comment by yves.sav...@gmail.com on 22 Mar 2013 at 1:58

Changed title: No text extraction for powerpoint slides when 'squishable' is not set.

GoogleCodeExporter commented 9 years ago

I changed the bSquishable test.

Original comment by twbgaze...@gmail.com on 24 Jul 2013 at 7:24

Changed state: Fixed