documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.com/docsplit/

Bug where strange text is being overlaid on extracted image (pptx to png) #69

Open avlakin opened 11 years ago

avlakin commented 11 years ago

I am having an issue with docsplit adding strange text when converting some PPTX files to PNGs.

It appears that some text is being superimposed on top of some images inside the slides (always at the top left corner).

Using LibreOffice 3.5.

Really appreciate any help on this!

knowtheory commented 11 years ago

That is pretty interesting! It's going to be an issue with the way that OpenOffice or LibreOffice are converting your pptx's to pdf prior to extracting the images.

Are you using OpenOffice or LibreOffice?
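One way to check whether the stray text really comes from the office-to-PDF step, rather than from image extraction, is to run that conversion by hand and look at the resulting PDF. The sketch below only builds the command line and does not execute anything. It assumes a `soffice` launcher on the PATH that accepts the `--headless`, `--convert-to` and `--outdir` options; the helper name `build_pdf_command` is hypothetical and not part of docsplit.

```python
import shlex
from pathlib import Path


def build_pdf_command(source, outdir, binary="soffice"):
    """Return the argv list for a headless office-to-PDF conversion.

    Assumption: `binary` is a LibreOffice/OpenOffice launcher that supports
    --headless, --convert-to and --outdir. Nothing is executed here.
    """
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(Path(outdir)),
        str(Path(source)),
    ]


if __name__ == "__main__":
    cmd = build_pdf_command("singlepc.pptx", "out")
    # Print a copy-pasteable shell command; run it manually, then open the PDF.
    print(" ".join(shlex.quote(part) for part in cmd))
```

If the overlay is already visible in the PDF produced this way, the problem sits in the office suite and not in docsplit's image extraction.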

hderms commented 11 years ago

Ted, do you think that upgrading LibreOffice will have an effect on this bug? I'd like to spend some effort trying to solve it.

Also, have there been any reports of regular PPTs causing this behavior? In my experience it was one PPTX which exhibited symptoms.

On Mar 25, 2013 10:20 AM, "Ted Han" notifications@github.com wrote:

> That is pretty interesting! It's going to be an issue with the way that OpenOffice or LibreOffice are converting your pptx's to pdf prior to extracting the images.
>
> Are you using OpenOffice or LibreOffice?


avlakin commented 11 years ago

@knowtheory I'm pretty sure I'm using OpenOffice because I see the "soffice.bin" process run when it starts converting. Also, I just realized my gem is outdated ("Docsplit version 0.6.3"), so I will try it with the updated gem.

Is there an easy way to force Docsplit to use LibreOffice?

I haven't noticed this occurring with regular PPTs.

Edit: I ran "soffice --version" and got "LibreOffice 3.5", so I guess I'm using LibreOffice.

Edit2: Here is the PPTX page that's causing the problem. When it is converted to PPT, the issue no longer occurs.

https://s3.amazonaws.com/alakincld/tmp/singlepc.pptx

Edit3: Updating to LibreOffice 4 resolved the issue. I also had to use Senner's pull request for issue 68 to make sure the pdf_extractor can find the updated LibreOffice install.
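For anyone hitting the same overlay, a quick check of the installed version may help before digging further. The following is a minimal sketch that parses the text printed by `soffice --version` (for example "LibreOffice 3.5") and flags versions older than 4. The function names are hypothetical, and the `(4, 0)` threshold reflects only the report in this thread that LibreOffice 4 no longer showed the problem, not an official fix version.

```python
import re


def parse_office_version(version_output):
    """Extract (major, minor) from text such as 'LibreOffice 3.5'.

    Returns None when no 'LibreOffice <major>.<minor>' pattern is found.
    """
    match = re.search(r"LibreOffice\s+(\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def predates_fix(version_output, fixed_in=(4, 0)):
    """True if the reported version is older than `fixed_in`.

    Assumption: (4, 0) is taken from this thread's report only.
    """
    version = parse_office_version(version_output)
    return version is not None and version < fixed_in
```

The input string can come from running `soffice --version` yourself, or from `subprocess.run(["soffice", "--version"], capture_output=True, text=True).stdout` if `soffice` is on the PATH.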