Raw content of pdf after conversion from pdf to docx

documents4j / documents4j

documents4j is a Java library for converting documents into another document format

http://documents4j.com

Apache License 2.0


Raw content of pdf after conversion from pdf to docx #78

Open pawelkrysaplagiat opened 4 years ago

pawelkrysaplagiat commented 4 years ago

Hello,

I am trying to convert a PDF file to DOCX, but as a result I get the raw PDF content.

%PDF-1.5 %µµµµ 1 0 obj <</Type/Catalog/Pages 2 0 R/Lang(uk-UA) /StructTreeRoot 63 0 R/MarkInfo<</Marked true>>>> endobj 2 0 obj <</Type/Pages/Count 8/Kids[ 3 0 R 36 0 R 38 0 R 40 0 R 42 0 R 44 0 R 53 0 R 55 0 R] >> endobj 3 0 obj <</Type/Page/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 10 0 R/F3 12 0 R/F4 15 0 R/F5 20 0 R/F6 22 0 R/F7 27 0 R/F8 32 0 R/F9 34 0 R>>/XObject<</Image14 14 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 595.32 841.92] /Contents 4 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 0>> endobj 4 0 obj <</Filter/FlateDecode/Length 3862>>

I use server-standalone 1.1.2, client 1.1.2 and Office 2013. Conversion from DOCX to PDF works correctly.
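One quick way to confirm this symptom is to look at the first bytes of the output file: a PDF starts with `%PDF-`, while a real DOCX is a ZIP container and starts with `PK\x03\x04`. The following is a minimal sketch of such a check. `OutputSniffer` and its methods are hypothetical helper names and are not part of documents4j; the stream read assumes Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical diagnostic helper, not part of documents4j.
final class OutputSniffer {

    // Classifies content by its leading bytes:
    // "pdf" for a PDF header, "zip" for a ZIP container (DOCX), else "unknown".
    static String sniff(byte[] head) {
        if (startsWith(head, new byte[] {'%', 'P', 'D', 'F', '-'})) {
            return "pdf";
        }
        if (startsWith(head, new byte[] {'P', 'K', 3, 4})) {
            return "zip";
        }
        return "unknown";
    }

    // Reads up to 8 leading bytes of the file and classifies them.
    static String sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[8];
            int n = in.readNBytes(head, 0, head.length); // Java 9+
            return sniff(java.util.Arrays.copyOf(head, n));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(sniff(Paths.get(args[0])));
        }
    }
}
```

If running this against the converted file prints `pdf`, the converter handed back the input bytes unchanged rather than a converted document.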

sorcio46 commented 4 years ago

Same issue here while trying to convert from DocumentType PDF to DocumentType DOCX/HTML.

Using version 1.1.1 taken from mvn repository.

raphw commented 4 years ago

Does it work with a previous version? I assume that the conversion script is not behaving as expected. You can have a look at the VBS script in the ms-word-bridge, execute it and see if you can adjust it to make it work.

harrywynn commented 4 years ago

I'm seeing the same issue here as well, going from PDF to DOCX or TXT. Any updates or solutions/workarounds? I've tried with both 1.1.1 and 1.0.3, input/output is attached.

test.pdf out.docx out.txt

raphw commented 4 years ago

I'd recommend trying to run the script as suggested before. If you find a solution by tweaking the arguments that are passed to MS Word, I am happy to merge that.

harrywynn commented 4 years ago

@raphw the script isn't the issue. I can run it manually from the command line and convert files without any problems. What it looks like is happening is that the PDF is ending up partially corrupted once it hits the server, so Word is choking when trying to convert it to other formats. I wasn't able to determine why that was happening after looking at it for a number of hours today.
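To test the corruption theory, one option is to compare a checksum of the original PDF with a checksum of the copy that arrives on the server. The sketch below assumes you can get hold of that server-side copy somehow (how to capture it is left open). `TransferCheck` is a hypothetical helper and not part of documents4j.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical diagnostic helper, not part of documents4j.
final class TransferCheck {

    // Returns the SHA-256 digest of the data as a lowercase hex string.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // True if both files have the same SHA-256 digest.
    static boolean sameContent(Path a, Path b) throws IOException {
        return sha256Hex(Files.readAllBytes(a)).equals(sha256Hex(Files.readAllBytes(b)));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // args[0]: original PDF, args[1]: copy captured on the server side
            System.out.println(sameContent(Paths.get(args[0]), Paths.get(args[1])));
        }
    }
}
```

A result of `false` would support the idea that the bytes change in transit. A result of `true` would point back at the conversion step instead.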

raphw commented 4 years ago

That's strange, since all code apart from the bridge is fully unaware of the data being transported. Does the same happen when using the LocalConverter?