Open CostinelDumitrescu opened 6 years ago
Did you try the conversion in Word directly, without documents4j? If you get the same result, this is most likely a Word issue and has nothing to do with documents4j.
What do you mean by directly in Word?
If you open Word and convert the document manually, what does it look like? documents4j merely automates that process.
If I open that test.pdf file with MS Word and then SAVE_AS test.docx, it looks fine.
It seems like the way that the PDF file is opened does not detect it as a PDF but parses its raw format.
You can try to build documents4j and play with the parameters set in https://github.com/documents4j/documents4j/blob/master/documents4j-transformer-msoffice/documents4j-transformer-msoffice-word/src/main/resources/word_convert.vbs to see if there is another option in VBS that solves this issue.
Cross-posted to Stack Overflow.
Thank you very much for the help. I'm planning to use this on a Linux platform. My question is, can LibreOffice be used instead of MS Word? Is there any bridge to LibreOffice already written? Thank you again.
As of today, there is not. But you are welcome to contribute an implementation. It should not be too hard to do.
Hello there, thanks for this awesome library!
I have the same problem as @CostinelDumitrescu: the PDF to DOCX conversion is giving me some weird characters... Did you find any fix for this?
Encountered the same issue while transforming a PDF to DOCX. Upon debugging the code, I noticed peculiar behavior.
When the source is an InputStream, a temporary file is generated from the source before conversion, but without an extension (i.e., without the .pdf suffix). During the execution of the VBS script, if this file lacks the extension, it generates a DOCX with strange characters. However, manipulating it with the extension results in a correctly generated DOCX.
converter
    .convert(inputStream, false)
    .as(DocumentType.PDF)
    .to(os)
    .as(DocumentType.DOCX)
    .execute();
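One possible workaround, following the observation above, is to copy the stream into a temporary file that does carry the .pdf suffix and hand that file to the converter instead of the stream. The sketch below uses only the JDK; the class name `PdfTempFiles`, the method name `copyToPdfTempFile`, and the temp-file prefix are hypothetical. It also assumes that a file-based source reaches the VBS script under its own name, which I have not verified against the documents4j source.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of documents4j.
final class PdfTempFiles {

    private PdfTempFiles() {
    }

    // Copies the stream into a new temporary file whose name ends in ".pdf"
    // and closes the stream afterwards. The caller deletes the file when done.
    static Path copyToPdfTempFile(InputStream source) throws IOException {
        Path tempFile = Files.createTempFile("documents4j-input-", ".pdf");
        try (InputStream in = source) {
            // createTempFile already created an empty file, so overwrite it.
            Files.copy(in, tempFile, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // Do not leave a partial file behind if the copy fails.
            Files.deleteIfExists(tempFile);
            throw e;
        }
        return tempFile;
    }
}
```

With this in place, the chain above would start with `converter.convert(tempFile.toFile())` rather than `converter.convert(inputStream, false)`, and the temporary file can be removed with `Files.deleteIfExists(tempFile)` once `execute()` returns.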
(edited to add simple Java code sample)
I'm using this code to convert a PDF file to a DOCX file. The conversion completes without errors, but the DOCX file contains only weird characters (see below). The PDF has only 1 page, while the resulting DOCX has more than 500 pages of such characters. I ran the test on a lot of PDF files and the result is the same.
The conversion from DOCX to PDF is fine!
public static void main(String[] args) throws Exception {
    InputStream in = new BufferedInputStream(new FileInputStream("D:\\JURIS\\test_folder\\pdf\\sample_test.pdf"));
    ByteArrayOutputStream bo = new ByteArrayOutputStream();
Snippet from DOCX file :
%PDF-1.7 %µµµµ 1 0 obj <</Type/Catalog/Pages 2 0 R/Lang(en-US) /StructTreeRoot 23 0 R/MarkInfo<</Marked true>>/Metadata 55 0 R/ViewerPreferences 56 0 R>> endobj 2 0 obj <</Type/Pages/Count 1/Kids[ 3 0 R] >> endobj 3 0 obj <</Type/Page/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 9 0 R/F3 11 0 R/F4 13 0 R/F5 15 0 R>>/ExtGState<</GS7 7 0 R/GS8 8 0 R>>/XObject<</Image20 20 0 R/Image21 21 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 595.32 841.92] /Contents 4 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 0>> endobj 4 0 obj <</Filter/FlateDecode/Length 1200>> stream xœ½XÛnÛH
Can anyone please help me with this ASAP?