Conversion from PDF to DOCX results in weird characters! I'm using documents4j

CostinelDumitrescu commented 6 years ago

I'm using this code to convert a PDF file to a DOCX file. The conversion succeeds, but the DOCX file contains only weird characters (see below). The PDF has only 1 page, but the resulting DOCX has more than 500 pages of such characters. I ran the test on many PDF files and the result is the same.

The conversion from DOCX to PDF is fine!

public static void main(String[] args) throws Exception {
    InputStream in = new BufferedInputStream(new FileInputStream("D:\\JURIS\\test_folder\\pdf\\sample_test.pdf"));
    ByteArrayOutputStream bo = new ByteArrayOutputStream();

    IConverter converter = LocalConverter   .builder()
                                            .baseFolder(new File("D:\\JURIS\\test_folder\\pdf\\"))
                                            .workerPool(20, 25, 2, TimeUnit.SECONDS)
                                            .processTimeout(5, TimeUnit.SECONDS)
                                            .build();

    Future<Boolean> conversion = converter  .convert(in).as(DocumentType.PDF)
                                            .to(bo).as(DocumentType.DOCX)
                                            .prioritizeWith(1000)
                                            .schedule();
    conversion.get();
    try (OutputStream outputStream = new FileOutputStream("D:\\JURIS\\test_folder\\pdf\\sample_test.docx")) {
        bo.writeTo(outputStream);
    }
    in.close();
    bo.close();

    converter.shutDown();
}

Snippet from the DOCX file:

%PDF-1.7 %µµµµ 1 0 obj <</Type/Catalog/Pages 2 0 R/Lang(en-US) /StructTreeRoot 23 0 R/MarkInfo<</Marked true>>/Metadata 55 0 R/ViewerPreferences 56 0 R>> endobj 2 0 obj <</Type/Pages/Count 1/Kids[ 3 0 R] >> endobj 3 0 obj <</Type/Page/Parent 2 0 R/Resources<</Font<</F1 5 0 R/F2 9 0 R/F3 11 0 R/F4 13 0 R/F5 15 0 R>>/ExtGState<</GS7 7 0 R/GS8 8 0 R>>/XObject<</Image20 20 0 R/Image21 21 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI] >>/MediaBox[ 0 0 595.32 841.92] /Contents 4 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 0>> endobj 4 0 obj <</Filter/FlateDecode/Length 1200>> stream xœ½XÛnÛH

Can anyone please help me with this ASAP?

raphw commented 6 years ago

Did you try the conversion in Word directly, without documents4j? If you get the same result, this is most likely a Word issue and has nothing to do with documents4j.

CostinelDumitrescu commented 6 years ago

What do you mean by directly in Word?

raphw commented 6 years ago

If you open Word and convert the document manually, what does it look like? documents4j merely automates that process.

CostinelDumitrescu commented 6 years ago

If I open that test.pdf file with MS Word and then SAVE_AS test.docx, it looks fine.

raphw commented 6 years ago

It seems that the way the PDF file is opened does not detect it as a PDF, so Word parses its raw format instead.

You can try to build documents4j and play with the parameters set in https://github.com/documents4j/documents4j/blob/master/documents4j-transformer-msoffice/documents4j-transformer-msoffice-word/src/main/resources/word_convert.vbs to see if there is another option in VBS that solves this issue.

halfer commented 6 years ago

Cross-posted to Stack Overflow.

CostinelDumitrescu commented 6 years ago

Thank you very much for the help. I'm planning to use this on a Linux platform. My question is: can LibreOffice be used instead of MS Word? Is there already a bridge to LibreOffice? Thank you again.

raphw commented 6 years ago

As of today, there is not. But you are welcome to contribute an implementation. It should not be too hard to do.

rafaeljigau commented 2 years ago

Hello there, thanks for this awesome library!

I have the same problem as @CostinelDumitrescu: the PDF to DOCX conversion gives me weird characters. Did you find a fix for this?

ismakc commented 10 months ago

Encountered the same issue while transforming a PDF to DOCX. Upon debugging the code, I noticed peculiar behavior.

When the source is an InputStream, a temporary file is generated from the source before conversion, but without an extension (i.e., without the .pdf suffix). When the VBS script runs on a file that lacks the extension, it generates a DOCX with strange characters. However, running it on a file that has the .pdf extension produces a correctly generated DOCX.

converter
    .convert(inputStream, false)
    .as(DocumentType.PDF)
    .to(os)
    .as(DocumentType.DOCX)
    .execute();

(edited to add simple Java code sample)
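
Based on the observation above, one possible workaround is to copy the stream into a temporary file that does carry the .pdf suffix and hand that file to the converter instead of the stream. The following is only a sketch under that assumption: the class name PdfTempFile and the method copyToTempPdf are made up for illustration, and I have not confirmed that a File source avoids the problem in every documents4j version.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of documents4j.
final class PdfTempFile {

    private PdfTempFile() {
    }

    // Copies the stream into a new temporary file whose name ends in ".pdf"
    // and closes the stream afterwards. The caller should delete the file
    // once the conversion has finished.
    static Path copyToTempPdf(InputStream source) throws IOException {
        Path tmp = Files.createTempFile("documents4j-src-", ".pdf");
        try (InputStream in = source) {
            // createTempFile already created an empty file, so overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp;
    }
}
```

The returned path could then be passed on as a file, e.g. converter.convert(tmp.toFile()).as(DocumentType.PDF), so that Word sees the .pdf extension; whether this is enough likely depends on how the converter hands the file to the VBS script.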
