Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

`partition_doc` fails the first time it is run in the AMD64 container #3105

Closed MthwRobinson closed 3 months ago

MthwRobinson commented 4 months ago

Describe the bug

To Reproduce

Expected behavior I'd expect the command to run the first time.

Environment Info This is in the AMD64 Python docker image for unstructured==0.14.2

MthwRobinson commented 4 months ago

Confirmed this is also the case for partition_ppt, but not if you run partition_doc first. This suggests some kind of first-run issue with LibreOffice.

MthwRobinson commented 4 months ago

And for reference the error occurs here:

https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/common.py#L413-L415

micmarty-deepsense commented 3 months ago

I can confirm this issue. I tried installing LibreOffice inside a chainguard/wolfi-based container and the same thing happens.

How to reproduce

docker run -it --rm -v ./example-docs:/example-docs cgr.dev/chainguard/wolfi-base:latest /bin/sh

# inside container
apk add libreoffice
# Result: libreoffice 24.2.4.2 is installed

/usr/lib/libreoffice/program/soffice.bin --headless --convert-to docx --outdir /tmp /example-docs/fake.doc
# Result: the command executes, but NOTHING happens

# Running the second time
/usr/lib/libreoffice/program/soffice.bin --headless --convert-to docx --outdir /tmp /example-docs/fake.doc
# Result: doc -> docx conversion executes properly
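
Because the first call returns without writing any output and the second one succeeds, a caller can detect the silent failure by checking for the converted file and retrying once. A minimal sketch, assuming the same soffice flags as in the commands above; `SOFFICE_BIN` and `convert_with_retry` are hypothetical names, not part of unstructured or LibreOffice:

```shell
# Hypothetical variable: path to the LibreOffice binary, overridable from the environment.
SOFFICE_BIN="${SOFFICE_BIN:-/usr/lib/libreoffice/program/soffice.bin}"

# Hypothetical helper: convert $1 to docx inside directory $2, retrying once
# if the first attempt produces no output file.
convert_with_retry() {
    input="$1"
    outdir="$2"
    name="$(basename "$input")"
    expected="$outdir/${name%.*}.docx"
    attempt=1
    while [ "$attempt" -le 2 ]; do
        # The exit status is ignored on purpose: the first run may report
        # nothing useful, so only the presence of the output file is trusted.
        "$SOFFICE_BIN" --headless --convert-to docx --outdir "$outdir" "$input" >/dev/null 2>&1 || true
        if [ -f "$expected" ]; then
            return 0
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

Usage would look like `convert_with_retry /example-docs/fake.doc /tmp`. Checking for the file rather than the exit code is a design choice here, since the thread only establishes that the first run writes nothing.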

Other ideas

Perhaps we could try building an apk for libreoffice 7.1.8.1 using melange, but it does not work out of the box: compilation fails because libcmis-0.5 is not available (there's 0.6 though). See https://github.com/wolfi-dev/os/commit/50555837327bffd60639149320f2eef2aa0461b3

Workaround

The only way I found to make it work (hacky) is to make a dummy soffice call in the entrypoint script that runs on container start. The first run of soffice does nothing, but every subsequent call works as expected.
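
A rough sketch of what such an entrypoint warm-up could look like. It reuses the conversion flags from the reproduction steps; `SOFFICE_BIN` and `warm_up_soffice` are hypothetical names, and the choice of a throwaway conversion as the dummy call is an assumption:

```shell
# Hypothetical variable: LibreOffice launcher to use, overridable from the environment.
SOFFICE_BIN="${SOFFICE_BIN:-soffice}"

# Hypothetical helper: run one throwaway conversion of the document in $1 so
# that later soffice calls in this container behave normally.
warm_up_soffice() {
    warmup_dir="$(mktemp -d)"
    # The first call is expected to do nothing useful, so both its output and
    # its exit status are discarded.
    "$SOFFICE_BIN" --headless --convert-to docx --outdir "$warmup_dir" "$1" >/dev/null 2>&1 || true
    rm -rf "$warmup_dir"
}
```

In an entrypoint script this would be followed by the container's real command, for example `warm_up_soffice /example-docs/fake.doc` and then `exec "$@"`.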

micmarty-deepsense commented 3 months ago

Solution

I ran docker container diff to check what happens during soffice command execution. It turned out that it creates plenty of config files. So then I tried adding this to my Dockerfile:

# dummy config initialization 
RUN soffice

but it failed consistently. I found that exiting with code 81 is "normal", expected behavior; see https://github.com/jodconverter/jodconverter/issues/48#issuecomment-1863864333

Treating exit code 81 as success solves the problem:

RUN /usr/bin/soffice --headless || [ $? -eq 81 ] || exit 1
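
The same check written out as a shell function, in case it needs to be reused outside a Dockerfile RUN line. This is only a sketch: `run_allowing_81` is a hypothetical helper name, and accepting exactly 0 and 81 as success is taken from the one-liner above.

```shell
# Hypothetical helper: run the given command and treat exit codes 0 and 81
# as success; any other exit code is passed through unchanged.
run_allowing_81() {
    status=0
    "$@" || status=$?
    if [ "$status" -eq 0 ] || [ "$status" -eq 81 ]; then
        return 0
    fi
    return "$status"
}
```

For example, `run_allowing_81 /usr/bin/soffice --headless` would succeed on the first-run exit code 81 but still fail the build on any other error.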


What to do next

Apply this to the wolfi-based image.