MthwRobinson closed this 3 months ago
Confirmed this is also the case for partition_ppt, but not if you run partition_doc first. Meaning, this seems to be some kind of first-run issue with libreoffice.
And for reference the error occurs here:
https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/common.py#L413-L415
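Since a second call succeeds, one stopgap on the caller side is to retry the partition call once. This is only a minimal sketch, assuming the first-run failure surfaces as an exception from the partition function; call_with_retry is a hypothetical helper, not part of unstructured.

```python
from typing import Any, Callable


def call_with_retry(func: Callable[..., Any], *args: Any, retries: int = 1, **kwargs: Any) -> Any:
    """Call func(*args, **kwargs), retrying up to `retries` more times if it raises."""
    attempts = max(retries, 0) + 1
    for attempt in range(attempts):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Re-raise on the final attempt so genuine failures are not hidden.
            if attempt == attempts - 1:
                raise
```

Usage would look like elements = call_with_retry(partition_doc, "example-docs/fake.doc"). Catching every Exception is deliberately broad here because the exact error type is not stated in this thread.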
I can confirm that issue. I tried installing libreoffice inside a chainguard/wolfi-based container and the same thing happens.
docker run -it --rm -v ./example-docs:/example-docs cgr.dev/chainguard/wolfi-base:latest /bin/sh
# inside container
apk add libreoffice
# Result: libreoffice 24.2.4.2 is installed
/usr/lib/libreoffice/program/soffice.bin --headless --convert-to docx --outdir /tmp /example-docs/fake.doc
# Result: the command executes, but NOTHING happens
# Running the second time
/usr/lib/libreoffice/program/soffice.bin --headless --convert-to docx --outdir /tmp /example-docs/fake.doc
# Result: doc -> docx conversion executes properly
Perhaps we could try building an apk for libreoffice 7.1.8.1 using melange, but it does not work out of the box: compilation fails because libcmis-0.5 is not available (there's 0.6 though).
https://github.com/wolfi-dev/os/commit/50555837327bffd60639149320f2eef2aa0461b3
The only way to make it work (hacky) is to make a dummy soffice call in the entrypoint script that executes on container start. The first run of soffice does nothing, but each subsequent call functions as expected. It works...
I ran docker container diff to check what happens during soffice command execution. It turned out that it creates plenty of config files. So then I tried adding this to my Dockerfile:
# dummy config initialization
RUN soffice
but it was failing consistently. I have found that exiting with code 81 is "normal" and expected behavior, see this: https://github.com/jodconverter/jodconverter/issues/48#issuecomment-1863864333
Adding the following check, which treats exit code 81 as success, solves the problem:
RUN /usr/bin/soffice --headless || [ $? -eq 81 ] || exit 1
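The same exit-code check can be done from Python, for example at application startup instead of at image build time. This is a rough sketch, assuming soffice lives at /usr/bin/soffice as in the RUN line above and that 81 is the only non-zero code worth tolerating. warm_up_soffice and its parameters are made-up names. The timeout is a guard in case soffice does not exit on its own; subprocess.run then kills the process and raises TimeoutExpired.

```python
import subprocess
from typing import Sequence

# Exit code 81 is described in the linked jodconverter thread as expected behavior.
ACCEPTED_EXIT_CODES = frozenset({0, 81})


def warm_up_soffice(
    command: Sequence[str] = ("/usr/bin/soffice", "--headless"),
    timeout: float = 120.0,
) -> int:
    """Run soffice once so it can create its config files; return the exit code."""
    result = subprocess.run(list(command), timeout=timeout, check=False)
    if result.returncode not in ACCEPTED_EXIT_CODES:
        raise RuntimeError(f"soffice warm-up failed with exit code {result.returncode}")
    return result.returncode
```

Passing the command as a parameter keeps the helper usable with a different soffice path, such as /usr/lib/libreoffice/program/soffice.bin in the wolfi container.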
Apply this to wolfi-based image
Describe the bug
To Reproduce
make docker-build
make docker-start-dev
Run elements = partition_doc("example-docs/fake.doc") again. It will work and you'll see output from elements[0].text. This can be in the same or a new python session.
Expected behavior
I'd expect the command to run the first time.
Environment Info
This is in the AMD64 Python docker image for unstructured==0.14.2