Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.5k stars 586 forks

bug/docker images at quay.io not up to date #3123

Open jpabbuehl opened 1 month ago

jpabbuehl commented 1 month ago

Hi,

I'm experiencing some regression with open source docker images (client and api) at

e.g. the CLI found in the documentation not working as expected, larger image size, base image not updated, bug fixes, etc.

Knowing there is a SaaS offering, is this expected? Or is there any CI rewiring to do on the open-source version? Surprisingly, https://quay.io/repository/unstructured-io/unstructured-api?tab=tags and https://quay.io/repository/unstructured-io/unstructured?tab=tags keep being updated, but the issues persist.

Thanks a lot in advance

Relevant GitHub issues:
https://github.com/Unstructured-IO/unstructured/issues/2274
https://github.com/Unstructured-IO/unstructured-api/issues/339
https://github.com/Unstructured-IO/unstructured-api/issues/387
https://github.com/Unstructured-IO/base-images/issues/11

MthwRobinson commented 1 month ago

Hi @jpabbuehl - are you using the AMD or ARM image? We just swapped the AMD image over to Wolfi OS to mitigate CVEs.

neilkumar commented 1 month ago

@MthwRobinson The arm64 and amd64 images are pretty different in many ways. Are they going to converge back on a setup that works on both with the same versions?

MthwRobinson commented 1 month ago

@neilkumar - Yes, we'll likely move to Wolfi OS for both of them. The only reason we didn't move the arm64 image over already is that we haven't been able to build libreoffice for arm64 yet, so moving it over now would mean losing support for .doc/.ppt/.xls (though .docx, .pptx and .xlsx would still work).
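
One way to tell whether a given image is affected is to check for a LibreOffice binary before sending it legacy files. This is a minimal sketch, assuming LibreOffice is exposed as a `soffice` executable on the PATH (an assumption, not something confirmed in this thread). The helper names `needs_libreoffice` and `can_process` are hypothetical.

```python
import shutil
from pathlib import Path

# Legacy formats named above as depending on LibreOffice.
LEGACY_EXTENSIONS = {".doc", ".ppt", ".xls"}


def needs_libreoffice(filename: str) -> bool:
    """Return True if the file has one of the legacy extensions."""
    return Path(filename).suffix.lower() in LEGACY_EXTENSIONS


def can_process(filename: str, soffice_available: bool) -> bool:
    """Legacy files need LibreOffice; other files are not gated by this check."""
    return soffice_available or not needs_libreoffice(filename)


if __name__ == "__main__":
    # Assumption: LibreOffice is installed as `soffice` on the PATH.
    available = shutil.which("soffice") is not None
    for name in ["report.doc", "report.docx", "slides.ppt", "sheet.xlsx"]:
        print(name, can_process(name, available))
```

Run inside the container, this prints one line per sample file name, so missing legacy support shows up before a real document fails.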

neilkumar commented 1 month ago

@MthwRobinson Thanks for the response.

I did a little digging on Wolfi and found what I think is the package definition here:

https://github.com/wolfi-dev/os/blob/main/libreoffice-24.2.yaml

which led me to the last time it tried to build

https://github.com/wolfi-dev/os/actions/runs/9366118496/job/25785656360

which appears to be blocked because of a Medium-severity CVE, CVE-2012-5639, from 13 years ago

that led me to

https://lwn.net/Articles/957219/

and the actual closed ticket from 13 years ago on the libreoffice side.

https://bugs.documentfoundation.org/show_bug.cgi?id=58295

Guessing this will not be resolved on the upstream side anytime soon.

MthwRobinson commented 4 weeks ago

Oh wow thanks @neilkumar for the links and really interesting background on that CVE! Think I should have bandwidth to take a closer look at this issue later this week.

jpabbuehl commented 3 weeks ago

@MthwRobinson amd64. Thanks for the explanation, I hope there is a workaround.

MthwRobinson commented 3 weeks ago

@jpabbuehl - Could you clarify what's not working for you within the container? I just tried the workflow documented here for the amd64 image and that worked fine for me.

neilkumar commented 2 weeks ago

@MthwRobinson The issue for me is that we want to run the exact same versions of the software across both architectures, and that we add some items to your base image. Among the differences, the amd64 image uses "nonroot" and the arm64 image uses "notebook-user".
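
A quick way to spot this kind of drift is to ask each platform variant which user it runs as. This is a minimal sketch, assuming Docker is installed, can run both platforms (for example through emulation), and that the images ship a `whoami` binary. The `latest` tag is an assumption, and `whoami_command`, `differing_users` and `collect_users` are hypothetical helpers.

```python
import subprocess

# Assumption: repository path taken from the quay.io URL earlier in the thread;
# the `latest` tag is a guess and may need to be replaced.
IMAGE = "quay.io/unstructured-io/unstructured:latest"
PLATFORMS = ["linux/amd64", "linux/arm64"]


def whoami_command(image: str, platform: str) -> list[str]:
    """Build a `docker run` command that prints the container's default user."""
    return [
        "docker", "run", "--rm",
        "--platform", platform,
        "--entrypoint", "whoami",
        image,
    ]


def differing_users(users: dict[str, str]) -> bool:
    """Return True if the platforms do not all report the same user name."""
    return len(set(users.values())) > 1


def collect_users(image: str = IMAGE) -> dict[str, str]:
    """Run `whoami` in each platform variant and map platform to user name."""
    users = {}
    for platform in PLATFORMS:
        result = subprocess.run(
            whoami_command(image, platform),
            capture_output=True, text=True, check=True,
        )
        users[platform] = result.stdout.strip()
    return users
```

Calling `differing_users(collect_users())` on a machine with Docker would flag a mismatch such as "nonroot" versus "notebook-user" before it breaks a derived image.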

MthwRobinson commented 2 weeks ago

@neilkumar - #3213 updated the wolfi image to be closer to the rockylinux image, and the user name is now notebook-user again.