Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.37k stars 572 forks

build: switch arm64 image to wolfi-base #3268

Closed MthwRobinson closed 1 week ago

MthwRobinson commented 1 week ago

Summary

Updates the arm64 build to use the same Dockerfile as amd64, since there are now upstream wolfi-base base images for both architectures. The legacy rockylinux-9.4 Dockerfile is now stashed in a subdirectory of the docker directory and is no longer built in CI, but it is available if users would like to build it themselves.

Additionally, this PR includes a fix to symlink python3 to python3.11, which had caused a CI failure here.
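
As a rough illustration of the symlink fix, the sketch below points a `python3` link at `python3.11` inside a scratch directory. It is not the Dockerfile change itself: the helper name `ensure_symlink` and the temporary paths are assumptions made for the example, and the empty file only stands in for the real interpreter.

```python
import os
import tempfile


def ensure_symlink(target: str, link: str) -> None:
    """Point `link` at `target`, replacing an existing link or file (hypothetical helper)."""
    if os.path.islink(link) or os.path.exists(link):
        os.remove(link)
    os.symlink(target, link)


# Work in a scratch directory instead of touching /usr/bin.
with tempfile.TemporaryDirectory() as bin_dir:
    real_interpreter = os.path.join(bin_dir, "python3.11")
    open(real_interpreter, "w").close()  # stand-in for the real binary
    link_path = os.path.join(bin_dir, "python3")
    ensure_symlink(real_interpreter, link_path)
    # Resolve the link to confirm which file it points at.
    resolved = os.path.realpath(link_path)
    print(os.path.basename(resolved))
```

Replacing an existing link before creating the new one keeps the helper safe to run more than once, which matters when an image layer is rebuilt.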

BREAKING CHANGE: the arm64 image no longer supports .doc, .pptx, or .xls because we do not yet have a libreoffice apk built for wolfi-base. We intend to address that in a follow-on PR. All other filetypes work.
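
For callers who want to guard against this on the arm64 image, a minimal sketch of a pre-check is shown here. The function names and the set of machine strings are assumptions for illustration and are not part of unstructured; the extension list is copied from the note above.

```python
import platform
from pathlib import Path
from typing import Optional

# Extensions listed in the breaking-change note above.
LIBREOFFICE_EXTENSIONS = {".doc", ".pptx", ".xls"}

# Machine names commonly reported by 64-bit ARM hosts (assumed list).
ARM64_MACHINES = {"arm64", "aarch64"}


def is_arm64(machine: Optional[str] = None) -> bool:
    """Return True if the given (or current) machine name looks like arm64."""
    name = machine if machine is not None else platform.machine()
    return name.lower() in ARM64_MACHINES


def supported_on_this_image(filename: str, machine: Optional[str] = None) -> bool:
    """Hypothetical check: False for libreoffice-backed filetypes on arm64."""
    suffix = Path(filename).suffix.lower()
    return not (is_arm64(machine) and suffix in LIBREOFFICE_EXTENSIONS)


print(supported_on_this_image("report.doc", machine="aarch64"))  # False
print(supported_on_this_image("report.doc", machine="x86_64"))   # True
print(supported_on_this_image("report.pdf", machine="aarch64"))  # True
```

Passing `machine` explicitly makes the check easy to test on any host; leaving it out falls back to `platform.machine()`.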

Testing

The docker build, tests, and smoke tests all pass for amd64 and arm64 on the feature branch (with publish disabled).

hughesadam87 commented 2 days ago

Appreciate this work - these changes to the Dockerfile seem to improve security scans tremendously. Is the goal here to get the Dockerfile to pass vulnerability scans?

MthwRobinson commented 2 days ago

Thanks @hughesadam87 ! Yeah the change to wolfi is intended to improve the security posture for our images.

hughesadam87 commented 2 days ago

> Thanks @hughesadam87 ! Yeah the change to wolfi is intended to improve the security posture for our images.

Ah great. Let me ask this - your full installation depends on libreoffice. Wolfi recently started supporting libreoffice. Is the libreoffice bundled with the full install of unstructured going to use the Wolfi libreoffice build?

MthwRobinson commented 2 days ago

Coming soon! We have a PR in our base images repo that will switch to using the libreoffice package from the wolfi package manager. We'll have libreoffice available for the arm64 build once we switch to the new upstream base image. Planning to have that in before the end of the week.