Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.44k stars 580 forks

build: image and dependency updates; fix tesseract files locations #3310

Closed MthwRobinson closed 1 day ago

MthwRobinson commented 5 days ago

Summary

Updates to the latest version of the wolfi-base image. Changes include:

Testing

MthwRobinson commented 5 days ago

@micmarty-deepsense - Looks like tests are failing with libreoffice24. Specifically, soffice doesn't look like it converts ppt to pptx, though it does convert doc to docx. I forget why you swapped back to libreoffice7 in this PR. Was that it?

Weirdly, tests passed with libreoffice24 in #3065, so I wonder if something changed with the wolfi-os libreoffice24 build in the meantime.

MthwRobinson commented 5 days ago

@micmarty-deepsense - Looked back through our convo history and yeah, that looks like the same test failure you got that caused us to move back to libreoffice7. Was able to reproduce locally. Going to see if I can figure out what's causing it.
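
For anyone who wants to check the same thing outside the test suite, here is a minimal sketch of a local conversion check. It assumes `soffice` is on the PATH and uses LibreOffice's `--headless`, `--convert-to`, and `--outdir` options. The helper names `build_convert_command` and `converts`, and the sample file names, are hypothetical and not taken from the repo. It checks for the output file rather than relying on the exit code.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_convert_command(source: Path, target_format: str, outdir: Path) -> list[str]:
    """Build the soffice command line for a headless conversion (hypothetical helper)."""
    return [
        "soffice",
        "--headless",
        "--convert-to",
        target_format,
        "--outdir",
        str(outdir),
        str(source),
    ]


def converts(source: Path, target_format: str, timeout: int = 120) -> bool:
    """Return True if soffice wrote the expected output file for `source`."""
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        outdir = Path(tmp)
        # Do not trust the exit code alone; look for the converted file instead.
        subprocess.run(
            build_convert_command(source, target_format, outdir),
            check=False,
            capture_output=True,
            timeout=timeout,
        )
        return (outdir / f"{source.stem}.{target_format}").exists()


if __name__ == "__main__":
    # Sample file names are placeholders; point these at real fixtures.
    for src, fmt in [(Path("sample.doc"), "docx"), (Path("sample.ppt"), "pptx")]:
        if src.exists():
            print(f"{src} -> {fmt}: {converts(src, fmt)}")
```

Running this against one `.doc` and one `.ppt` fixture in each image would show whether only the ppt conversion is affected.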

micmarty-deepsense commented 2 days ago

Was that it?

@MthwRobinson sorry for the late response, but yes, the ppt format was the reason.