Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.61k stars 704 forks

fix: wait to run soffice until there is no other soffice process running #3287

Closed badGarnet closed 3 months ago

badGarnet commented 3 months ago

Summary

This PR addresses an issue where the code could attempt to run soffice in multiple processes at once, and closes #3284. The fix is to add a wait mechanism for when another soffice process is already running.

Diagnosis of issue

Solution

While there are ways to circumvent this soffice limitation by setting a tmp file as the user installation env, these kinds of solutions rely on the internals of soffice and add the maintenance cost of tracking its changes.

This PR solves this problem by adding a wait mechanism:
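
The exact change is not shown here. As a rough illustration only, a wait mechanism of this kind could be sketched as below. The names `soffice_is_running` and `wait_for_soffice`, the `timeout` and `poll_interval` parameters, and the use of `pgrep` for detection are all assumptions made for this sketch, not the PR's actual code.

```python
import subprocess
import time
from typing import Callable


def soffice_is_running() -> bool:
    """Best-effort check for a running soffice process.

    Assumption: a POSIX system with `pgrep` available. If `pgrep` is
    missing, this sketch reports that nothing is running.
    """
    try:
        # pgrep exits with 0 when at least one process matches the pattern.
        result = subprocess.run(
            ["pgrep", "-f", "soffice"], capture_output=True, check=False
        )
    except FileNotFoundError:
        return False
    return result.returncode == 0


def wait_for_soffice(
    is_running: Callable[[], bool] = soffice_is_running,
    timeout: float = 10.0,
    poll_interval: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until no soffice process is seen, or until `timeout` seconds pass.

    Returns True if soffice appears idle, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while is_running():
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval)
    return True
```

A caller would invoke `wait_for_soffice()` immediately before launching the conversion command and decide what to do on a `False` result (retry, raise, or proceed anyway). Note that a check-then-run loop like this narrows the window for collisions but does not remove it: two workers can both observe an idle state and start soffice at nearly the same moment.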

Test

This PR adds two unit tests. Additionally, this can be tested by partitioning .doc files locally with multiprocessing.
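
A manual multiprocessing check along those lines might look like the following sketch. It uses `partition_doc` from `unstructured.partition.doc`; the helper names `count_doc_elements` and `partition_all`, the `processes` parameter, and the serial fallback are hypothetical and exist only for this illustration.

```python
from multiprocessing import Pool
from typing import Callable, List, Sequence


def count_doc_elements(path: str) -> int:
    """Partition one .doc file and return how many elements were extracted."""
    # Imported lazily so this module still loads where unstructured is absent.
    from unstructured.partition.doc import partition_doc

    return len(partition_doc(filename=path))


def partition_all(
    paths: Sequence[str],
    worker: Callable[[str], int] = count_doc_elements,
    processes: int = 4,
) -> List[int]:
    """Apply `worker` to every path, in parallel when processes > 1."""
    if processes <= 1 or len(paths) <= 1:
        # Serial baseline: nothing can collide, useful for comparison.
        return [worker(path) for path in paths]
    with Pool(processes=processes) as pool:
        # Each worker process partitions files and so may invoke soffice,
        # which is the concurrent situation the fix is meant to handle.
        return pool.map(worker, paths)
```

Running `partition_all` first with `processes=1` and then with a larger value over the same set of .doc files gives a simple comparison: the element counts should match if concurrent conversions are being handled correctly. On platforms that spawn rather than fork worker processes, the parallel call should sit under an `if __name__ == "__main__":` guard.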