This PR addresses an issue where the code could attempt to run soffice in multiple processes at the same time, and closes #3284.
The fix is to add a wait mechanism for the case where another soffice process is already running.
Diagnosis of issue
soffice can only have one process running at a time when the soffice command is used as is.
On the main branch, the function partition.common.convert_office_doc simply spawns a subprocess that runs the soffice command to convert a doc or ppt file into docx or pptx format.
If multiple partition calls process doc or ppt files and each one tries to spawn a soffice subprocess, only one will succeed; the others simply fail, and their subprocesses return 1.
Downstream, this leads to errors like PackageNotFoundError: Package not found at '/tmp/tmpac6lcu4w/document.docx'
Solution
While there are ways to circumvent this soffice limit by setting a tmp file as the user installation env, these kinds of solutions rely on the internals of soffice and add maintenance cost to track its changes.
This PR solves this problem by adding a wait mechanism:
The function first spawns a subprocess to run soffice.
If stdout is empty and there is still wait time budget left, the function checks whether another soffice process is running.
If yes, the function waits for 0.01s before checking again.
If no, the function spawns a subprocess to run soffice and returns to the beginning of this step.
It needs to return to the beginning and check again whether stdout is empty, because another collision could happen right after soffice becomes available.
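The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `convert_with_wait`, its parameters, and the two injected callables (`run_soffice`, which returns the stdout of one conversion attempt, and `soffice_is_running`, which reports whether another soffice process exists) are hypothetical names chosen for the example. Only the 0.01s polling interval comes from the description above.

```python
import time
from typing import Callable


def convert_with_wait(
    run_soffice: Callable[[], str],
    soffice_is_running: Callable[[], bool],
    wait_budget_secs: float = 10.0,   # hypothetical default, not taken from the PR
    poll_interval_secs: float = 0.01,  # the 0.01s wait described above
) -> str:
    """Retry a soffice conversion until it produces output or the budget runs out."""
    deadline = time.monotonic() + wait_budget_secs
    # First attempt: empty stdout is treated as "soffice was busy, nothing converted".
    output = run_soffice()
    while output == "" and time.monotonic() < deadline:
        if soffice_is_running():
            # Another soffice process exists: wait briefly, then check again.
            time.sleep(poll_interval_secs)
        else:
            # soffice looks free: try again, then re-check stdout at the top of
            # the loop in case another process collided with this attempt.
            output = run_soffice()
    return output
```

Passing the two callables in keeps the sketch independent of how the subprocess is launched or how running processes are detected; in real code they would wrap the subprocess call and a process lookup.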
Test
This PR adds two unit tests.
Additionally, this can be tested by running partition on .doc files locally with multiprocessing.