Description:
We've encountered an issue with deadlocks arising from an interaction between Python's multiprocessing and logging libraries in gdal.py. These deadlocks occur sporadically, leading to batch jobs running significantly longer than expected or getting stuck entirely.
Problem:
The root cause appears to be that processes created by the multiprocessing library share memory with their parent by default. One likely scenario: a worker process inherits a logging lock that was held at the moment the worker was created, nothing in the worker ever releases it, and the worker's next log call blocks forever.
As an initial fix, we disabled memory sharing, but this caused occasional out-of-memory errors.
For now, we have removed the logging calls from gdal.py instead, which prevents the deadlocks.
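The memory sharing described above most likely corresponds to the `fork` start method, which has historically been the default on Linux; that reading is an assumption, not something the commit states. Under that assumption, "disabling memory sharing" can be sketched as selecting the `spawn` start method for the pool. The names `square`, `make_context` and `run_batch` are hypothetical and do not come from gdal.py.

```python
import multiprocessing


def square(x):
    # Hypothetical stand-in for the real per-item work in gdal.py.
    return x * x


def make_context(start_method="spawn"):
    # get_context() returns a context bound to one start method without
    # changing the process-wide default.
    return multiprocessing.get_context(start_method)


def run_batch(items, start_method="spawn", processes=2):
    ctx = make_context(start_method)
    # Workers started with "spawn" begin from a fresh interpreter, so they do
    # not inherit locks (or data) from the parent's memory.
    with ctx.Pool(processes=processes) as pool:
        return pool.map(square, items)
```

With `spawn`, `run_batch` has to be called from under an `if __name__ == "__main__":` guard, because each worker re-imports the main module. Each worker also loads its own copy of whatever it needs instead of sharing the parent's pages, which would be consistent with the out-of-memory errors mentioned above.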
Next Steps:
The ultimate solution will involve refactoring the module to reduce the scope of multithreading, but this will require more extensive changes.
Current Workaround:
Temporarily, we've opted to remove logging to avoid the deadlocks until we can rewrite the multithreaded parts of the module.
Commit that turned off logging in gdal.py: https://github.com/Open-EO/openeo-geopyspark-driver/commit/c1b8676b3c93183082c93b97601cabe38e643f34
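A possible alternative to removing logging entirely is the queue-based pattern from the Python logging cookbook: workers only push records onto a queue, and a single listener thread in the parent writes them out. This is a minimal sketch and not what the linked commit does; the function names are hypothetical, and whether it avoids the deadlock seen in gdal.py has not been verified.

```python
import logging
import logging.handlers


def configure_worker_logging(queue):
    # Replace the root logger's handlers in this process with one handler
    # that only enqueues records (assumption: nothing else needs them).
    root = logging.getLogger()
    root.handlers.clear()
    root.addHandler(logging.handlers.QueueHandler(queue))
    root.setLevel(logging.INFO)


def start_listener(queue, *handlers):
    # Start a background thread that takes records off the queue and passes
    # them to the real handlers (file, stream, ...).
    listener = logging.handlers.QueueListener(queue, *handlers)
    listener.start()
    return listener
```

In a multiprocessing setup the queue would be a `multiprocessing.Queue` created in the parent, with `configure_worker_logging` passed as the pool's `initializer`; the parent calls `listener.stop()` once the pool has finished.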