equinor / ert

ERT - Ensemble based Reservoir Tool - is designed for running ensembles of dynamical models such as reservoir models, in order to do sensitivity analysis and data assimilation. ERT supports data assimilation using the Ensemble Smoother (ES), Ensemble Smoother with Multiple Data Assimilation (ES-MDA) and Iterative Ensemble Smoother (IES).
https://ert.readthedocs.io/en/latest/
GNU General Public License v3.0

Function step does not work on PBS cluster #1505

Closed markusdregi closed 3 years ago

markusdregi commented 3 years ago

Describe the bug

Function step does not run on Azure cluster.

ModuleNotFoundError("No module named 'function_steps'")

To Reproduce

Steps to reproduce the behavior:

  1. ssh <azure-machine>
  2. git clone https://github.com/equinor/ert.git
  3. python3 -m venv env
  4. source env/bin/activate
  5. pip install --upgrade pip
  6. cd ert
  7. pip install .
  8. cd examples/polynomial
  9. ert3 init
  10. Change local to pbs in experiments/function_evaluation/ensemble.yml
  11. ert3 run function_evaluation

Expected behaviour

Running successfully

Environment

DanSava commented 3 years ago

There are 2 reasons why the function step does not work on the PBS cluster.

  1. The function result is persisted to disk using the shared drive storage. This issue should be resolved once the transmittable records PR is merged.
  2. The function_steps module, where the user-defined functions should be added, is not available on the execution nodes. The reason seems to be that it is not straightforward to pickle a function that is not defined in the __main__ module together with its dependencies (ref: https://stackoverflow.com/questions/26389981/serialize-a-python-function-with-dependencies). I found one way to avoid this issue: use the dill package to load the function source code and execute it in the main module (working example: https://github.com/DanSava/ert/commit/8a0998c3912afb29623662eaa170393e129075a3). This approach requires a discussion; there might be better ways to use cloudpickle that avoid relying on dill to get the function source, but I could not find them. I tested the example locally by starting a dask-scheduler and 2 workers to be used by prefect.
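
The second point can be reproduced without a cluster. The sketch below is only an illustration under stated assumptions, not the code from the linked commit: it uses the standard library's pickle and inspect.getsource in place of cloudpickle and dill, and the module name function_steps_demo and the helpers ship_by_source and load_from_source are hypothetical. It shows that a function from an importable module is pickled by reference (module and name only), so unpickling fails with ModuleNotFoundError where the module is missing, while shipping the source text and executing it on the receiving side works.

```python
import importlib
import inspect
import pickle
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical stand-in for the user's function_steps module.
MODULE_SOURCE = textwrap.dedent(
    """
    def polynomial(coefficients, x):
        a, b, c = coefficients
        return a * x**2 + b * x + c
    """
)


def make_user_module(directory):
    """Write the demo module to disk and import it (the submitting side)."""
    (Path(directory) / "function_steps_demo.py").write_text(MODULE_SOURCE)
    sys.path.insert(0, directory)
    importlib.invalidate_caches()
    return importlib.import_module("function_steps_demo")


def forget_user_module(directory):
    """Simulate an execution node where the module is not installed."""
    sys.path.remove(directory)
    sys.modules.pop("function_steps_demo", None)
    importlib.invalidate_caches()


def ship_by_source(func):
    """Serialize the function as (name, source text) instead of by reference."""
    return pickle.dumps((func.__name__, inspect.getsource(func)))


def load_from_source(payload):
    """Rebuild the function by executing its source in a fresh namespace."""
    name, source = pickle.loads(payload)
    namespace = {}
    exec(source, namespace)
    return namespace[name]


with tempfile.TemporaryDirectory() as tmp:
    module = make_user_module(tmp)
    by_reference = pickle.dumps(module.polynomial)  # stores module + name only
    by_source = ship_by_source(module.polynomial)
    forget_user_module(tmp)

# On the "execution node" the by-reference payload cannot be resolved.
try:
    pickle.loads(by_reference)
    reference_error = None
except ModuleNotFoundError as err:
    reference_error = err

# The by-source payload carries everything needed to rebuild the function.
restored = load_from_source(by_source)
result = restored([1, 2, 3], 2)
print(repr(reference_error), result)
```

Shipping source text only works when the function is self-contained: any helper it calls or module it imports at module level must also exist on the receiving side, which is part of why this approach needs discussion.
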
sondreso commented 3 years ago

Reading through https://docs.prefect.io/core/advanced_tutorials/task-guide.html, it seems that they don't mention anything about the above problems in the documentation. They mention that inputs and outputs need to be cloudpicklable here, but I can't find any limitations on functions (except the warning about the signature). Did you say that you were looking into what prefect actually does, @DanSava? I think that could be interesting to know. 🤔

sondreso commented 3 years ago

Or actually the combination of tasks and cloudpickle is mentioned here. Might be that our use of attributes to transport the function is the problem here?

DanSava commented 3 years ago

I re-ran the function_evaluation case on the PBS cluster after the fix https://github.com/equinor/ert/pull/1546 and the record transmitting PR were merged, and it now completes successfully.

Closing the issue.