Galileo-Galilei / kedro-mlflow

A kedro-plugin for integration of mlflow capabilities inside kedro projects (especially machine learning model versioning and packaging)
https://kedro-mlflow.readthedocs.io/
Apache License 2.0

Accessing the project context inside hooks implementations #66

Closed takikadiri closed 3 years ago

takikadiri commented 4 years ago

Hi,

I'm a colleague of @Galileo-Galilei. We discussed the issue of accessing the project template and its properties from a plugin, and we raised the point in the Kedro project as well.

What will it solve:

This solves #64, #54, #30 and #31 (users will use Kedro's credentials mechanism; I'll put the details in another issue), as well as part of #62.

How can we implement it (for now, I see three possibilities):

Keep up the good work guys, I'm happy to join you :)

Galileo-Galilei commented 4 years ago

I suggest that we go with the first solution (call load_context in the before_pipeline_run hook), which is completely invisible to the user and will solve this painful problem. We will plan to migrate to KedroSession once it is official. @takikadiri, can you take it?

takikadiri commented 4 years ago

I dug into the implementation details of the first solution, and I can tell you it's going to be very hacky.

The MlflowNodeHook also needs to access the context, but it doesn't have the data (project_path, env, extra_params) to create it. That means the MlflowNodeHook and some other hooks in the plugin would expect the before_pipeline_run hook to create the context and store it in a global variable (or somewhere else) so that the other hooks can access it.

One of the many side effects of using a global variable as storage is that the order of the hooks list given by the user in their Kedro project starts to matter: the hook implementing before_pipeline_run always has to come first in the list.
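
To make the ordering concern concrete, a minimal sketch of that global-variable pattern could look like this (illustrative only, this is not the plugin's actual code):

from typing import Any, Dict

from kedro.framework.context import load_context
from kedro.framework.hooks import hook_impl

_CONTEXT = None  # module-level storage shared by the hook classes


class MlflowPipelineHook:
    @hook_impl
    def before_pipeline_run(self, run_params: Dict[str, Any], pipeline, catalog) -> None:
        global _CONTEXT
        # the only hook that receives project_path / env / extra_params
        _CONTEXT = load_context(
            run_params["project_path"],
            env=run_params["env"],
            extra_params=run_params["extra_params"],
        )


class MlflowNodeHook:
    @hook_impl
    def before_node_run(self, node, catalog, inputs, is_async, run_id) -> None:
        # only works if MlflowPipelineHook.before_pipeline_run already ran,
        # hence the constraint on the order of the hooks list
        if _CONTEXT is None:
            raise RuntimeError("before_pipeline_run has not created the context yet")
        mlflow_conf = _CONTEXT.config_loader.get("mlflow*", "mlflow*/**")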

I suggest going for the second solution, while keeping in mind the migration to the third one as soon as Kedro 0.17.0 is released. The second solution looks like this:

Do you see another way to implement the first solution?

I'll create a pull request for this.

Galileo-Galilei commented 4 years ago

I understand that the second solution is much cleaner than the first (mostly because the first one creates another context object for each node), but I don't see why you need a global variable for the first one. I think that creating an attribute on the fly is enough for our use cases (see here):

def before_pipeline_run(
        self, run_params: Dict[str, Any], pipeline: Pipeline, catalog: DataCatalog
    ) -> None:
        """Hook to be invoked before a pipeline runs.

        Args:
            run_params: The params used to run the pipeline.
                Should be identical to the data logged by Journal with the following schema::

                   {
                     "run_id": str
                     "project_path": str,
                     "env": str,
                     "kedro_version": str,
                     "tags": Optional[List[str]],
                     "from_nodes": Optional[List[str]],
                     "to_nodes": Optional[List[str]],
                     "node_names": Optional[List[str]],
                     "from_inputs": Optional[List[str]],
                     "load_versions": Optional[List[str]],
                     "pipeline_name": str,
                     "extra_params": Optional[Dict[str, Any]]
                   }

            pipeline: The ``Pipeline`` that will be run.
            catalog: The ``DataCatalog`` to be used during the run.
        """
        self.context = load_context(run_params["project_path"])
        self.env = run_params["env"]  # is it necessary?
        self.extra_params = run_params["extra_params"]  # is it necessary?
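
For illustration, a later hook method of the same class could then read these attributes directly (a sketch; the method body is only illustrative):

def after_pipeline_run(
    self, run_params: Dict[str, Any], pipeline: Pipeline, catalog: DataCatalog
) -> None:
    # the attributes stored by before_pipeline_run above are still available here
    mlflow_conf = self.context.config_loader.get("mlflow*", "mlflow*/**")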

As shown above, the information will be accessible later in the hook. For our use cases, it is even simpler:

  1. Remove __init__ in MlflowNodeHook: https://github.com/Galileo-Galilei/kedro-mlflow/blob/e0afaf44ed045fc3ef205d6065307d897670a428/kedro_mlflow/framework/hooks/node_hook.py#L11-L18
  2. Load conf with the config loader at the beginning here:

https://github.com/Galileo-Galilei/kedro-mlflow/blob/e0afaf44ed045fc3ef205d6065307d897670a428/kedro_mlflow/framework/hooks/node_hook.py#L21-L40

something like this (a fuller sketch follows after the list):

config = KedroMlflowConfig.from_dict(self.context.config_loader.get("mlflow*", "mlflow*/**"))
...
  3. Do the same change in MlflowPipelineHook:

https://github.com/Galileo-Galilei/kedro-mlflow/blob/e0afaf44ed045fc3ef205d6065307d897670a428/kedro_mlflow/framework/hooks/pipeline_hook.py#L65-L67
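
Putting these steps together, the change to MlflowNodeHook could look roughly like this (just a sketch: the config_loader patterns, the KedroMlflowConfig import path, and the exact from_dict call would need to be checked):

from typing import Any, Dict

from kedro.framework.context import load_context
from kedro.framework.hooks import hook_impl

from kedro_mlflow.framework.context.config import KedroMlflowConfig  # assumed import path


class MlflowNodeHook:
    @hook_impl
    def before_pipeline_run(self, run_params: Dict[str, Any], pipeline, catalog) -> None:
        # stash the context so that node-level hooks can reach the config loader
        self.context = load_context(run_params["project_path"])

    @hook_impl
    def before_node_run(self, node, catalog, inputs, is_async, run_id) -> None:
        # read mlflow.yml through the project's config loader instead of
        # receiving the values as __init__ arguments
        config = KedroMlflowConfig.from_dict(
            self.context.config_loader.get("mlflow*", "mlflow*/**")
        )
        # ... then use `config` to set up mlflow (tracking uri, params logging, etc.)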

And it should be OK. The reason why I prefer solution 1 over solution 2 is that the first one is a non-breaking change, while the second modifies the hook arguments and is consequently a breaking change. Since neither of them is the target solution, I think we can live with it for a few months. Does that sound good to you?