Galileo-Galilei / kedro-mlflow

A kedro-plugin for integration of mlflow capabilities inside kedro projects (especially machine learning model versioning and packaging)
https://kedro-mlflow.readthedocs.io/
Apache License 2.0

Accessing the project context inside hooks implementations #66

Closed takikadiri closed 3 years ago

takikadiri commented 4 years ago

Hi,

I'm a colleague of @Galileo-Galilei. We discussed the issue of accessing the project template and its properties from a plugin, and we raised the point in the Kedro project as well.

What will it solve:

This solves #64, #54, #30 and #31 (users will use Kedro's credentials mechanism; I'll put the details in another issue), as well as part of #62.

How can we implement it (for now, I see three possibilities):

Keep up the good work guys, I'm happy to join you :)

Galileo-Galilei commented 4 years ago

I suggest that we go with the first solution (call load_context in the before_pipeline_run hook), which is completely invisible to the user and will solve this painful problem. We will plan to migrate to KedroSession once it is official. @takikadiri, can you take it?

takikadiri commented 4 years ago

I dug into the implementation details of the first solution, and I can tell you it's going to be very hacky.

The MlflowNodeHook also needs to access the context, but it doesn't have the data (project_path, env, extra_params) to create it. That means the MlflowNodeHook and some other hooks in the plugin would expect the before_pipeline_run hook to create the context and store it in a global variable (or somewhere else) so that the other hooks can access it.

One of the many side effects of using a global variable as storage is that the order of the hooks list given by the user in their Kedro project starts to matter: the hook implementing before_pipeline_run always has to come first in the list.
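
To make the ordering concern concrete, a minimal sketch of that global-variable pattern could look like this (illustrative only, this is not the plugin's actual code):

from typing import Any, Dict

from kedro.framework.context import load_context
from kedro.framework.hooks import hook_impl

_CONTEXT = None  # module-level storage shared by the hook classes


class MlflowPipelineHook:
    @hook_impl
    def before_pipeline_run(self, run_params: Dict[str, Any], pipeline, catalog) -> None:
        global _CONTEXT
        # the only hook that receives project_path / env / extra_params
        _CONTEXT = load_context(
            run_params["project_path"],
            env=run_params["env"],
            extra_params=run_params["extra_params"],
        )


class MlflowNodeHook:
    @hook_impl
    def before_node_run(self, node, catalog, inputs, is_async, run_id) -> None:
        # only works if MlflowPipelineHook.before_pipeline_run already ran,
        # hence the constraint on the order of the hooks list
        if _CONTEXT is None:
            raise RuntimeError("before_pipeline_run has not created the context yet")
        mlflow_conf = _CONTEXT.config_loader.get("mlflow*", "mlflow*/**")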

I suggest going for the second solution, while keeping in mind the migration to the third one as soon as Kedro 0.17.0 is released. The second solution looks like this:

Do you see another way to implement the first solution?

I'll create a pull request for this.

Galileo-Galilei commented 4 years ago

I understand that the second solution is much cleaner than the first (mostly because the first one creates another context object for each node), but I don't see why you need a global variable for the first one. I think that creating an attribute on the fly is enough for our use cases (see here):

def before_pipeline_run(
        self, run_params: Dict[str, Any], pipeline: Pipeline, catalog: DataCatalog
    ) -> None:
        """Hook to be invoked before a pipeline runs.

        Args:
            run_params: The params used to run the pipeline.
                Should be identical to the data logged by Journal with the following schema::

                   {
                     "run_id": str
                     "project_path": str,
                     "env": str,
                     "kedro_version": str,
                     "tags": Optional[List[str]],
                     "from_nodes": Optional[List[str]],
                     "to_nodes": Optional[List[str]],
                     "node_names": Optional[List[str]],
                     "from_inputs": Optional[List[str]],
                     "load_versions": Optional[List[str]],
                     "pipeline_name": str,
                     "extra_params": Optional[Dict[str, Any]]
                   }

            pipeline: The ``Pipeline`` that will be run.
            catalog: The ``DataCatalog`` to be used during the run.
        """
        self.context = load_context(run_params["project_path"])
        self.env = run_params["env"]  # is it necessary?
        self.extra_params = run_params["extra_params"]  # is it necessary?
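
For illustration, a later hook method of the same class could then read these attributes directly (a sketch; the method body is only illustrative):

def after_pipeline_run(
    self, run_params: Dict[str, Any], pipeline: Pipeline, catalog: DataCatalog
) -> None:
    # the attributes stored by before_pipeline_run above are still available here
    mlflow_conf = self.context.config_loader.get("mlflow*", "mlflow*/**")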

As shown above, the information will be accessible later in the hook. For our use cases, it is even simpler:

  1. Remove __init__ in MlflowNodeHook: https://github.com/Galileo-Galilei/kedro-mlflow/blob/e0afaf44ed045fc3ef205d6065307d897670a428/kedro_mlflow/framework/hooks/node_hook.py#L11-L18
  2. Load conf with the config loader at the beginning here:

https://github.com/Galileo-Galilei/kedro-mlflow/blob/e0afaf44ed045fc3ef205d6065307d897670a428/kedro_mlflow/framework/hooks/node_hook.py#L21-L40

something like this (a fuller sketch follows after the list):

config = KedroMlflowConfig.from_dict(self.context.config_loader.get("mlflow*", "mlflow*/**"))
...
  3. Do the same change in MlflowPipelineHook:

https://github.com/Galileo-Galilei/kedro-mlflow/blob/e0afaf44ed045fc3ef205d6065307d897670a428/kedro_mlflow/framework/hooks/pipeline_hook.py#L65-L67
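
Putting these steps together, the change to MlflowNodeHook could look roughly like this (just a sketch: the config_loader patterns, the KedroMlflowConfig import path, and the exact from_dict call would need to be checked):

from typing import Any, Dict

from kedro.framework.context import load_context
from kedro.framework.hooks import hook_impl

from kedro_mlflow.framework.context.config import KedroMlflowConfig  # assumed import path


class MlflowNodeHook:
    @hook_impl
    def before_pipeline_run(self, run_params: Dict[str, Any], pipeline, catalog) -> None:
        # stash the context so that node-level hooks can reach the config loader
        self.context = load_context(run_params["project_path"])

    @hook_impl
    def before_node_run(self, node, catalog, inputs, is_async, run_id) -> None:
        # read mlflow.yml through the project's config loader instead of
        # receiving the values as __init__ arguments
        config = KedroMlflowConfig.from_dict(
            self.context.config_loader.get("mlflow*", "mlflow*/**")
        )
        # ... then use `config` to set up mlflow (tracking uri, params logging, etc.)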

And it should be OK. The reason why I prefer solution 1 over solution 2 is that the first one is a non-breaking change, while the second modifies the hook arguments and is consequently a breaking change. Since neither of them is the target solution, I think we can live with it for a few months. Does that sound good to you?