Garden-AI / garden

https://garden-ai.readthedocs.io
MIT License

Bug: Pipeline can't use library versions that conflict with garden-ai dependencies #131

Closed WillEngler closed 1 year ago

WillEngler commented 1 year ago

Problem

I was trying to make a pipeline that uses MAST-ML. MAST-ML pins pyyaml == 5.4.1 in its requirements. garden-ai requires >= 6. Because we currently require garden-ai to be in the pip requirements sent to the container service, you can't include MAST-ML in a pipeline. The pyyaml versions conflict and registration fails at container build time.
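To make the failure mode concrete, here is a toy illustration of why no pyyaml release can satisfy both constraints (this is not pip's actual resolver; the version comparison is simplified to numeric tuples for the example):

```python
# Toy constraint check, simplified for illustration (pip's real resolver
# handles full PEP 440 semantics; this only compares numeric version tuples).
def parse(version):
    return tuple(int(part) for part in version.split("."))

pinned = parse("5.4.1")   # MAST-ML: pyyaml == 5.4.1
minimum = parse("6.0")    # garden-ai: pyyaml >= 6

# The only version MAST-ML accepts falls below garden-ai's floor,
# so the combined requirement set is unsatisfiable.
print("conflict:", pinned < minimum)  # conflict: True
```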

There are tweaks we could do in garden or MAST-ML to get around this particular example. But this problem is more general. It will be unsustainable to keep garden-ai's requirements limited to the lowest common denominator of users' pipeline dependencies.

Imagine we want to include some common benchmarking or evaluation code at the garden level that does tabular data operations with Pandas. If we want to use pandas 2.x we will be preventing the registration of a ton of pipelines that use older pandas versions.

Potential Approaches

In order from most kludgey to least ...

  1. Status Quo. Try to limit the dependencies of garden-ai so that this doesn't hit too many potential pipelines. Accept that users may not be able to use some libraries or some versions of libraries.
  2. Skinny Garden. Similar to the home_run approach that DLHub took. Maybe we still need something bundled in the container, but we can segment it into a separate package with a smaller number of dependencies so that conflicts are greatly reduced.
  3. No Dependencies. We include neither a full nor skinny garden package with containers. We find a way to turn pipeline files into Globus Compute functions that have no dependencies on the Garden SDK. (Maybe they still depend on a skinny MLFlow client if we can't help it.)
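For a sense of what option 3 looks like in practice, here is a hypothetical sketch (the function name and environment variable are made up for illustration): the registered function imports everything it needs inside its own body, so its serialized source carries no reference to the Garden SDK.

```python
def self_contained_inference(payload):
    # Hypothetical self-contained compute function: all imports live inside
    # the body (stdlib or baked into the container), no Garden SDK needed.
    import json, os
    # GARDEN_MODEL_NAME is a made-up variable set at container build time.
    model_name = os.environ.get("GARDEN_MODEL_NAME", "unset")
    return json.dumps({"model": model_name, "echo": payload})

import inspect
source = inspect.getsource(self_contained_inference)
print("import garden" in source)  # False: the serialized source needs no garden-ai
```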

Acceptance Criteria

Given a user specifies a set of pip or conda dependencies that can be built by the container service, when they submit a pipeline with those requirements, then the container builds.

WillEngler commented 1 year ago

This is a complex task, and we don't know exactly how we're going to tackle it yet. So I think we should treat it as a research/prototyping spike. We will call this ticket done when:

WardLT commented 1 year ago

I like the "no dependencies" ideal - completely self-contained functions are awesome. Maybe one route to achieving it is to exploit the fact that FuncX's primary serialization mechanism is to serialize source code, and write out a fake function for it to serialize. This would be kind of like writing out the "skinny Garden" library inside each container:

An idealized Garden function could be like:

def trojan_horse_for_good(*args, **kwargs):
    import mlflow
    model_name = {{filled in during publication}}  # or, read from an env variable set during container build
    model = mlflow.get_my_model()
    return model.do_my_magic(*args, **kwargs)

Publication could then amount to writing a temporary file to disk containing this function, importing that function, and registering it with FuncX (knowing that FuncX will grab the source code you wrote).
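A minimal sketch of that publication flow, with hypothetical names throughout (the generated body reuses the placeholder function above, including its made-up mlflow calls; the actual FuncX/Globus Compute registration call is omitted):

```python
import importlib.util
import inspect
import tempfile
import textwrap
from pathlib import Path

def render_function_source(model_name: str) -> str:
    """Fill the publication-time placeholder into a self-contained function."""
    return textwrap.dedent(f"""
        def trojan_horse_for_good(*args, **kwargs):
            import mlflow  # only dependency baked into the container
            model_name = {model_name!r}  # injected at publication time
            model = mlflow.get_my_model()   # placeholder call from the sketch above
            return model.do_my_magic(*args, **kwargs)
    """)

def import_rendered_function(source: str):
    """Write the source to a real file on disk and import it, so a
    source-grabbing serializer can recover it via inspect.getsource()."""
    path = Path(tempfile.mkdtemp()) / "generated_pipeline.py"
    path.write_text(source)
    spec = importlib.util.spec_from_file_location("generated_pipeline", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.trojan_horse_for_good

fn = import_rendered_function(render_function_source("my-registered-model"))
print(fn.__name__)  # trojan_horse_for_good
```

Because the function lives in a real file, `inspect.getsource(fn)` returns the fully rendered body, which is exactly what a source-based serializer would capture.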

Cons:

WillEngler commented 1 year ago

This week I've been trying something similar to what Logan suggested and hit a lot of roadblocks. Basically I want to be able to take a user's pipeline, compose the steps in it, inject the env variables we need for MLFlow auth, and send that over to Globus Compute. I have not been able to get that working.
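The composition part of that plan is straightforward on its own; a hypothetical sketch, assuming each step is a plain callable taking the previous step's output:

```python
def compose_steps(*steps):
    """Chain pipeline steps left-to-right: the output of each step
    becomes the input of the next."""
    def composed(data):
        for step in steps:
            data = step(data)
        return data
    return composed

# Hypothetical two-step pipeline
pipeline = compose_steps(lambda x: x + 1, lambda x: x * 2)
print(pipeline(3))  # (3 + 1) * 2 = 8
```

The hard part is plausibly not the composition itself but getting the resulting closure (plus injected env variables) through the serializer, since a closure over other functions is not flat source code the way the single-function sketch above is.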

Some things I learned ...

So ... I'm declaring defeat on the No Dependencies option and pivoting to Skinny Garden. Next up is to scope out the skinny garden approach.

WillEngler commented 1 year ago

Closing in favor of #158

OwenPriceSkelly commented 1 year ago

Re-opening for the rush of closing it again when #190 is merged.