Netflix / metaflow

Open Source Platform for developing, scaling and deploying serious ML, AI, and data science systems
https://metaflow.org
Apache License 2.0

Support for AWS Step Functions #2

Closed romain-intel closed 4 years ago

romain-intel commented 4 years ago

Metaflow on AWS currently requires a human in the loop to execute flows and cannot be scheduled automatically. Metaflow could be made to work with AWS Step Functions so that the orchestration of Metaflow steps is done by AWS.

gonzalodiaz commented 4 years ago

I just arrived at Metaflow and I'm thrilled to give it a try at my company. Currently we are using Airflow on Kubernetes to schedule workflows. I would like to hear whether you have looked into scheduling Metaflow on Airflow, and whether it would be possible to use K8s as the infrastructure to run the steps. Thanks!

savingoyal commented 4 years ago

Hi @gonzalodiaz, thanks for giving Metaflow a try. We follow a plugin-based architecture, so it is indeed possible to schedule flows on Airflow and use K8s as the compute substrate. This is something we would like to offer in the near future. We welcome feature requests, so please open one.

thundergolfer commented 4 years ago

Is your team familiar with https://github.com/argoproj/argo? In theory you could compile your Flows down into Argo's workflow spec format (JSON/YAML) and then Argo could take care of execution.
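To make the "compile down" idea concrete, here is a rough sketch of turning a step graph into an Argo Workflow manifest with a DAG template. It is only an illustration under assumptions: the `flow_to_argo_workflow` helper, the `run-step` command, and the image name are hypothetical, and nothing here reflects how Metaflow actually represents or exports a flow.

```python
import json


def flow_to_argo_workflow(name, parents, image):
    """Build an Argo Workflow manifest (as a dict) from a step graph.

    `parents` maps each step name to the list of steps it depends on.
    The container command is a hypothetical placeholder, not a real CLI.
    """
    tasks = []
    templates = []
    for step, deps in parents.items():
        task = {"name": step, "template": step}
        if deps:
            # Argo DAG tasks list their upstream tasks under "dependencies".
            task["dependencies"] = list(deps)
        tasks.append(task)
        templates.append({
            "name": step,
            "container": {"image": image, "command": ["run-step", step]},
        })
    # One extra template holds the DAG itself and serves as the entrypoint.
    templates.append({"name": "flow-dag", "dag": {"tasks": tasks}})
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": name + "-"},
        "spec": {"entrypoint": "flow-dag", "templates": templates},
    }


graph = {"start": [], "a": ["start"], "b": ["start"], "join": ["a", "b"], "end": ["join"]}
workflow = flow_to_argo_workflow("example-flow", graph, image="example/image:latest")
print(json.dumps(workflow, indent=2))
```

The JSON output could then be submitted to Argo, which would handle ordering and execution of the steps.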

savingoyal commented 4 years ago

Thanks for the link. Yes I am familiar with argo but haven’t looked at it in depth.

impredicative commented 4 years ago

Metaflow on AWS currently requires a human in the loop to execute flows and cannot be scheduled automatically. Metaflow could be made to work with AWS Step Functions so that the orchestration of Metaflow steps is done by AWS.

Given that Metaflow is evidently seriously lacking a scheduler, either Step Functions or better yet an open source component of Metaflow itself can probably fill in the gap. Without a scheduler, indeed it seems to be an incomplete solution.

NukaCody commented 4 years ago

For Step Functions integration, is it possible to incorporate https://github.com/aws/aws-step-functions-data-science-sdk-python?

impredicative commented 4 years ago

For Step Functions integration, is it possible to incorporate https://github.com/aws/aws-step-functions-data-science-sdk-python?

As an observer, I don't see any need for AWS Step Functions integration since Metaflow should be able to manage workflow steps directly. Why pay extra for Step Functions?

hgahlot commented 4 years ago

AWS Step Functions need to be scheduled through CloudWatch. They do not have an in-built scheduler. However, CloudWatch has a direct integration with Step Functions. It might be better to look into how CloudWatch + Lambda may be leveraged to act as a scheduler for Metaflow, separate from Step Functions.

Metaflow could be made to work with AWS Step Functions so that the orchestration of Metaflow steps is done by AWS.

Metaflow is an orchestrator itself, so I think the only missing piece is the scheduling aspect. Using Step Functions as an orchestrator just because we need it to schedule Metaflow workflows is overkill, IMO.
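For reference, the CloudWatch side of this could look roughly like the sketch below. It only builds the request parameters for the CloudWatch Events `put_rule` and `put_targets` calls and does not contact AWS. The helper names, rule name, and ARNs are made-up placeholders, and the IAM role is assumed to allow starting the target state machine.

```python
import json


def schedule_rule_params(rule_name, schedule_expression):
    """Parameters for a CloudWatch Events rule that fires on a schedule."""
    return {
        "Name": rule_name,
        "ScheduleExpression": schedule_expression,  # e.g. "rate(1 day)"
        "State": "ENABLED",
    }


def state_machine_target_params(rule_name, state_machine_arn, role_arn, run_input):
    """Parameters that point the rule at a Step Functions state machine."""
    return {
        "Rule": rule_name,
        "Targets": [{
            "Id": "start-state-machine",
            "Arn": state_machine_arn,
            "RoleArn": role_arn,
            # The execution input is passed as a JSON string.
            "Input": json.dumps(run_input),
        }],
    }


# Placeholder identifiers, for illustration only.
rule = schedule_rule_params("example-nightly-flow", "rate(1 day)")
target = state_machine_target_params(
    "example-nightly-flow",
    "arn:aws:states:us-east-1:123456789012:stateMachine:example-flow",
    "arn:aws:iam::123456789012:role/example-events-role",
    {"parameters": {}},
)

# With boto3 installed and credentials configured, these would be sent as:
#   events = boto3.client("events")
#   events.put_rule(**rule)
#   events.put_targets(**target)
print(json.dumps({"rule": rule, "target": target}, indent=2))
```

Swapping the target ARN for a Lambda function ARN (and dropping `RoleArn` in favor of a Lambda resource policy) would give the CloudWatch + Lambda variant instead.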

impredicative commented 4 years ago

Metaflow could in principle then manage those CloudWatch Events and Lambdas too, using a single combined job+schedule definition. This would be the simplest scheduler integration, assuming a scheduler cannot be built into Metaflow directly. I would still prefer integrating an open source scheduler into Metaflow, though, to avoid relying on CloudWatch Events and Lambdas.

steveash commented 4 years ago

Also maybe check out Glue Workflows, which are a little more DAG-like than the Step Functions model: https://docs.aws.amazon.com/glue/latest/dg/workflows_overview.html

kylejmcintyre commented 4 years ago

I'm confused about this statement:

Netflix uses an internal DAG scheduler to orchestrate most modeling and ETL pipelines in production. Metaflow flows can be deployed to the production scheduler with a single command. A similar integration could be provided e.g. for AWS Step Functions (Github issue)

Is this saying that there isn't yet a way to schedule a flow to run in production, or that there's no DAG scheduler/executor to actually run a flow in a production setting? Thank you.

savingoyal commented 4 years ago

@kylejmcintyre Internally we export metaflow flows to a DAG scheduler. A similar integration with AWS Step Functions is in the works.

kylejmcintyre commented 4 years ago

Thanks for your reply, @savingoyal. Is what you do internally available to me as an open-source consumer? If so, is it currently considered a hidden/internal implementation detail that runs on my provisioned compute resources? Or is executing flows in a production setting not yet supported for folks outside of Netflix?

savingoyal commented 4 years ago

@kylejmcintyre Given that the DAG scheduler (Meson) we use internally is not an open-source project, we are working on an equivalent integration with AWS Step Functions to offer similar capabilities in metaflow OSS as we speak.

impredicative commented 4 years ago

Why is AWS Step Functions even needed then? It's just going to increase the bill by doing something that open source software can do for free. The real hardware which is needed is provided by AWS Batch/EC2/ECS and similar services.

savingoyal commented 4 years ago

@impredicative There are not very many production-grade DAG schedulers (no SPOF, HA, scalable) with good adoption in the open-source community. AWS Step Functions offers the guarantees that we seek from a production-grade scheduler and our integration can serve as a reference implementation for integrations with other schedulers.

joe153 commented 4 years ago

I am interested in using this project but the obvious blocker is the scheduler. @savingoyal: do you have a rough timeline when it could be available?

impredicative commented 4 years ago

@impredicative There are not very many production-grade DAG schedulers (no SPOF, HA, scalable) with good adoption in the open-source community. AWS Step Functions offers the guarantees that we seek from a production-grade scheduler and our integration can serve as a reference implementation for integrations with other schedulers.

As has been noted in this issue before, Step Functions uses CloudWatch Events for scheduling. Has this changed? If not, why is Step Functions being referred to as a scheduler? Does Metaflow then really still need Step Functions integration, or is it CloudWatch Events integration that it needs?

savingoyal commented 4 years ago

@impredicative AWS Step Functions is a scheduler as it schedules tasks on AWS Batch. The state machine by itself can be triggered by using CloudWatch Events. Metaflow is not meant to be a replacement for a production-grade scheduler and through our integrations, we advocate that users publish their production workflows onto a production scheduler.
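As a rough illustration of what "schedules tasks on AWS Batch" means, the sketch below builds an Amazon States Language definition in which each step of a linear flow is a Task state that submits a Batch job and waits for it to finish. This is a simplified sketch and not the actual Metaflow integration: the `linear_flow_to_state_machine` helper, the queue name, and the job definition name are hypothetical, and real flows with branches or foreach would need more than a linear chain.

```python
import json


def linear_flow_to_state_machine(steps, job_queue, job_definition):
    """Build a state machine definition that runs each step as an AWS Batch job, in order."""
    if not steps:
        raise ValueError("at least one step is required")
    states = {}
    for index, step in enumerate(steps):
        state = {
            "Type": "Task",
            # Step Functions service integration: submit a Batch job and
            # wait for it to complete before moving on.
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": step,
                "JobQueue": job_queue,
                "JobDefinition": job_definition,
            },
        }
        if index + 1 < len(steps):
            state["Next"] = steps[index + 1]
        else:
            state["End"] = True
        states[step] = state
    return {"StartAt": steps[0], "States": states}


# Placeholder queue and job definition names, for illustration only.
definition = linear_flow_to_state_machine(
    ["start", "train", "end"],
    job_queue="example-queue",
    job_definition="example-job-definition",
)
print(json.dumps(definition, indent=2))
```

The resulting state machine can be started on demand or triggered on a schedule by a CloudWatch Events rule, which is the split between scheduling tasks and triggering the workflow described above.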

savingoyal commented 4 years ago

This feature is now generally available. The launch blog post is here and the documentation is here.