feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0

Add AWS native support for Feast #367

Closed Jeffwan closed 3 years ago

Jeffwan commented 4 years ago

Is your feature request related to a problem? Please describe. Data ingestion, feature queries, and some other components rely on GCP services like BigQuery and Dataflow. To make Feast work on AWS, we'd like to add support for AWS services like Athena, Glue, or EMR.

Currently, AWS doesn't have a managed Apache Beam service, but there are many alternatives, such as Beam with the Spark or Flink runner.

Describe the solution you'd like

  1. We need to make the interfaces generic enough to support different cloud providers.
  2. Add support for AWS services to Feast.
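
As a rough illustration of point 1, a provider-neutral store interface could look like the sketch below. Every name here (`BatchStore`, `register_store`, `get_store`, the `"bigquery"` and `"athena"` keys) is hypothetical and not part of Feast's API; the only backend-specific detail shown is identifier quoting in a simple `SELECT`.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class BatchStore(ABC):
    """Hypothetical provider-neutral interface for a historical (batch) store."""

    @abstractmethod
    def build_query(self, table: str, columns: List[str]) -> str:
        """Return the SQL text this backend would run to read feature columns."""


# Registry mapping a store type name to its class; new providers register here.
_STORES: Dict[str, Callable[[], BatchStore]] = {}


def register_store(name: str):
    """Class decorator that adds a store implementation to the registry."""
    def decorator(cls):
        _STORES[name] = cls
        return cls
    return decorator


@register_store("bigquery")
class BigQueryStore(BatchStore):
    def build_query(self, table, columns):
        # BigQuery quotes table identifiers with backticks.
        return f"SELECT {', '.join(columns)} FROM `{table}`"


@register_store("athena")
class AthenaStore(BatchStore):
    def build_query(self, table, columns):
        # Athena queries quote identifiers with double quotes.
        return f'SELECT {", ".join(columns)} FROM "{table}"'


def get_store(name: str) -> BatchStore:
    """Look up a store by its configured type name."""
    try:
        return _STORES[name]()
    except KeyError:
        raise ValueError(f"unknown store type: {name!r}") from None
```

With this shape, supporting another provider would mean adding one registered class, and callers would select a backend by name, for example `get_store("athena").build_query("driver_stats", ["driver_id", "trips"])`.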

Describe alternatives you've considered

Additional context We'd like to know whether Feast has already had internal discussions on this. I am from AWS and would love to take on the design and implementation work.

woop commented 4 years ago

Hi @Jeffwan,

Thank you for the interest. We definitely want to support non-GCP products. We already have @ches @smadarasmi who are planning to run Feast on-prem, which means they are implementing different store and runner types.

Other than that we've had a lot of demand for Feast to have either AWS or open source store/runner support. It's probably the most asked for functionality.

I agree with your proposal. We absolutely have to have generic interfaces that are easy to extend to various providers/technologies. We are currently close to cutting 0.4 which brings one or two important changes (project namespacing, async job management). That should land in 2 weeks.

Directly after that I would love to collaborate on an RFC for adding non-GCP technologies.

Which AWS store would you say is the higher priority to implement first? Redshift or Athena?

sbrooks-cerity commented 4 years ago

Hi @woop

The Data Science team I work with has a huge interest in Feast, but we are an AWS-only shop. After discussions with them, they don't have any hard requirements that necessitate either Athena or Redshift, so our vote would be for whichever route is the path of least resistance for integration, and we can adjust on our end to leverage Feast.

Jeffwan commented 4 years ago

@sbrooks-cerity do you have any specific requirements on where to run beam jobs?

Jeffwan commented 4 years ago

@woop Sorry for taking so long to come back to this issue. We have had some internal discussion and plan to put effort into adding it. Athena will get higher priority. The challenge we have now is that we don't have a managed Beam service. It seems that Kinesis Analytics, EMR, or EKS (with an operator runner) can each support Beam applications. We'd like to hear more feedback from community users before making the decision.

woop commented 4 years ago

Thanks for staying involved @Jeffwan!

I wonder if we should be focusing on supporting Flink or Spark (#362) instead of a managed Beam service (like Dataflow). This would also make it easier for us to support runners in our dev/testing environment.

I am also open to adding Athena support, but our focus in 0.6 will probably be to first add support for an on-prem warehouse. I will create an issue to enumerate all of the Feast extension points. Hopefully this can become the basis for discussions of how Feast can be extended and the functionality that these components must provide.

Jeffwan commented 4 years ago

> Thanks for staying involved @Jeffwan!
>
> I wonder if we should be focusing on supporting Flink or Spark (#362) instead of a managed Beam service (like Dataflow). This would also make it easier for us to support runners in our dev/testing environment.

Definitely. A managed service is fine, but we need to build an abstraction that supports other runners as well.

> I am also open to adding Athena support, but our focus in 0.6 will probably be to first add support for an on-prem warehouse. I will create an issue to enumerate all of the Feast extension points. Hopefully this can become the basis for discussions of how Feast can be extended and the functionality that these components must provide.

Same opinion here. It would be great to keep the interface open to both cloud and on-prem users so that we can extend it more easily. I will follow the issue you create and we can discuss the details there.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ches commented 4 years ago

@Jeffwan Are you at a point in requirements gathering where we can tighten the focus/scope of this issue, and/or break it out into sub-issues?

I'd like to re-title it for tracking, "AWS support" is quite a broad thing if we're talking about various storage engines, Beam runners, ensuring Helm charts are friendly for any managed Kubernetes service like EKS, etc. It's possible people have done work on some of these areas already, or can do it in parallel to someone else's focus.

Jeffwan commented 4 years ago

@ches

I agree AWS support is quite a broad story. This is more of a parent issue that tries to collect more detailed requirements for AWS support.

I think we will have a few phases for this issue.

Phase 1

Add a batch store. It seems most users are looking for Athena. I am checking the effort required to implement a new connector for Athena. Even though Beam IO doesn't have a native integration, I don't think that's a problem.

Goal: Users get a workable solution for both online and offline features.

Phase 2

Both Kafka and Redis have corresponding managed services in AWS: Amazon MSK and Amazon ElastiCache. They should work seamlessly with the current Feast. We need a tutorial for users who want to use managed services, and should probably provide CloudFormation templates to provision the AWS resources.

Ingestion currently supports only DirectRunner and DataflowRunner. AWS doesn't have a managed Beam service, so an alternative for large-scale data is needed: either use Glue or EMR to run these jobs, or use a Kubernetes-native Flink operator solution.

Goal: Users can use Feast for production-grade workloads and leverage the full capability of AWS.
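
To make the runner choice in Phase 2 concrete, here is a minimal sketch of how a job launcher might turn a configured runner name into Beam pipeline options. The helper `beam_runner_args` is hypothetical, and the flag spellings (`--runner`, `--flink_master`) follow the Beam Python SDK's conventions, which is an assumption about how an ingestion job would be launched here.

```python
from typing import List, Optional


def beam_runner_args(runner: str, flink_master: Optional[str] = None) -> List[str]:
    """Build command-line style pipeline options for the chosen Beam runner.

    `runner` is a Beam runner name such as "DirectRunner", "DataflowRunner",
    or "FlinkRunner". `flink_master` is the address of a Flink cluster and is
    only passed along when FlinkRunner is selected.
    """
    args = [f"--runner={runner}"]
    if runner == "FlinkRunner" and flink_master is not None:
        # Point the job at an existing Flink cluster, e.g. one run on
        # EMR or by a Kubernetes operator (address is an assumption).
        args.append(f"--flink_master={flink_master}")
    return args
```

For example, `beam_runner_args("FlinkRunner", "flink-jobmanager:8081")` would yield the options for submitting to a Flink cluster, while `beam_runner_args("DirectRunner")` keeps the local runner for development.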

Phase 3

to be determined.

What do you think?

dr3s commented 4 years ago

@Jeffwan I'm working on a batch store and the thing I didn't realize is the client server reliance on GCP storage for data transfer. Unlike the storage API this has not yet been made pluggable. I'm going to open a new issue and start work on this first unless someone else has a branch in progress. I was going to approach it tactically with direct support for s3:// urls. After that I'd like to follow up about a more flexible API.
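
A tactical version of the `s3://` support described above could dispatch on the URL scheme of the staging location. The sketch below uses only the standard library; `StagingLocation`, `parse_staging_location`, and the scheme-to-backend mapping are hypothetical names, not existing Feast code.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class StagingLocation(NamedTuple):
    backend: str  # which storage client would handle the transfer
    bucket: str
    key: str


# Hypothetical mapping from URL scheme to a storage backend name.
_SCHEME_TO_BACKEND = {"gs": "gcs", "s3": "s3"}


def parse_staging_location(url: str) -> StagingLocation:
    """Split a staging URL into backend, bucket, and object key prefix."""
    parsed = urlparse(url)
    backend = _SCHEME_TO_BACKEND.get(parsed.scheme)
    if backend is None:
        raise ValueError(f"unsupported staging scheme: {parsed.scheme!r}")
    if not parsed.netloc:
        raise ValueError(f"staging URL has no bucket: {url!r}")
    # The URL path starts with "/", which is not part of the object key.
    return StagingLocation(backend, parsed.netloc, parsed.path.lstrip("/"))
```

A more flexible API, as mentioned above, would presumably replace the fixed mapping with pluggable storage clients; this sketch only covers the direct scheme check.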

There is some overlap in needs between Phase 1 and 2 that may be good to draw out. Extracting from Athena PITC features will likely require Beam or some other data processing. It may be worth thinking about how to run beam on flink in Glue/EMR/kubernetes to both process data from athena and kafka.
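
Assuming "PITC" here means point-in-time-correct, the processing in question is a join that gives each training label the latest feature value known at or before the label's timestamp. The following is a small in-memory sketch of that logic with hypothetical names; a real job would run the equivalent over data read from the warehouse.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Per entity: (event_timestamp, value) pairs, sorted by timestamp.
FeatureLog = Dict[str, List[Tuple[int, float]]]


def point_in_time_lookup(log: FeatureLog, entity: str, as_of: int) -> Optional[float]:
    """Return the latest feature value observed at or before `as_of`."""
    rows = log.get(entity, [])
    timestamps = [ts for ts, _ in rows]
    # Index of the first row strictly after `as_of`.
    idx = bisect_right(timestamps, as_of)
    return rows[idx - 1][1] if idx > 0 else None


def build_training_rows(log: FeatureLog, labels: List[Tuple[str, int, int]]):
    """Attach to each (entity, timestamp, label) the feature value known at that time."""
    return [
        (entity, ts, point_in_time_lookup(log, entity, ts), label)
        for entity, ts, label in labels
    ]
```

Using only values at or before the label timestamp is what prevents future feature values from leaking into the training set.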

ches commented 4 years ago

@Jeffwan Sounds good, I understand better where you're coming from now in the overall picture and agree with viewing this issue as an epic that will have child tasks. We could put this issue and those that come "under" it in a project if that's helpful.

I suggest we create a dedicated issue to track the Athena implementation as soon as you're concretely moving on that, if it's decided, for better visibility to Feast users and contributors. Implementing a Beam IO for Athena sounds like a good move to me FWIW.

I think you're right about the biggest challenge being the analog of a Beam Runner you note in your Phase 2. This falls near to #444 (kind of a generalization of #362). I hadn't thought of that issue as posing alternatives to Beam entirely, more about how to cleanly support more choices of Beam Runners, but perhaps you will come to a conclusion that may be the best way you can proceed—if so I think discussion of such a proposal should go to #444.

Thanks for your response and interest in carrying this through.

dr3s commented 4 years ago

@ches @woop We are moving forward with implementation of some of the components necessary to run in AWS. A good first step will be to create an RFC to outline the approach. I started one here: https://docs.google.com/document/d/14ouhFlFiw2OXW5m_esoW0fI9iR7N0TfMd_szi7O2aCk

So far we have noted the following needs:

  1. Deployment to EKS - new charts or flexibility for cloud-provider in the existing ones
  2. S3 staging support - client sdk and server changes
  3. Historical storage provider - We are starting with snowflake but some HIVE-compatible provider would be nice
  4. Beam runner - we are probably going to try the Flink k8s operator
  5. Authentication - generic oauth2 oidc support not tied to GCP
  6. Updates to documentation, examples, and docker stuff

woop commented 4 years ago

Hi @dr3s,

Excited about this, thanks for kicking off the document!

  1. EKS: Chart support shouldn't be too hard. I think the existing charts might just need some slight tweaks.
  2. Correct, for the time being this can be modeled after the GCP implementation.
  3. JDBCIO seems to work fine for this use case. Hoping to push my code out soon, but it's not done yet.
  4. Flink support would be massive. DirectRunner isn't production grade so the Beam runner dependency on Dataflow is a subtle dependency that we still have on GCP.
  5. We should be able to rework the existing PR right? #504
  6. Would be appreciated.

Let me know if you need help anywhere, especially with RFC reviews or contribution. Otherwise I will see if I can get (3) and (5) ready for you.

woop commented 4 years ago

We have an RFC for Feast on AWS over here: https://docs.google.com/document/d/1snRxVb8ipWZjCiLlfkR4Oc28p7Fkv_UXjvxBFWjRBj4/edit

Feedback welcome.

rceballos98 commented 3 years ago

I see AWS support as part of 0.8 release, does this mean it is fully supported now? From the changelog it seems it is not yet fully there? https://github.com/feast-dev/feast/blob/master/CHANGELOG.md

woop commented 3 years ago

AWS support has landed. The only part that is missing is the offline store. We expect the data that you would like to serve to be in S3 in Parquet format, and we will generate training datasets from that data.

woop commented 3 years ago

Closing this issue for now. Please see https://docs.feast.dev/getting-started/install-feast/kubernetes-amazon-eks-with-terraform for more details on AWS support. Let's create individual feature requests for specific functionality that we want to add (specific databases like Dynamo for example).

davidshtian commented 3 years ago

Is the current offline store implementation on AWS S3 in Parquet format, or Hive on EMR? Thanks~