feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0

Support SparkOfflineStore to read from FileSource #4661

Open Vishnu-Rangiah opened 1 month ago

Vishnu-Rangiah commented 1 month ago

Is your feature request related to a problem? Please describe.
Provide support for using different data sources (SQL table, BigQuery table, Parquet table) through a single offline transformation engine, most likely by joining tables from multiple data sources with the Spark offline store.

Describe the solution you'd like
Allow the user to configure a Spark cluster with the appropriate connectors to pull from different sources and run the point-in-time (PIT) join across them.

project: my_project
registry: data/registry.db
provider: local
offline_store:
    type: spark
    spark_conf:
        spark.master: "local[*]"
        spark.ui.enabled: "false"
        spark.eventLog.enabled: "false"
        spark.sql.catalogImplementation: "hive"
        spark.sql.parser.quotedRegexColumnNames: "true"
        spark.sql.session.timeZone: "UTC"
        spark.sql.execution.arrow.fallback.enabled: "true"
        spark.sql.execution.arrow.pyspark.enabled: "true"
        spark.jars.packages: "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.41.0"
        temporaryGcsBucket: "some-gcs-bucket"
        viewsEnabled: "true"
        materializationDataset: "some-bq-dataset"
online_store:
    path: data/online_store.db

Vishnu-Rangiah commented 1 month ago

@franciscojavierarceo

tokoko commented 1 month ago

Allow me to be a little pedantic here :laughing:. This is already possible, in the sense that nothing in core Feast prevents pulling data from multiple types of data sources. You are limited to a single offline store as the engine, but there's no restriction on the types of data sources.

It's also true that none of the offline stores currently supports more than one data source type. I'm just pointing out that this doesn't require a change in core Feast, only in individual offline stores. For example, we need to teach the Spark offline store how to read FileSource in addition to SparkSource, and so on.

Vishnu-Rangiah commented 1 month ago

That makes sense. As you mentioned, the current Spark offline store is limited to SparkSource only, as shown here: https://github.com/feast-dev/feast/blob/master/sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py#L130

Removing this requirement and supporting the other source types would require changes to how the sources are read, namely where spark_session.read.format(feature_view.batch_source.file_format) is used, and to the PIT join SQL.
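
To make the idea concrete, here is a rough sketch of how the table expression fed into the PIT join SQL could be chosen per source type instead of rejecting everything that is not a SparkSource. It is not the actual Feast implementation: `FileSourceStub`, `SparkSourceStub` and `table_expression` are hypothetical names made up for illustration, and the stubs only carry the few fields the sketch needs. The one real Spark feature it leans on is that Spark SQL can query files directly, e.g. SELECT * FROM parquet.`/some/path`.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class FileSourceStub:
    """Hypothetical stand-in for a Feast FileSource (only the fields used here)."""
    path: str
    file_format: str = "parquet"


@dataclass
class SparkSourceStub:
    """Hypothetical stand-in for a Feast SparkSource backed by a table or a query."""
    table: Optional[str] = None
    query: Optional[str] = None


def table_expression(source: Union[FileSourceStub, SparkSourceStub]) -> str:
    """Return a Spark SQL FROM-clause fragment for the given batch source."""
    if isinstance(source, SparkSourceStub):
        if source.table:
            # A catalog table can be referenced by name.
            return source.table
        if source.query:
            # A query is wrapped so it can be used as a subquery.
            return f"({source.query})"
        raise ValueError("SparkSourceStub needs either a table or a query")
    if isinstance(source, FileSourceStub):
        if "`" in source.path:
            # Keep the backtick-quoted path well formed.
            raise ValueError("backticks are not allowed in the file path")
        # Spark SQL can read files directly: <format>.`<path>`
        return f"{source.file_format}.`{source.path}`"
    raise TypeError(f"Unsupported batch source type: {type(source).__name__}")
```

With something like this, the PIT join template would only ever see a FROM-clause fragment, so the join SQL itself might not need to know which kind of source it is reading. Whether a temp view created through spark_session.read.format(...).load(...) is a better fit than querying the file in place is an open design choice.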

tokoko commented 1 month ago

yup, that's right. I think it'd be better to open separate issues for specific OfflineStore/DataSource combinations that you require, for example: "Support SparkOfflineStore to read from FileSource".

Vishnu-Rangiah commented 4 weeks ago

Renamed the issue.

I think this should be a simple implementation. Looking to eventually ... "Support SparkOfflineStore to read from BigQuerySource"