feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0

[spark] If I have big offline data (on HDFS), how can I prepare training data using Feast? #3681

Open zwqjoy opened 1 year ago

zwqjoy commented 1 year ago

If I have big offline data (on HDFS), how can I prepare training data using Feast?

Can I write a PySpark file and submit it as a Spark task like the one below?

```shell
spark-submit \
  --master yarn \
  --queue product \
  --deploy-mode cluster \
  make_train_data_with_feast.py
```

shuchu commented 1 year ago

What is "make_train_data_with_feast.py"? To my knowledge, Feast does not store data itself; it uses third-party storage services as its offline and online stores. For your files on HDFS, you could start here: https://docs.feast.dev/reference/offline-stores/spark
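For reference, wiring Feast to Spark via that offline store looks roughly like the `feature_store.yaml` below. This is only a sketch based on the linked docs; the project name, registry path, and Spark settings are placeholders to adapt to your cluster:

```yaml
project: my_project            # placeholder project name
registry: data/registry.db     # placeholder local registry path
provider: local
offline_store:
    type: spark
    spark_conf:
        spark.master: "yarn"                      # run against the YARN cluster
        spark.sql.catalogImplementation: "hive"   # read Hive tables on HDFS
```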

zwqjoy commented 1 year ago

@shuchu I have many features saved in HDFS, and someone may want to merge several of these feature paths (say 2 or 3) to prepare training data. These features are very large.

  1. Before, I needed to write PySpark code to read and merge them.
  2. Now, can I use Feast with PySpark to read the large features into a local temporary Feast store, and then use `get_historical_features` to prepare the training data?

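For context on what `get_historical_features` produces: it performs a point-in-time join, picking for each entity row the latest feature value at or before that row's timestamp. The pure-Python sketch below only illustrates that semantics; it is not Feast's implementation, and all names and values are made up:

```python
from datetime import datetime

def point_in_time_join(entity_rows, feature_rows):
    """For each (entity_id, ts), pick the latest feature value whose
    timestamp is <= ts. Illustrative sketch only, not Feast internals."""
    out = []
    for eid, ts in entity_rows:
        candidates = [
            (f_ts, val) for f_eid, f_ts, val in feature_rows
            if f_eid == eid and f_ts <= ts
        ]
        latest = max(candidates)[1] if candidates else None
        out.append((eid, ts, latest))
    return out

entities = [("u1", datetime(2023, 6, 1))]
features = [
    ("u1", datetime(2023, 5, 1), 0.2),   # older value
    ("u1", datetime(2023, 5, 20), 0.7),  # latest value before 2023-06-01
    ("u1", datetime(2023, 6, 5), 0.9),   # after the entity timestamp: excluded
]
print(point_in_time_join(entities, features))
# -> [('u1', datetime.datetime(2023, 6, 1, 0, 0), 0.7)]
```

At HDFS scale Feast delegates this join to the configured offline store (e.g. Spark) rather than running it locally, so the heavy merge still happens on the cluster.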
satriawadhipurusa commented 1 year ago

`get_historical_features` does not respect Hive-partitioned data and does a full table scan. I saw that the generated query uses the `<` operator instead of `BETWEEN`, so for a table with many partitions this could be a bottleneck.

Have you checked it? @zwqjoy
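To illustrate the concern above: a predicate like `ts < upper` bounds the scan on only one side, so every older partition still has to be read, while `BETWEEN lower AND upper` bounds it on both sides. A toy pure-Python simulation (partition names are made up; real pruning happens in the Hive/Spark planner):

```python
# Twelve monthly partitions, named by their dt value.
partitions = [f"2023-{m:02d}" for m in range(1, 13)]

def scanned(pred):
    """Count how many partitions a predicate would touch."""
    return sum(1 for dt in partitions if pred(dt))

# "<" only bounds the upper side: all six earlier months are read.
lt_only = scanned(lambda dt: dt < "2023-07")
# BETWEEN bounds both sides: only the three-month window is read.
between = scanned(lambda dt: "2023-05" <= dt <= "2023-07")
print(lt_only, between)  # -> 6 3
```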

stale[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.