feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0

BigQueryRetrievalJob does not remove tables used for data export #3828

Open mjurkus opened 9 months ago

mjurkus commented 9 months ago

Expected Behavior

Temporary tables in BigQueryRetrievalJob should be removed after the job completes or fails.

Current Behavior

When running materialization with a batch engine and BigQuery as the offline store, a historical_datestamp_hash table is created to export data from the BQ temporary table. The data is then extracted to a GCS bucket, but the export table is always retained afterwards.

Steps to reproduce

Run a materialization job with BigQuery as the offline_store and a batch_engine such as bytewax.

Possible Solution

Add cleanup in BigQueryRetrievalJob.to_remote_storage.
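A minimal sketch of what that cleanup could look like, assuming hypothetical helper names for creating the export table and collecting the extracted GCS paths (the real method's internals differ):

```python
from google.cloud import bigquery

class BigQueryRetrievalJob:
    client: bigquery.Client  # as in the existing class

    def to_remote_storage(self) -> list[str]:
        # _create_export_table is hypothetical; it stands in for whatever
        # creates the historical_<datestamp>_<hash> table today.
        table = self._create_export_table()
        try:
            # ... run the extract job that copies the table to GCS ...
            return self._extracted_gcs_paths()  # hypothetical helper
        finally:
            # Executes on success and on failure alike, so the
            # intermediate table is never left behind.
            self.client.delete_table(table, not_found_ok=True)
```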

sudohainguyen commented 9 months ago

One practice you could apply on your side first is to set up a default table expiration for your BigQuery dataset. And do you mind creating a PR for this?
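For illustration, a default expiration can be set with the BigQuery Python client; the dataset ID and the 24-hour window below are placeholders, not Feast defaults:

```python
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my_project.feast_offline_store")  # placeholder ID

# Tables created in this dataset will be deleted 24 hours after creation.
dataset.default_table_expiration_ms = 24 * 60 * 60 * 1000
client.update_dataset(dataset, ["default_table_expiration_ms"])
```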

mjurkus commented 9 months ago

PR for which part? Modify the historical_datestamp_hash table properties to make it expire? Or implement try / finally in BigQueryRetrievalJob.to_remote_storage to remove the table once the export job completes?

Modifying the table properties causes a complication: the same BigQueryRetrievalJob.to_bigquery function, where the table is created, is also used to create saved datasets via FeatureStore.create_saved_dataset, so an expiration applied there would expire saved datasets as well.

shuchu commented 9 months ago

The "try/finally" option is probably the better fit for the current situation, since it covers the risk that "to_remote_storage" crashes after the table has been created.

sudohainguyen commented 9 months ago

try/finally sounds great to me as well

stale[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.