Open mjurkus opened 9 months ago
I think there's a practice you can apply on your side first: set up a default table expiration for your BigQuery dataset. And do you mind creating a PR for this?
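For context, the dataset-level expiration suggested here can be set with the google-cloud-bigquery client. This is a sketch, not Feast code: the function name and the duck-typed `client` parameter are hypothetical, but `get_dataset`, `default_table_expiration_ms`, and `update_dataset` are real google-cloud-bigquery APIs.

```python
def set_default_table_expiration(client, dataset_id, ttl_hours=24):
    """Give every new table in the dataset a default TTL.

    `client` is assumed to be a google.cloud.bigquery.Client; Feast itself
    does not expose this knob, so this would be run once out-of-band.
    """
    dataset = client.get_dataset(dataset_id)
    # BigQuery stores the default expiration in milliseconds.
    dataset.default_table_expiration_ms = ttl_hours * 60 * 60 * 1000
    # Send only the changed field in the update request.
    return client.update_dataset(dataset, ["default_table_expiration_ms"])
```

With this in place, any leaked `historical_*` staging table would still disappear after the TTL, even if job-level cleanup fails.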
PR for which part? Modify the historical_datestamp_hash table properties to make it expire? Or implement try/finally in BigQueryRetrievalJob.to_remote_storage once the export job is completed?
Modifying the table properties causes some complications: the same BigQueryRetrievalJob.to_bigquery function, where the table is created, is also used to create a saved dataset via FeatureStore.create_saved_dataset.
The try/finally option is probably better for the current situation, with the remaining risk that to_remote_storage can still crash after the table is created.
try/finally sounds great to me as well
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Expected Behavior
Temporary tables in BigQueryRetrievalJob should be removed after the job completes or fails.
Current Behavior
When running materialization with a batch engine against BigQuery, the historical_datestamp_hash table is created to export data from the BQ temporary table. The data is then extracted to a GCS bucket, but the table is always retained.
Steps to reproduce
Run the materialization job with BQ as the offline_store and use a batch_engine, e.g., bytewax.
Specifications
Possible Solution
Add cleanup in BigQueryRetrievalJob.to_remote_storage.
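The try/finally approach agreed on in the thread could look roughly like this. The function name, the `export_fn` decomposition, and the table ID are hypothetical; `delete_table(..., not_found_ok=True)` is the real google-cloud-bigquery call that makes cleanup idempotent.

```python
def export_with_cleanup(client, temp_table_id, export_fn):
    """Run the GCS export, then drop the intermediate table even on failure.

    `export_fn` stands in for the extract-to-GCS step performed inside
    BigQueryRetrievalJob.to_remote_storage; `client` is assumed to be a
    google.cloud.bigquery.Client.
    """
    try:
        return export_fn()
    finally:
        # not_found_ok=True means cleanup never raises if the table was
        # already removed (e.g. by a dataset-level expiration policy).
        client.delete_table(temp_table_id, not_found_ok=True)
```

The finally block runs on both success and failure, so the staging table is removed even when the export job crashes mid-way, which addresses the retention described under Current Behavior.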