PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0
15.79k stars 1.55k forks source link

Consider adding DB cleanup scripts to save space in Postgres #5813

Open marvin-robot opened 2 years ago

marvin-robot commented 2 years ago

Opened from the Prefect Public Slack Community

semnooij: Hi, wrt keeping the database size and performance good, is there a way to set some sort of retention rate policy on the database that maintains the flows, flowruns, etc?

semnooij: I'm referring to a Postgres database for 2.0

anna: you may use e.g. https://github.com/citusdata/pg_cron|pg_cron or something like that to periodically run some DBA tasks such as purging old logs, but be careful about which data you delete, e.g. purging logs should be fine, but purging flow runs you need to be careful to not break other tables that reference that flow run ID

btw I thought you are on Cloud 2.0? if so, you don't have to worry at all about DBA tasks and retention policies

semnooij: I'm running my own Postgres and gui. There are no on delete cascade foreign keys?

semnooij: Actually. Let me check :)

semnooij: Looks fine to me actually. Some depend rows are deleted and some are set to null. No doubt there are some bugs here and there, but generally structure looks ok. (Referring to bugs as at some point I ran into the issue where I couldn't reset the database due to foreign key constraints. ) :slightly_smiling_face:

anna: yup, FK constraints is exactly what I meant - if not doing it carefully you may cause issues

semnooij: I guess it somewhere on the roadmap to provide some cleanup scripts to allow proper cleanup of the database?

semnooij: No need now but wondering

anna: I can open an issue if you want

anna: <@ULVA73B9P> open "Consider adding DB cleanup scripts to save space in Postgres"

Original thread can be found here.

majikman111 commented 1 year ago

@madkinsz Any thoughts on whether or not this will be implemented? I have a large number of flows running in a self hosted environment and require a custom script running raw SQL commands against the database to drop rows, otherwise my database would be eating up hundreds of gigs of space with old log data. The new artifacts table introduced in 2.7.12 with new FK constraints caused issues with that approach, to say the least. If there was a built in or at least documented recipe for cleaning up old data myself and I'm sure other users would appreciate it.

zanieb commented 1 year ago

@majikman111 this is not currently on our roadmap. I'd review a contribution of a service that deletes old data.

roy-wenrix commented 11 months ago

we also encountering the same issue as @majikman111 described are there any updates regarding it? or I will have to write a clean-up job myself?

urix24 commented 11 months ago

I have local server Prefect version 2.10.4 with Postgres. I'm trying to clean the database using SQL queries like this:

psql -U postgres -d $db -c "\
    delete from $table1 where created < now() - interval '35 days';"

These tables are growing fast:

table1='flow_run_state'
table2='log'
table3='flow_run'
table4='task_run_state'
table5='task_run'
table6='artifact'

It looks like everything is fine with deleting rows from these tables this way. Foreign-key constraints: and Referenced by: in table properties point to tables with the cascade delete property which are also listed above.

The block_document table is also growing quickly. But It seems that with the same deletion of rows there may be problems - rows in the deployment table will be deleted in a cascade.

\d+ block_document
Foreign-key constraints:
    "fk_block__block_schema_id__block_schema" FOREIGN KEY (block_schema_id) REFERENCES block_schema(id) ON DELETE CASCADE
    "fk_block_document__block_type_id__block_type" FOREIGN KEY (block_type_id) REFERENCES block_type(id) ON DELETE CASCADE
Referenced by:
    TABLE "block_document_reference" CONSTRAINT "fk_block_document_reference__parent_block_document_id___328f" FOREIGN KEY (parent_block_document_id) REFERENCES block_document(id) ON DELETE CASCADE
    TABLE "block_document_reference" CONSTRAINT "fk_block_document_reference__reference_block_document_i_5759" FOREIGN KEY (reference_block_document_id) REFERENCES block_document(id) ON DELETE CASCADE
    TABLE "deployment" CONSTRAINT "fk_deployment__infrastructure_document_id__block_document" FOREIGN KEY (infrastructure_document_id) REFERENCES block_document(id) ON DELETE CASCADE
    TABLE "deployment" CONSTRAINT "fk_deployment__storage_document_id__block_document" FOREIGN KEY (storage_document_id) REFERENCES block_document(id) ON DELETE CASCADE
    TABLE "flow_run" CONSTRAINT "fk_flow_run__infrastructure_document_id__block_document" FOREIGN KEY (infrastructure_document_id) REFERENCES block_document(id) ON DELETE CASCADE
    TABLE "flow_run_notification_policy" CONSTRAINT "fk_flow_run_alert_policy__block_document_id__block_document" FOREIGN KEY (block_document_id) REFERENCES block_document(id) ON DELETE CASCADE

How to clean block_document table ?

ChrisPaul33 commented 2 months ago

Hey, Is there any progress on the issue?