fnproject / fn

The container native, cloud agnostic serverless platform.
http://fnproject.io
Apache License 2.0

Wishlist - make it easy to remove data from previous calls from the databases #1027

Closed chryswoods closed 5 years ago

chryswoods commented 6 years ago

I am enjoying working with Fn, especially using asynchronous functions. An issue is that there is no easy way to remove data created from old function calls. Currently I have to edit the log and output databases manually. It would be useful to have the ability to do this with Fn directly, e.g.

fn calls delete APP CALL_ID or fn logs delete APP CALL_ID

would remove all data associated with the specified call ID in the specified application. This would prevent the server from filling up with data from old calls. It would also allow me to build logic into my application that detects when a user has retrieved the output from a call, and then safely deletes that data from the server.

carimura commented 6 years ago

Hi @chryswoods thanks for the issue. Is this specifically to prevent disk full issues?

I noticed we actually removed the API call at one point per issue #481 but there's some thought in there about if/when/how it might be added back in. Any thoughts?

chryswoods commented 6 years ago

Yes, this is to prevent disk full issues. The use case is using Fn to run molecular dynamics simulations on demand as individual functions. Fn provides the serverless interface for simulations that are initiated, queried and then collected from Jupyter notebooks running in a k8s cluster. This allows the notebooks to use relatively low-powered cloud instances for the k8s cluster, with big fat nodes in the cloud or on-premise used on demand to run simulations when they are invoked as an Fn function.

The simulations are long, and so async functions are needed. The Jupyter interface will capture the CALL_ID and then use this later to query when the simulation has finished, and to then get the ID of the data cache in which the data is generated. The user will transfer the data from the data cache to storage connected to the notebook using this data ID. Once the data has been transferred, the system should delete everything related to the request. This is mainly to prevent disk-full issues, but also for data governance whereby commercial users would not want records of the simulation run to be retained on the system for any longer than was necessary to run the calculation.

Looking at #481 I understand that you don't want users to be able to delete their logs for auditing reasons. The compromise of using a cleaner to remove everything completed that is older than 7 days is not an option as it could clean out the results of an async function before the user has a chance to collect the results. Perhaps a better idea would be for the user to be able to flag a call as deleted, but this only sets a "deletable" flag on the record? Then a cleaner periodically prunes the database to remove all "deletable" records that are older than a set time (e.g. 7 or 30 days)? Essentially, deleting really moves things to trash, and only the sysadmin or the cleaner script can remove things from the trash?
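The flag-then-prune idea above could be sketched roughly as follows. This is only an illustration: the `calls` table, its columns, and the function names `create_schema`, `flag_deletable` and `prune` are hypothetical and are not Fn's actual schema or API. It uses an in-memory SQLite database and assumes that "older than a set time" refers to the record's creation time.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def create_schema(conn):
    # Hypothetical table; Fn's real call store is laid out differently.
    conn.execute(
        """CREATE TABLE calls (
               id         TEXT PRIMARY KEY,
               app        TEXT NOT NULL,
               created_at REAL NOT NULL,              -- Unix epoch seconds
               deletable  INTEGER NOT NULL DEFAULT 0  -- 1 = moved to "trash"
           )"""
    )


def flag_deletable(conn, app, call_id):
    """User-facing 'delete': only marks the record, nothing is removed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE calls SET deletable = 1 WHERE app = ? AND id = ?",
            (app, call_id),
        )
    return cur.rowcount == 1  # False if no such call exists


def prune(conn, now, retention):
    """Cleaner: remove flagged records older than the retention period."""
    cutoff = (now - retention).timestamp()
    with conn:
        cur = conn.execute(
            "DELETE FROM calls WHERE deletable = 1 AND created_at < ?",
            (cutoff,),
        )
    return cur.rowcount  # number of records actually removed
```

In this sketch a user would call `flag_deletable(conn, app, call_id)` once the results have been collected, while only the sysadmin or a scheduled job runs something like `prune(conn, datetime.now(timezone.utc), timedelta(days=30))`. Unflagged records are never touched, so uncollected async results survive regardless of age.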

rdallman commented 6 years ago

Thanks @chryswoods. I think a configurable cleaner interval is what we were thinking about, to give a better out-of-the-box experience around this. Deleting all call and log records before some set time remedies one issue surfaced in #481.

It may be easier in the short term to have DELETE /calls?before="01/01/01 01:01:01.00" to delete records. Our API at present doesn't lend itself well to this, but we could add it as an admin-style endpoint. We need to think on this one some more and are open to ideas here. Thanks for surfacing this.
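As a rough sketch of what a time-based delete like this might do behind such an endpoint: the `calls` and `logs` tables, their columns, and the function names `make_store` and `delete_calls_before` below are hypothetical, not Fn's datastore or API. The cutoff is assumed to be passed as an ISO 8601 timestamp rather than the format shown in the URL above, and a timestamp without an offset is assumed to be UTC.

```python
import sqlite3
from datetime import datetime, timezone


def make_store():
    # Hypothetical schema, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE calls (id TEXT PRIMARY KEY, created_at REAL NOT NULL)"
    )
    conn.execute("CREATE TABLE logs (call_id TEXT PRIMARY KEY, body TEXT)")
    return conn


def delete_calls_before(conn, before_iso):
    """Delete every call created before the cutoff, along with its log."""
    cutoff = datetime.fromisoformat(before_iso)
    if cutoff.tzinfo is None:
        cutoff = cutoff.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    ts = cutoff.timestamp()
    with conn:  # one transaction, so calls and logs are removed together
        # Logs go first, while the matching call rows still exist.
        conn.execute(
            "DELETE FROM logs WHERE call_id IN "
            "(SELECT id FROM calls WHERE created_at < ?)",
            (ts,),
        )
        cur = conn.execute("DELETE FROM calls WHERE created_at < ?", (ts,))
    return cur.rowcount  # number of call records removed
```

Note that a plain cutoff like this does not distinguish collected from uncollected results, which is the concern raised earlier in the thread about cleaning out async output too early.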

chryswoods commented 6 years ago

Thanks @rdallman - I will add the short-term DELETE as you suggest. In the short term I can also add a layer above Fn that marks records for later deletion for specific jobs once the user has moved the files. I'm happy to feed back how this works for my use case, so that you can see whether or not this would be useful in mainline.

rdallman commented 5 years ago

Calls are removed from the API now.