TUM-DAML / seml

SEML: Slurm Experiment Management Library

Bug: Deleting a failed experiment does not delete all saved source files in the mongodb. #95

Closed saper0 closed 10 months ago

saper0 commented 2 years ago

When a failed experiment is deleted, not all of its saved source files are removed from the mongodb collections fs.files and fs.chunks. Only the files stored when staging the experiment are deleted, not those stored when starting/running it (i.e. the actual experiment script). The only consequence of the bug is that these two collections become cluttered over time.

Expected Behavior

Deleting an experiment should delete all source files associated with it in the mongodb. This includes the source files saved during staging (listed in the experiment's mongodb entry under seml->source_files) and those saved when running the experiment (listed in the entry under experiment->sources).
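To make the expected behavior concrete, here is a minimal sketch of how the file IDs referenced by a single experiment entry could be gathered from both locations. It assumes that seml->source_files and experiment->sources each hold a list of [filename, file_id] pairs; that layout is an assumption and should be checked against a real entry. The function name `referenced_file_ids` is hypothetical, and plain strings stand in for MongoDB ObjectIds so the snippet runs without a database.

```python
from typing import Any, Dict, Hashable, Set


def referenced_file_ids(doc: Dict[str, Any]) -> Set[Hashable]:
    """Collect the GridFS file IDs that one experiment entry points to."""
    ids: Set[Hashable] = set()

    # Files uploaded during staging.
    # Assumed layout: a list of [filename, file_id] pairs.
    for _name, file_id in doc.get("seml", {}).get("source_files", []):
        ids.add(file_id)

    # Files uploaded when the experiment runs.
    # Assumed layout: a list of [filename, file_id] pairs.
    for _name, file_id in doc.get("experiment", {}).get("sources", []):
        ids.add(file_id)

    return ids


# Hypothetical entry; the strings stand in for ObjectId values.
example_entry = {
    "seml": {"source_files": [["train.py", "id_a"], ["utils.py", "id_b"]]},
    "experiment": {"sources": [["train.py", "id_c"]]},
}

ids_to_delete = referenced_file_ids(example_entry)
```

Deleting only the IDs found under seml->source_files would leave `id_c` behind, which matches the behavior reported in this issue.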

Actual Behavior

Only those entries in fs.files and fs.chunks that correspond to the source files saved during staging (listed under seml->source_files) are deleted.

Steps to Reproduce the Problem

  1. Count elements in fs.files and fs.chunks collection
  2. Add an experiment (which will fail) using seml mycollection add myconfig
  3. Run the experiment using seml mycollection start
  4. Delete the experiment using seml mycollection delete
  5. Count/inspect elements in fs.files and fs.chunks collection

Specifications

Details
- Version: 0.3.6
- Python version: 3.9.7
- Platform: Linux and Mac OS

danielzuegner commented 2 years ago

Hi, Thanks for opening this issue. I have some clarifying questions.

What do you mean by "those saved when running the experiment"? Are you referring to the source files (optionally) uploaded by sacred?

Currently, we only clean up the files that seml itself uploads. Could there be unintended side effects if we also delete source files that were added externally by another tool and that the user does not want or expect to be deleted?

saper0 commented 2 years ago

I am referring both to the source files uploaded by seml when staging the experiment and to the source files uploaded by sacred when running the experiment.

Since seml is built on sacred, and deleting an experiment in seml means deleting the corresponding mongodb entry, including all information about the sacred experiment stored in that entry, I would expect seml to also delete the source files saved by sacred. I do not see any unintended side effects of deleting these sources, as they are orphaned in the sense that no corresponding seml/sacred experiment exists in the mongodb anymore.

As an example of another project doing this: omniboard connects directly to the mongodb and displays the experiments saved by sacred (or also seml :)). If you delete a sacred experiment with omniboard, it not only deletes the mongodb entry in the collection being used but also automatically deletes the saved sources.

heborras commented 2 years ago

Just to add to this: as far as I can tell from looking at the code, the issue likely also affects artifacts that were added with sacred during the experiment. Since these artifacts can be relatively large compared to source files, the issue could bloat the database very quickly.
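As a rough illustration of how leftover artifacts could be detected and their size estimated, the sketch below compares fs.files entries against the file IDs still referenced by existing experiments. It assumes that each experiment entry stores its artifacts as a list of dicts with a `file_id` key; that layout is an assumption and should be verified against a real entry. The `length` field is the file size in bytes that GridFS records in fs.files. The function names are hypothetical, and plain strings stand in for ObjectIds so the snippet runs without a database.

```python
from typing import Any, Dict, Hashable, Iterable, List, Set, Tuple


def artifact_file_ids(doc: Dict[str, Any]) -> Set[Hashable]:
    """File IDs of the artifacts listed in one experiment entry.

    Assumed layout: doc["artifacts"] is a list of dicts with a "file_id" key.
    """
    return {artifact["file_id"] for artifact in doc.get("artifacts", [])}


def find_orphans(
    fs_files: Iterable[Dict[str, Any]],
    referenced_ids: Set[Hashable],
) -> Tuple[List[Hashable], int]:
    """Return the IDs of unreferenced fs.files entries and their total size in bytes."""
    orphans = [f for f in fs_files if f["_id"] not in referenced_ids]
    total_bytes = sum(f.get("length", 0) for f in orphans)
    return [f["_id"] for f in orphans], total_bytes


# Hypothetical data: one surviving experiment that still references artifact "a".
experiments = [{"artifacts": [{"name": "model.pt", "file_id": "a"}]}]
fs_files = [
    {"_id": "a", "length": 10},
    {"_id": "b", "length": 2500},
    {"_id": "c", "length": 40},
]

referenced: Set[Hashable] = set()
for entry in experiments:
    referenced |= artifact_file_ids(entry)

orphan_ids, orphan_bytes = find_orphans(fs_files, referenced)
```

In a real cleanup, the referenced set would also have to include every source-file ID that remaining experiments point to. Otherwise files that are still in use would be reported as orphans.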

heborras commented 2 years ago

Since the leftover artifacts were taking up large amounts of space in our database, I wrote a small script to clean up the DB. You can take a look and try it yourself here: https://gist.github.com/HenniOVP/fc2e54ea56abaf291ee8dab17b5e5f19

It appears to work as intended, but I would advise caution when using it, since deleted files are gone permanently.

danielzuegner commented 2 years ago

See this PR. It addresses both the issue of source files added by Sacred and also improves the workflow for purging orphaned files from the MongoDB (similar in spirit to the notebook linked by @HenniOVP). Feel free to comment on the PR :)

n-gao commented 2 years ago

@saper0, @HenniOVP can this issue be closed?

heborras commented 2 years ago

From my perspective you can close the issue, since the PR by @danielzuegner seems to have resolved the problem :)