IDSIA / sacred

Sacred is a tool to help you configure, organize, log and reproduce experiments developed at IDSIA.
MIT License

Update artifact if the name is the same #685

Open xserban opened 5 years ago

xserban commented 5 years ago

Hi, is it possible to overwrite an artifact when an artifact with the same name is added several times? I am currently using the MongoDB observer, and when I add the same artifact twice it is saved twice.

I would like to store intermediate results, but keep only the final version, not the versions in between.

Thanks, cheers!

Guptajakala commented 5 years ago

+1

jakubczakon commented 4 years ago

If you use NeptuneObserver the artifact file with the same name gets overwritten by default.

Check out the readme.md or neptune-contrib docs if interested.

laughingrice commented 4 years ago

Also wondering if there is a way to overwrite observed data with the mongodb observer rather than having multiple versions.

I tried looking into Neptune, but it looks like it is only available as an external service, which is a big no-no for me at the moment (HIPAA rules). I am interested in storing checkpointing data, but I only need the last set, not everything along the way (that is a lot of data and it grows very quickly), and I would like to have it attached to the data (the only reason to migrate to sacred is to avoid building my own database setup). It looks like the file observer overwrites the data, but its outputs are very hard to browse (I need to go through them one by one to find the correct one).

Thanks

laughingrice commented 4 years ago

I started looking into the code to see what it would take to change things. Looks like the function of interest is sacred/observers/mongo.py:artifact_event(266)

It should be easy enough to check self.run_entry["artifacts"] for an existing entry with the same file name and replace it. The open question is the use of

self.fs.put (where self.fs is an instance of gridfs.GridFS)

which I'm guessing generates a new file each time, so deleting the entry won't help unless the stored file is deleted first.
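To make that concern concrete, here is a minimal sketch of the delete-then-put idea. `InMemoryFS` is a hypothetical stand-in that mimics only the behavior guessed at above (each `put` stores a new file under a fresh id); it is not the real `gridfs.GridFS`. `replace_artifact` is an illustrative helper, not part of sacred, and the entry shape `{"name": ..., "file_id": ...}` is an assumption.

```python
import itertools


class InMemoryFS:
    """Hypothetical stand-in for gridfs.GridFS: every put() stores a new file."""

    def __init__(self):
        self.files = {}
        self._ids = itertools.count(1)

    def put(self, data, filename):
        file_id = next(self._ids)
        self.files[file_id] = (filename, data)
        return file_id

    def delete(self, file_id):
        self.files.pop(file_id, None)


def replace_artifact(fs, artifacts, name, data):
    """Store `data` under `name`, removing any earlier file with that name."""
    kept = []
    for entry in artifacts:
        if entry["name"] == name:
            # Delete the stored file itself, not just the list entry.
            fs.delete(entry["file_id"])
        else:
            kept.append(entry)
    kept.append({"name": name, "file_id": fs.put(data, filename=name)})
    artifacts[:] = kept  # update the caller's list in place


fs = InMemoryFS()
artifacts = []
replace_artifact(fs, artifacts, "checkpoint.pt", b"epoch 1")
replace_artifact(fs, artifacts, "checkpoint.pt", b"epoch 2")
print(len(fs.files), artifacts)
```

Building a new list instead of removing items while looping avoids skipping entries if the same name ever appears more than once.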

laughingrice commented 4 years ago

I think I found a solution (doesn't give a choice right now, just replaces files with the same name - matches ). No new files show up, the DB keeps growing, but very slowly, and from what I gathered, it seems to be related to the way mongodb works (doesn't shrink in size by default) rather than extra files (unrelated - deleting experiments through omniboard doesn't free up space either). (I can create a pull request if there is interest in this change)

diff --git a/sacred/observers/mongo.py b/sacred/observers/mongo.py
index c4dc73d..b95ba5c 100644
--- a/sacred/observers/mongo.py
+++ b/sacred/observers/mongo.py
@@ -298,6 +298,13 @@ class MongoObserver(RunObserver):
         self.save()

     def artifact_event(self, name, filename, metadata=None, content_type=None):
+        # Check first if a file with the same name is already in the database
+        # Delete it and remove it from the artifacts
+        for e in list(self.run_entry["artifacts"]):  # iterate over a copy
+            if name == e['name']:
+                self.fs.delete(e['file_id'])
+                self.run_entry["artifacts"].remove(e)
+
         with open(filename, "rb") as f:
             run_id = self.run_entry["_id"]
             db_filename = "artifact://{}/{}/{}".format(self.runs.name, run_id, name)

jakubczakon commented 4 years ago

@laughingrice

I see you found a solution, that is good.

Just wanted to say that regarding:

> I tried looking into Neptune, it looks like it is only available as an external service, which is a big no-no for me at the moment (HIPAA rules).

You can have Neptune deployed on-premise to be compliant with those. We just send you a VM + instructions to test it out (and instructions for scalable deployment on a Kubernetes Cluster if you want after).

laughingrice commented 4 years ago

@jakubczakon

> You can have Neptune deployed on-premise to be compliant with those. We just send you a VM + instructions to test it out (and instructions for scalable deployment on a Kubernetes Cluster if you want after).

Thank you. I did not see a message through github. I did get an email from Kamil at Neptune about setting up an academic account, but it didn't mention on-premise deployment.

gkanwar commented 9 months ago

We have recently found this feature to be quite useful, and I would love to see it upstreamed; see PR #925. Sorry for resurrecting an old issue. Happy to update the PR to address any issues or requests.
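For anyone who needs this behavior before it is upstreamed, one option that avoids patching sacred is a small mixin that removes same-named artifacts and then defers to the observer's own `artifact_event`. The sketch below is an assumption-laden illustration: `ReplaceArtifactMixin`, `DemoFS`, and `DemoObserver` are hypothetical names, and the demo classes only imitate the `self.fs` / `self.run_entry["artifacts"]` layout shown in the diff earlier in this thread.

```python
class ReplaceArtifactMixin:
    """Drop earlier artifacts with the same name before storing a new one.

    Assumes the host observer exposes `self.fs` with delete(file_id) and
    `self.run_entry["artifacts"]` entries with "name" and "file_id" keys.
    """

    def artifact_event(self, name, filename, metadata=None, content_type=None):
        kept = []
        for entry in self.run_entry["artifacts"]:
            if entry["name"] == name:
                self.fs.delete(entry["file_id"])  # remove the stored file too
            else:
                kept.append(entry)
        self.run_entry["artifacts"] = kept
        return super().artifact_event(name, filename, metadata, content_type)


class DemoFS:
    """Hypothetical file store: each put() creates a new file with a new id."""

    def __init__(self):
        self.files = {}
        self.next_id = 1

    def put(self, data, filename):
        file_id = self.next_id
        self.next_id += 1
        self.files[file_id] = (filename, data)
        return file_id

    def delete(self, file_id):
        self.files.pop(file_id, None)


class DemoObserver:
    """Hypothetical base observer that always appends a new artifact."""

    def __init__(self):
        self.fs = DemoFS()
        self.run_entry = {"artifacts": []}

    def artifact_event(self, name, filename, metadata=None, content_type=None):
        file_id = self.fs.put(b"contents of " + filename.encode(), filename=name)
        self.run_entry["artifacts"].append({"name": name, "file_id": file_id})


class ReplacingDemoObserver(ReplaceArtifactMixin, DemoObserver):
    pass


obs = ReplacingDemoObserver()
obs.artifact_event("model.pt", "/tmp/model_epoch1.pt")
obs.artifact_event("log.txt", "/tmp/log.txt")
obs.artifact_event("model.pt", "/tmp/model_epoch2.pt")
print(obs.run_entry["artifacts"])
```

With sacred installed, the same mixin could presumably be combined with the real observer, for example `class ReplacingMongoObserver(ReplaceArtifactMixin, MongoObserver)`, but that combination is untested here and depends on the observer keeping the attributes assumed above.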