Open xserban opened 5 years ago
+1
If you use NeptuneObserver, the artifact file with the same name gets overwritten by default. Check out the readme.md or the neptune-contrib docs if interested.
Also wondering if there is a way to overwrite observed data with the mongodb observer rather than having multiple versions.
I tried looking into Neptune, but it looks like it is only available as an external service, which is a big no-no for me at the moment (HIPAA rules). I am interested in storing checkpointing data, but I only need the last set, not everything along the way (a lot of data that grows very quickly), and I would like to have it attached to the data (the only reason to migrate to sacred, in an attempt to avoid building my own database setup). It looks like the file observer overwrites the data, but it's very hard to browse the outputs (I need to go through them one by one to find the correct one).
Thanks
I started looking into the code to see what it would take to change things. It looks like the function of interest is sacred/observers/mongo.py:artifact_event(266). It should be easy enough to check self.run_entry["artifacts"] to see whether an entry for the given file name exists and replace it. The open question is the use of self.fs.put (where self.fs is an instance of gridfs.GridFS), which I'm guessing generates a new file each time, so it won't help to delete the entry without deleting the file first.
I think I found a solution (it doesn't give a choice right now, it just replaces files with the same name - matches ). No new files show up. The DB keeps growing, but very slowly, and from what I gathered this seems to be related to the way MongoDB works (it doesn't shrink in size by default) rather than to extra files (unrelated: deleting experiments through omniboard doesn't free up space either). I can create a pull request if there is interest in this change.
diff --git a/sacred/observers/mongo.py b/sacred/observers/mongo.py
index c4dc73d..b95ba5c 100644
--- a/sacred/observers/mongo.py
+++ b/sacred/observers/mongo.py
@@ -298,6 +298,13 @@ class MongoObserver(RunObserver):
             self.save()
 
     def artifact_event(self, name, filename, metadata=None, content_type=None):
+        # Check first if a file with the same name is already in the database
+        # Delete it and remove it from the artifacts
+        for i, e in enumerate(self.run_entry["artifacts"]):
+            if name == e['name']:
+                self.fs.delete(e['file_id'])
+                self.run_entry["artifacts"].pop(i)
+
         with open(filename, "rb") as f:
             run_id = self.run_entry["_id"]
             db_filename = "artifact://{}/{}/{}".format(self.runs.name, run_id, name)
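For anyone who wants to try the idea outside of sacred, here is a minimal standalone sketch of the replace-on-same-name logic from the patch above. It is an illustration under stated assumptions, not the observer's actual code: `InMemoryFS` and `replace_artifact` are hypothetical names, and the stand-in store only mimics the `put()` and `delete()` calls that the patch uses on `self.fs`. Unlike the patch, it rebuilds the artifact list instead of popping while enumerating, since popping during iteration can skip the next entry when two consecutive entries share a name.

```python
import itertools


class InMemoryFS:
    """Hypothetical stand-in for gridfs.GridFS; only put() and delete() are mimicked."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.files = {}  # file_id -> (filename, data)

    def put(self, data, filename=None):
        # Like GridFS, every put() creates a new stored file with a new id.
        file_id = next(self._ids)
        self.files[file_id] = (filename, data)
        return file_id

    def delete(self, file_id):
        self.files.pop(file_id, None)


def replace_artifact(run_entry, fs, name, data):
    """Store data under name, dropping any earlier artifact with the same name."""
    kept = []
    for entry in run_entry["artifacts"]:
        if entry["name"] == name:
            # Delete the stored file first, otherwise it would be orphaned.
            fs.delete(entry["file_id"])
        else:
            kept.append(entry)
    file_id = fs.put(data, filename=name)
    kept.append({"name": name, "file_id": file_id})
    run_entry["artifacts"] = kept
    return file_id


# Usage: the second checkpoint replaces the first instead of accumulating.
fs = InMemoryFS()
run_entry = {"artifacts": []}
replace_artifact(run_entry, fs, "checkpoint.pt", b"epoch 1")
replace_artifact(run_entry, fs, "log.txt", b"hello")
replace_artifact(run_entry, fs, "checkpoint.pt", b"epoch 2")
```

After the three calls, both the artifact list and the store hold two entries, and the only stored checkpoint is the latest one.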
@laughingrice
I see you found a solution, that is good.
Just wanted to say that regarding:
> I tried looking into Neptune, it looks like it is only available as an external service, which is a big no-no for me at the moment (HIPAA rules).
You can have Neptune deployed on-premise to be compliant with those. We just send you a VM + instructions to test it out (and instructions for scalable deployment on a Kubernetes Cluster if you want after).
@jakubczakon
> You can have Neptune deployed on-premise to be compliant with those. We just send you a VM + instructions to test it out (and instructions for scalable deployment on a Kubernetes Cluster if you want after).
Thank you. I did not see a message through GitHub. I did get an email from Kamil at Neptune about setting up an academic account, but it didn't mention on-premise deployment.
We have recently found this feature quite useful, and I would love to see it upstreamed; see PR #925. Sorry for resurrecting an old issue. Happy to update the PR to address any issues or requests.
Hi, is it possible to update an artifact when we add an artifact with the same name several times? I am currently using the mongodb observer, and when I add the same artifact twice it is saved twice.
I would like to store intermediate results but keep only the final versions, not the versions in between.
Thanks, cheers!