IDSIA / sacred

Sacred is a tool to help you configure, organize, log and reproduce experiments developed at IDSIA.
MIT License

info.json is not saved #830

Open Sud0x67 opened 3 years ago

Sud0x67 commented 3 years ago

Hi, I use Sacred for my AI experiments and it helps me a lot. Recently, though, I ran into a problem. I use the info dict to save some results of my experiments, and usually it works well. But sometimes info.json is not saved, or only half of it is saved. Is there any solution? I use Sacred 0.8.2 on Python 3.7.

thequilo commented 3 years ago

Hi, I assume you use the FileStorageObserver. The info dict is saved only on heartbeat events, not on interrupted, failed, or completed events. The heartbeat should normally be stopped cleanly so that all data is written, but your report suggests that this is not always the case. Does this happen only for failed or interrupted experiments, or also for experiments that finished correctly?

Sud0x67 commented 3 years ago

Yes, I use the FileStorageObserver. This happens for experiments that finished correctly, sometimes but not always. The other files, including cout.txt, config.json, and run.json, are saved correctly; only info.json is affected.

thequilo commented 3 years ago

Do you have a minimal example that reproduces this issue? It seems to work for me.

The heartbeat events are processed in a background thread. It could be that this thread dies, for some reason, before it can perform the final write.
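
To make the failure mode concrete, here is a minimal standard-library sketch of a periodic background writer that performs one last write after it is asked to stop. It is an illustration only, not Sacred's implementation: the `Heartbeat` class, `run_demo`, and the 0.05 s interval are hypothetical.

```python
import threading


class Heartbeat(threading.Thread):
    """Hypothetical periodic writer: calls `write` every `interval` seconds."""

    def __init__(self, write, interval):
        super().__init__(daemon=True)
        self.write = write
        self.interval = interval
        self.stop_event = threading.Event()

    def run(self):
        # Event.wait returns False on timeout and True once the event is set.
        while not self.stop_event.wait(self.interval):
            self.write()
        # Final write after the stop request. This is the write that is lost
        # if the thread dies or the process exits before it completes.
        self.write()


def run_demo():
    """Run a short 'experiment' and return every snapshot the thread took."""
    snapshots = []
    info = {"step": 0}
    beat = Heartbeat(lambda: snapshots.append(dict(info)), interval=0.05)
    beat.start()
    info["step"] = 1          # the experiment updates info while the thread runs
    beat.stop_event.set()     # request the stop, which triggers the final write
    beat.join()               # wait until that final write has completed
    return snapshots


if __name__ == "__main__":
    print(run_demo()[-1])
```

In this sketch the last snapshot contains the latest state only because the main thread waits for the writer to finish. If the writer were cut off before its final call, the last update would never be recorded.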

Sud0x67 commented 3 years ago

Thanks for your reply. I'm sorry that I can't provide a minimal example, because I am working on a complex MARL project. My project is based on this repo, and the command python3 src/main.py --config=qmix --env-config=sc2 with env_args.map_name=3m can reproduce the issue. However, it is not easy to understand the code and reproduce the issue. I will comment here if I find out more. Thanks for your help!

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

vnmabus commented 2 years ago

I have the same issue (Sacred 0.8.2). Is the info dict not saved on completion? That sounds like a bug.

thequilo commented 2 years ago

The info dict is not saved on completion. It is not passed to the completed_event of the observer. I don't know exactly why this is the case. I guess the idea was that if the heartbeat event is executed correctly, there is no need to save the info dict in the completion event, because it does not change between the last heartbeat and the completion. But if the heartbeat fails, this assumption no longer holds.

@vnmabus do you have a minimal example to reproduce the issue? Or does it only appear in larger experiments?

vnmabus commented 2 years ago

So far it has happened only a few times, in medium to large experiments on the cluster. I have added a sleep(11) call as a workaround for now, but that is not ideal, and I still have to relaunch the failed experiments.

vnmabus commented 1 year ago

This should be saved on completion. I have lost countless hours of human and computing time relaunching half-completed experiments because of this.

thequilo commented 1 year ago

That's really unfortunate. Do you have extremely large data in your info.json? Maybe the heartbeat thread gets killed if its write takes longer than processing the completed event. In that case, we could make the main thread wait longer for the background heartbeat thread. There currently is a timeout of 2s in https://github.com/IDSIA/sacred/blob/17c530660d5b405af0f5c286b1a93f3d8911d026/sacred/run.py#L288. Increasing or removing this could solve the issue.

Saving this information on completed is not as easy as it sounds because it is a breaking change and could create a race condition with the background thread (right?). But it could still be better than half-saved files.
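
On the half-saved files specifically: a common general technique is to write to a temporary file and then rename it over the target, so a reader sees either the old file or the new one, never a partial one. The sketch below shows that technique with the standard library only. It makes no claim about how Sacred writes info.json, and `write_json_atomically` is a hypothetical helper name.

```python
import json
import os
import tempfile


def write_json_atomically(obj, path):
    """Write `obj` as JSON so that `path` is never left half-written."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())    # push the data to disk before renaming
        os.replace(tmp_path, path)  # swap the complete file into place
    except BaseException:
        os.remove(tmp_path)         # do not leave a stray temp file behind
        raise


if __name__ == "__main__":
    write_json_atomically({"train_scores": [0.1, 0.2]}, "info_demo.json")
```

With two writers, a scheme like this would still be last-writer-wins, so it does not remove the race itself. It only ensures that an interrupted write leaves the previous complete file in place.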

vnmabus commented 1 year ago

Yes, I have large data in info (I store all of train and test scores and times).

My proposal was to join the heartbeat thread. I was not aware that this was done using a timeout. What is the reason for that? Is it possible for the heartbeat to never stop?
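
For illustration, the difference between a timed join and a plain join can be sketched with the standard library alone. This is not Sacred's code: `slow_final_write` and `stop_heartbeat` are hypothetical names, and the 0.3 s write and 0.05 s timeout are made-up stand-ins for a large info.json write and the 2 s timeout mentioned above.

```python
import threading
import time


def slow_final_write(state, duration):
    """Stand-in for a final heartbeat write that takes `duration` seconds."""
    time.sleep(duration)
    state["written"] = True


def stop_heartbeat(join_timeout):
    """Start the writer, wait for it as a main thread would, and report
    whether the write had finished when the wait returned."""
    state = {"written": False}
    writer = threading.Thread(
        target=slow_final_write, args=(state, 0.3), daemon=True
    )
    writer.start()
    writer.join(timeout=join_timeout)  # None means wait without a limit
    finished = state["written"]
    writer.join()  # let the demo thread finish so nothing is left running
    return finished


if __name__ == "__main__":
    print("timed join finished the write:", stop_heartbeat(0.05))
    print("plain join finished the write:", stop_heartbeat(None))
```

When the timeout is shorter than the write, the timed join returns while the write is still in progress. A main thread that then goes on to exit could terminate a daemon writer mid-write. The plain join returns only after the write is done, at the cost of blocking indefinitely if the writer never finishes.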

thequilo commented 1 year ago

I don't know the reason. It was introduced here: https://github.com/IDSIA/sacred/commit/95234cdf41b4be1ec2810980decf3ad76aaeb187 which seems to be addressing this issue: https://github.com/IDSIA/sacred/issues/273.

I believe there is no reason for the FileStorageObserver to hang on heartbeat, but the MongoObserver seems to have issues where it sometimes doesn't exit. I only use the FileStorageObserver, so I can't confirm. Even in that case, I would argue that a hanging experiment script is better than broken files. At least then it is obvious that something went wrong.