comet-ml / issue-tracking

Questions, Help, and Issues for Comet ML
https://www.comet.ml

Enable Offline ExistingExperiment #348

Closed gauchm closed 10 months ago

gauchm commented 4 years ago

I am running my scripts on a SLURM-scheduled cluster where the compute nodes don't have internet access.

The training script works just fine: I can use an OfflineExperiment. But the subsequent test script (which also doesn't have internet access) is a problem: I'd like to continue the experiment from training, which I would normally do with an ExistingExperiment. But even if I upload the OfflineExperiment after training, I can't create an ExistingExperiment without internet connection.

tl;dr: I need an "OfflineExistingExperiment".

dsblank commented 4 years ago

@gauchm Thanks for the report. I think you can continue training with another OfflineExperiment, forcing the experiment key to be the previous one via the COMET_EXPERIMENT_KEY config variable and, if uploading, using comet upload --force-reupload ... I'm not sure about this, so you should test to see if there are any side effects. If that doesn't work (or has any bad side effects), let us know and we can find a solution.
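For two separate batch jobs, the key has to get from the training job to the test job somehow. The following is only a minimal sketch of one way to do that with a file on shared storage. The file name and both helper functions are hypothetical, and it assumes COMET_EXPERIMENT_KEY is read from the environment when the experiment is constructed:

```python
import os
from pathlib import Path

# Hypothetical location on storage that both jobs can see.
KEY_FILE = Path("experiment_key.txt")


def save_experiment_key(key: str, path: Path = KEY_FILE) -> None:
    """Training job: store the key, e.g. the value returned by ex.get_key()."""
    path.write_text(key.strip() + "\n")


def export_experiment_key(path: Path = KEY_FILE) -> str:
    """Test job: load the stored key and set COMET_EXPERIMENT_KEY.

    Call this before constructing the second OfflineExperiment
    (assumption: the variable is picked up at construction time).
    """
    key = path.read_text().strip()
    if not key:
        raise ValueError(f"no experiment key found in {path}")
    os.environ["COMET_EXPERIMENT_KEY"] = key
    return key
```

The training script would call save_experiment_key(ex.get_key()) before ex.end(), and the test script would call export_experiment_key() first. Whether the second run then continues or replaces the first one still needs to be checked.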

gauchm commented 4 years ago

I tried your suggestion like this:

import comet_ml

# First run: log a value offline and note the experiment key.
ex = comet_ml.OfflineExperiment(offline_directory='/tmp')
ex.log_other('asdf', 123)
print(ex.get_key())  # prints key like 2f492...
ex.end()

Then I ran export COMET_EXPERIMENT_KEY=2f492... and, in a second run:

ex = comet_ml.OfflineExperiment(offline_directory='/tmp')
ex.log_other('qwer', 789)
ex.end()

What seems to happen is that the second experiment overwrites the first rather than continuing it. If I examine the resulting zip file via comet offline 2f492...zip, there is an entry for qwer but none for asdf.

If I upload the first experiment between the two runs and then upload the second with comet upload --force-reupload, the result is the same: the second experiment overwrites the first.

dsblank commented 4 years ago

Thanks for trying this.

I'm making an issue for this so we can work on a solution.

dsblank commented 4 years ago

It looks like the data from the continuing experiment overwrites the first because the step values are repeated. Is it possible for you to add an offset to your steps so that they pick up where the first run left off?
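A rough sketch of what such an offset could look like when the two runs are separate processes. The helper names and the JSON file are hypothetical and not part of comet_ml; the idea is just that the first run records its last step and the second run shifts its own steps past it:

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def save_last_step(path: Path, step: int) -> None:
    """First run: record the last step that was logged."""
    path.write_text(json.dumps({"last_step": step}))


def load_step_offset(path: Path) -> int:
    """Second run: first unused step, or 0 if nothing was recorded."""
    if not path.exists():
        return 0
    return json.loads(path.read_text())["last_step"] + 1


def offset_steps(local_steps: Iterable[int], offset: int) -> Iterator[int]:
    """Shift the second run's local step numbers past the first run's."""
    for step in local_steps:
        yield offset + step
```

In the second script, each shifted value would then be passed on as the step, for example through ex.set_step(step) before logging.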

gauchm commented 4 years ago

I added an offset via ex.set_step(), but it doesn't seem to help:

Lothiraldan commented 4 years ago

Thank you for your report. Unfortunately, there is no way of having an OfflineExistingExperiment as of today.

We added your request to our roadmap and will keep you posted when we have a solution for it.

github-actions[bot] commented 10 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 10 months ago

This issue was closed because it has been stalled for 5 days with no activity.