comet-ml / issue-tracking

Questions, Help, and Issues for Comet ML
https://www.comet.ml

Hyperparameters are sometimes not logged when using distributed PyTorch #450

Closed nsriniva03 closed 1 year ago

nsriniva03 commented 2 years ago

Describe the Bug

Hello, I am using comet.ml with distributed PyTorch. When the program runs, the model is initialized on N GPUs and comet.ml starts N corresponding experiments. However, the hyperparameters are logged for only some of those experiments, not all of them. Why would this happen, and what does it mean?

Expected behavior

I would expect the hyperparameters to be logged in all the experiments.
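
As a rough sketch of the setup described above: each worker process owns one experiment and logs the same hyperparameter dictionary to it. This assumes the job is started by a launcher that sets the RANK and WORLD_SIZE environment variables (as torchrun does), and that the experiment object is the comet_ml.Experiment created in that process. The helper names hparams_with_rank and log_hparams are hypothetical. It illustrates the expected behavior only; it is not a fix for the missing parameters.

```python
import os


def hparams_with_rank(hparams, env=os.environ):
    """Return a copy of hparams tagged with this worker's rank and world size."""
    tagged = dict(hparams)
    # RANK / WORLD_SIZE are assumed to be set by the launcher; fall back to a
    # single-process run when they are absent.
    tagged["rank"] = int(env.get("RANK", 0))
    tagged["world_size"] = int(env.get("WORLD_SIZE", 1))
    return tagged


def log_hparams(experiment, hparams, env=os.environ):
    """Log the tagged hyperparameters to this worker's experiment."""
    # log_parameters takes a dict. `experiment` is assumed to be a
    # comet_ml.Experiment, but any object with that method works here.
    experiment.log_parameters(hparams_with_rank(hparams, env))
```

Tagging each experiment with its rank makes it easier to tell which worker an experiment with missing parameters belongs to.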

Screenshots or GIFs

For experiment typical_root_7164, the hyperparameters are not logged.

Whereas for experiment surviving_seasoining_1118, the hyperparameters are logged.

dsblank commented 2 years ago

Do they eventually show up after the experiment has finished running?

nsriniva03 commented 2 years ago

I always kill the program and restart it. I do this till all the parameters are visible on all the experiments. But I could let it run and get back to you about it. It just seemed strange that it would log for some experiments and not others.

dsblank commented 2 years ago

Some things don't log until the end, so it isn't a good idea to kill an experiment. If at all possible, the experiment should run until completion. If you want, you can call experiment.end() manually in your code.
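
A minimal sketch of what calling experiment.end() manually could look like, assuming the experiment object is a comet_ml.Experiment. The wrapper name run_and_end and the train_fn argument are hypothetical. The try/finally block only covers normal exits and Python exceptions; it does not help if the process is killed with SIGKILL.

```python
def run_and_end(experiment, hparams, train_fn):
    """Log hyperparameters, run training, and always end the experiment."""
    try:
        # Log the parameters before training starts.
        experiment.log_parameters(hparams)
        return train_fn()
    finally:
        # Runs even if train_fn raises, so the experiment is ended
        # explicitly instead of relying on interpreter shutdown.
        experiment.end()


# Usage (assumes comet_ml is installed and an API key is configured):
#   from comet_ml import Experiment
#   experiment = Experiment(project_name="my-project")  # hypothetical name
#   run_and_end(experiment, {"lr": 0.01}, train)        # train is your loop
```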

nsriniva03 commented 2 years ago

Hi Douglas,

The parameters are not showing up after the experiment has finished running.

appalling_aracde has not logged any parameters.

dsblank commented 2 years ago

Some additional questions:

  1. Did the hyperparameters ever show up (even after a browser refresh)? Sometimes the server takes a few minutes to process everything.
  2. Did all of these really run for 44 hours, 3 minutes, and 30-some seconds? I don't think I've ever seen such consistency across computers for that long. Are they reporting a lot of data over that time, or all at once? An experiment could get throttled on various limits. See: https://www.comet.ml/docs/python-sdk/warnings-errors/#rate-limits
  3. Do you have the output captured for the experiment(s) that aren't showing the hyperparameters? I'm wondering if they had connection issues or crashed. Are these links you can share with me in DM? I'm doug@comet.ml
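
For the third question, it helps to have each worker's output in its own file so that a crashed or disconnected worker can be inspected afterwards. A minimal sketch using only the Python standard library, assuming the launcher sets the RANK environment variable. The function name make_rank_logger and the file naming scheme are hypothetical and not part of Comet.

```python
import logging
import os


def make_rank_logger(log_dir, env=os.environ):
    """Return a logger that writes to <log_dir>/rank<RANK>.log."""
    rank = env.get("RANK", "0")  # assumed to be set by the launcher
    logger = logging.getLogger(f"train.rank{rank}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep worker messages out of the root logger
    if not logger.handlers:  # repeated calls reuse the first handler
        os.makedirs(log_dir, exist_ok=True)
        handler = logging.FileHandler(os.path.join(log_dir, f"rank{rank}.log"))
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Each worker would call make_rank_logger once at startup and write a line before and after logging its hyperparameters, which shows whether a given worker reached that point at all.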

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 5 days with no activity.