Lightning-AI / pytorch-lightning

`NeptuneCallback` produces lots of `X-coordinates (step) must be strictly increasing` errors #20281

Open iirekm opened 3 weeks ago

iirekm commented 3 weeks ago

Bug description

When Optuna is run in parallel mode (`n_jobs=-1`) with `NeptuneCallback`, I get:

[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: trials/values. Invalid point: 0.0

It is normal for information to arrive out of order during parallel or distributed hyperparameter optimization. Either Neptune should support adding steps out of order, or `NeptuneCallback` should handle it somehow (e.g. by using an artificial step number).
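One way the "artificial step number" idea could look, sketched with the base Neptune client API instead of the integration's `NeptuneCallback` (the `make_thread_safe_callback` helper, the `log_trial` name, and the `trials/values` field are all illustrative, not part of any existing API):

```python
import itertools
import threading

import neptune
import optuna


def make_thread_safe_callback(run: neptune.Run):
    """Build an Optuna callback that logs trial values with its own step counter."""
    counter = itertools.count()
    lock = threading.Lock()

    def log_trial(study: optuna.Study, trial: optuna.trial.FrozenTrial) -> None:
        if trial.value is None:  # skip failed/pruned trials
            return
        with lock:
            # Steps follow logging order, so they stay strictly increasing
            # even when parallel trials finish out of order.
            run["trials/values"].append(trial.value, step=next(counter))

    return log_trial
```

Such a callback could then be passed via `study.optimize(..., callbacks=[make_thread_safe_callback(run)], n_jobs=-1)` until ordering is handled by the integration itself.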

What version are you seeing the problem on?

v1.x

How to reproduce the bug

study.optimize(..., callbacks=[NeptuneCallback(run)], n_jobs=-1)
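For context, a fuller reproduction sketch along these lines, assuming `NeptuneCallback` comes from the `neptune-optuna` integration (the objective function and trial count are placeholders):

```python
# Credentials are expected via the NEPTUNE_API_TOKEN / NEPTUNE_PROJECT env vars.
import neptune
import neptune.integrations.optuna as npt_utils
import optuna


def objective(trial: optuna.Trial) -> float:
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2


run = neptune.init_run()
study = optuna.create_study(direction="minimize")

# n_jobs=-1 runs trials in parallel threads, so the callback can report a trial
# whose step is lower than one already logged, triggering the
# "strictly increasing" error.
study.optimize(
    objective,
    n_trials=50,
    callbacks=[npt_utils.NeptuneCallback(run)],
    n_jobs=-1,
)
```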

Error messages and logs

[neptune] [error ] Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute: trials/values. Invalid point: 0.0

Environment

Any multi-threaded environment.

More info

No response

guttikondaV commented 1 week ago

Hi @iirekm. I had the same problem when working with Neptune. I was logging metrics during the train, val, and test phases, and later realized I was using the same names for the metrics in the metrics dictionary. Sometimes I was even using the same Torchmetrics instance in all the phases. Perhaps you're doing the same? Could you check again? I'm not a pro at this, just hoping it's the same gotcha as mine. Sorry if it doesn't help.
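A rough sketch of what keeping the phases separate can look like in a LightningModule (class name, metric choice, and field names are made up; the import assumes the unified `lightning` package):

```python
import torch
import torchmetrics
from lightning.pytorch import LightningModule


class LitClassifier(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        # One metric object per phase: sharing a single instance across
        # training and validation mixes internal state and logged steps.
        self.train_acc = torchmetrics.classification.MulticlassAccuracy(num_classes=2)
        self.val_acc = torchmetrics.classification.MulticlassAccuracy(num_classes=2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.layer(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        self.train_acc.update(logits, y)
        # Distinct metric names per phase keep the logged series separate.
        self.log("train/acc", self.train_acc, on_step=False, on_epoch=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self.layer(x)
        self.val_acc.update(logits, y)
        self.log("val/acc", self.val_acc, on_step=False, on_epoch=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())
```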