andychisholm opened this issue 1 year ago
Hey @andychisholm! Thanks for submitting the issue with such details. Really appreciate that 🙌 I'll try to reproduce it, please do expect some questions during that process.
@alberttorosyan any thoughts on this one? Even a potentially fruitful direction to explore when debugging would be useful.
@andychisholm, I don't have good evidence on what's happening yet. The only possible thing that comes to my mind is the following: the `AimLogger` instance is available on the rank N processes, while the intended use of the logger is to run only on rank 0.
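For context, the usual Lightning pattern for keeping logger side effects on rank 0 is to guard the call sites with `rank_zero_only`. The snippet below is only an illustrative sketch of that pattern, not Aim's internals; `compute_loss` and the metric name are placeholders.

```python
# Illustrative sketch: restrict logger-related work to rank 0 in a
# LightningModule. Generic Lightning pattern, not Aim internals.
import pytorch_lightning as pl
from pytorch_lightning.utilities import rank_zero_only


class MyModule(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical helper
        # self.log() reduces and forwards metrics so that the logger
        # backend is only written to on rank 0 by default.
        self.log("train_loss", loss)
        return loss

    @rank_zero_only
    def log_extra_artifacts(self):
        # Anything here runs on rank 0 only, so only one process ever
        # touches the (remote) Aim client.
        self.logger.experiment.track(1.0, name="custom_metric")
```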
I'll continue looking into this. Any additional information would be a huge help!
I'm seeing the same issue. My aim repo is also remote. Seems related to this: https://github.com/Lightning-AI/lightning/issues/8821
Just to follow up on this one, I think it's to do with a lack of forking support in the gRPC client. Regardless of whether the Aim loggers are used in sub-processes, they blow up the data loaders in various non-deterministic ways. For example, if you do a DDP train with multiple GPUs and multiple dataloader workers per GPU, this occurs; but if you switch the worker start method from the default `fork` to `spawn`, it's mitigated (see the sketch below).
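For anyone else hitting this, the workaround is roughly the following. `my_dataset` and the batch/worker counts are placeholders; `DataLoader` accepts a `multiprocessing_context` argument, so the worker start method can be switched without changing the global default.

```python
# Sketch of the mitigation described above: start dataloader workers with
# "spawn" instead of the default "fork", so workers don't inherit the
# parent's gRPC client state. Dataset and sizes are placeholders.
import torch.multiprocessing as mp
from torch.utils.data import DataLoader

train_loader = DataLoader(
    my_dataset,                                       # placeholder dataset
    batch_size=32,
    num_workers=4,
    multiprocessing_context=mp.get_context("spawn"),
)

# Alternatively, change the default start method globally before any
# loaders or processes are created:
# mp.set_start_method("spawn", force=True)
```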
I can also confirm this issue. I am guessing it is related to https://github.com/aimhubio/aim/issues/1297.
🐛 Bug
We're seeing a non-deterministic error which occurs during a PyTorch Lightning train when we adopt a remote Aim repo for logging (i.e. setting `repo="aim://our-aim-server:53800/"` when initializing `aim.pytorch_lightning.AimLogger`). This only happens when switching from a local `AimLogger` to a remote repo, with no other changes to the codebase (a sketch of the setup is shown below).
Mitigations
If `num_workers` on the torch data loaders is reduced to 0, the issue does not reproduce, so it seems to be multi-processing related.
It's difficult to see how Aim is involved at all in the data loader pipeline to produce a relationship like this, but this is what we can observe.
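Not a minimal reproduction, but roughly the shape of the setup described above; the server address, experiment name, model, dataset, and device count are placeholders.

```python
# Sketch of the setup that triggers this: a remote Aim repo on the
# Lightning logger plus multi-worker dataloaders under DDP.
import pytorch_lightning as pl
from aim.pytorch_lightning import AimLogger
from torch.utils.data import DataLoader

aim_logger = AimLogger(
    repo="aim://our-aim-server:53800/",  # remote repo; a local repo works fine
    experiment="ddp-train",              # placeholder experiment name
)

train_loader = DataLoader(my_dataset, batch_size=32, num_workers=8)  # placeholder dataset

trainer = pl.Trainer(
    logger=aim_logger,
    accelerator="gpu",
    devices=4,          # placeholder device count
    strategy="ddp",
)
trainer.fit(my_model, train_loader)  # placeholder LightningModule
```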
Error Detail
During the first epoch we typically see one to many:
Immediately followed by data loader abort stack traces, e.g.:
To reproduce
Unable to provide a minimal reproduction at this stage.
Appreciate this is going to be incredibly difficult to debug! Just hoping someone's seen something like this before.
Expected behavior
Aim logger initialisation should not cause torch data loader deadlocks.
Environment
Python version: 3.8.10
pip version: 23.0
OS: Ubuntu 20.04.5 LTS