anniesch / jtt

Code for "Just Train Twice: Improving Group Robustness without Training Group Information"

Memory Leak #10

Open dobulexyz opened 8 months ago

dobulexyz commented 8 months ago

Whenever I run either the ERM or upweighted training routines I encounter a memory leak during the training epochs. There is no memory leaked during the validation or test epochs.

Initial (ERM) runs leak around 450 MB per epoch, and upweighted runs leak around 2410 MB per epoch. Both initial and JTT runs use the same batch size of 64, and both leak roughly 180 KB per batch.

There are some training-only instructions in the run_epoch function in train.py that involve the loss calculator and the csv logger. I'm fairly confident there's nothing in the csv logger code that would cause a memory leak. I'm less confident about the loss calculator; however, I've yet to find anything there that looks like it would leak memory.
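For what it's worth, a common cause of this exact symptom (leak only during training epochs, roughly constant per batch) is a per-batch logger or stats accumulator that stores whole loss tensors, each of which keeps its autograd graph alive, instead of plain floats. A toy illustration of the pattern, with a byte buffer standing in for the retained graph (no PyTorch needed; the class and sizes here are made up):

```python
class FakeTensor:
    """Stand-in for a loss tensor: holds a reference to a large
    'computation graph' (here, just a 1 MB buffer)."""
    def __init__(self, value):
        self.value = value
        self.graph = bytearray(1024 * 1024)  # ~1 MB retained per tensor

    def item(self):
        # Returns the plain Python number, dropping the graph reference.
        return self.value

leaky_log, safe_log = [], []
for step in range(100):
    loss = FakeTensor(0.5)
    leaky_log.append(loss)        # retains the 1 MB "graph" every batch
    safe_log.append(loss.item())  # stores only a float

leaky_bytes = sum(len(t.graph) for t in leaky_log)
print(leaky_bytes // (1024 * 1024))  # MB pinned by the leaky logger: 100
```

During validation/test the loss is usually computed under no_grad (no graph to retain), which would explain why only the training epochs leak.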

alainray commented 3 months ago

I got a memory leak when training on MultiNLI. I changed the 'num_workers' parameter in the dataloader to 0. Now it works! I don't know if that helps!
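For anyone else trying this: the workaround amounts to constructing the loader with num_workers=0, so batches are loaded in the main process and per-worker memory growth can't accumulate. A sketch with a placeholder dataset (the actual dataset and batch size in this repo differ):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for the real dataset.
dataset = TensorDataset(torch.randn(256, 4), torch.randint(0, 2, (256,)))

# num_workers=0 disables worker processes; data loading happens in
# the main process, which sidesteps worker-side memory growth.
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=0)
print(loader.num_workers)  # 0
```

The trade-off is slower data loading, since preprocessing no longer overlaps with the forward/backward pass.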

dobulexyz commented 2 months ago

This was when training on CelebA. I fixed it though. I believe they forgot to release a tensor in the data logger.
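If it helps anyone hitting the same thing: the usual fix for a logger retaining a tensor is to detach it to a plain Python float before storing it, so the autograd graph can be freed. A minimal sketch (the dict name and keys are made up, not the repo's actual logger):

```python
import torch

log = {}
loss = (torch.randn(8, requires_grad=True) ** 2).mean()

# Leaky: storing the tensor itself keeps its whole autograd graph alive.
log["loss_tensor"] = loss

# Fix: detach and convert to a Python float before logging.
log["loss"] = loss.detach().item()

print(type(log["loss"]))  # <class 'float'>
```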