NVIDIA / sentiment-discovery

Unsupervised Language Modeling at scale for robust sentiment classification

Dataloader workers create more and more subprocesses after each epoch #60

Open hjldegit opened 5 years ago

hjldegit commented 5 years ago

The dataloader workers create more and more subprocesses every epoch. This leads to a severe loss of free memory after each epoch ends; eventually my CPU memory is exhausted and the program crashes.

The only workaround I have found is to set 1 epoch with more iterations. But the dataloader also seems to leak memory slowly, and I do not know whether it is due to the PyTorch dataloader or the ShardLoader.

Thank you very much.

raulpuric commented 5 years ago

I think it's because of the ShardLoader structure we have. We'll take a look, thanks for bringing this to our attention. I think we can solve the memory leak for 1 epoch, but I don't know about the sub-processes per epoch.

Also if you're trying to reproduce our results for mLSTM training you can use one of our earlier releases from before we merged it with our transformer code.

hjldegit commented 5 years ago

> I think it's because of the ShardLoader structure we have. We'll take a look, thanks for bringing this to our attention. I think we can solve the memory leak for 1 epoch, but I don't know about the sub-processes per epoch. Also if you're trying to reproduce our results for mLSTM training you can use one of our earlier releases from before we merged it with our transformer code.

Thanks for the reply. Do you have any idea how to solve the memory leak for 1 epoch?

raulpuric commented 5 years ago

Sorry for the late reply. We did some digging around and weren't able to find any memory leaks for 1 epoch. Did you have a memory leak in GPU or CPU memory?

We were able to track down the multi-epoch process bomb you experienced. It's due to the creation of a ShardLoader manager at every epoch: https://github.com/NVIDIA/sentiment-discovery/blob/master/data_utils/loaders.py#L207. We need to add better garbage collection.
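
For illustration, the pattern that triggers it looks roughly like this (a simplified, hypothetical training loop, not the exact code in loaders.py; `dataloader`, `train_step`, and `num_epochs` are placeholder names): every pass of `for batch in dataloader` implicitly calls `iter(dataloader)`, which builds a fresh manager and worker pool, and the old ones are never reaped.

```python
# Hypothetical simplified loop: each epoch implicitly creates a new iterator,
# i.e. a new manager + worker processes, and the previous ones stay around.
for epoch in range(num_epochs):
    for batch in dataloader:   # iter(dataloader) runs here, once per epoch
        train_step(batch)
```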

The best way to get around this is to have one iterator you use across multiple epochs, instead of creating a new iterator (`for batch in dataloader`) every epoch; see the sketch below. However, this is quite similar to your proposed solution of having 1 epoch with more iterations.
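
A minimal sketch of that workaround (the names `dataset`, `train_step`, and `total_iters` are placeholders, not identifiers from this repo; shown with the standard PyTorch `DataLoader`, but the same idea applies here):

```python
from torch.utils.data import DataLoader

# dataset, train_step, total_iters are placeholders for this sketch
dataloader = DataLoader(dataset, batch_size=32, num_workers=4)

data_iter = iter(dataloader)          # worker processes are created once, here
for step in range(total_iters):
    try:
        batch = next(data_iter)
    except StopIteration:             # data exhausted: restart the same loader
        data_iter = iter(dataloader)
        batch = next(data_iter)
    train_step(batch)
```

Note that the restart in the `except` branch does build a fresh iterator, so for a true single pass you would keep `total_iters` within one epoch's worth of batches, which is essentially the 1-epoch-with-more-iterations approach.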

hjldegit commented 5 years ago

The main issue is the sub-process bomb.

I did observe a memory leak within 1 epoch, but it grows very slowly and memory usage does not reach the upper limit until the current epoch ends; after that it starts to climb because of the growing number of sub-processes.

Now memory usage is stable after fixing the sub-process issue. Thanks for your help.