NeuroTheoryUMD / NDNT


Potential memory leak in LBFGS training #6

Open jcbyts opened 2 years ago

jcbyts commented 2 years ago

LBFGS training runs out of memory after a few iterations (epochs?) and may have a leak. The model should be able to fit on the GPU (it works in the TensorFlow NDN), but it crashes in the PyTorch version.
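For anyone debugging this: a classic cause of per-iteration GPU memory growth in PyTorch (a hypothetical illustration, not the actual NDNT trainer code) is storing the loss *tensor* each step, which keeps its autograd graph alive. A minimal sketch of the pattern and the fix, using a toy linear model:

```python
import torch

def train_lbfgs(model, data, target, n_steps=3, leak=False):
    """Run a few LBFGS steps; `leak=True` mimics the suspected bug.

    LBFGS requires a closure that re-evaluates the loss. `opt.step`
    returns the loss *tensor*; appending that tensor to a history list
    pins its autograd graph in (GPU) memory every step. Storing the
    plain float via `.item()` releases the graph.
    """
    opt = torch.optim.LBFGS(model.parameters(), max_iter=5)
    history = []
    for _ in range(n_steps):
        def closure():
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(data), target)
            loss.backward()
            return loss
        loss = opt.step(closure)
        # leak=True keeps the tensor (and its graph) alive across steps
        history.append(loss if leak else loss.item())
    return history

model = torch.nn.Linear(4, 1)
x, y = torch.randn(32, 4), torch.randn(32, 1)
hist = train_lbfgs(model, x, y)
print(all(isinstance(h, float) for h in hist))
```

With `leak=False` the history holds plain floats and memory stays flat; whether the NDNT trainer does something equivalent would need checking against its source.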

dbutts commented 2 years ago

Some updates: both bugs (using LBFGS with my "whisker" dataset, and AdamW with my color dataset) have to do with memory problems on the GPU. This is perhaps obvious, because they are these weird cuDNN errors that Google will tell you are associated with the GPU running out of memory.

The (perhaps minimal) insights here stem from two observations:

  1. Forcing the model to fit on the CPU, with the same dataset/model parameters etc., of course works. In particular, full-batch LBFGS (the whole dataset at once) seems to work fine. So it might not be a problem with the LBFGS implementation per se, but rather with GPU management.

  2. It does not seem to be a problem with running out of GPU memory (this is fuzzier). This is based on a few observations:

     - The 2-D color dataset (Bevil's data) fits fine on the CPU, but crashes immediately even with very small batches.
     - The whisker dataset easily fits on the GPU with the NDN code (full-batch LBFGS on the GPU). It is not a large dataset: certainly not 11 GB (only two "pixels" over time).
     - To get the whisker dataset to run at all with NDNT, I have to choose a relatively small batch size. When it does run, it only takes up ~2 GB (a fraction of the total memory) but crashes after making it 40% through the data. When it does, the GPU is still mostly empty, but of course I have to restart the kernel to restore the memory.
     - I can get through the whisker data using AdamW with small batch sizes, no problem. So it is related to the batch size, just somehow not in a way that clearly relates to the total memory available on the GPU.
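One way to pin down which of these it is: log allocated vs. reserved CUDA memory around each batch. A small helper (hypothetical name, not part of NDNT) using PyTorch's standard memory introspection:

```python
import torch

def gpu_memory_report(tag=""):
    """Print allocated vs. reserved CUDA memory in MiB (no-op without a GPU).

    `memory_allocated` is what live tensors actually occupy;
    `memory_reserved` is what PyTorch's caching allocator holds from the
    driver. Steady growth of `allocated` across batches suggests
    references being kept alive (a leak); a large reserved/allocated gap
    with crashes suggests allocator caching or fragmentation rather than
    the dataset simply not fitting.
    """
    if not torch.cuda.is_available():
        return None
    alloc = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"{tag}: allocated={alloc:.0f} MiB reserved={reserved:.0f} MiB")
    return alloc, reserved
```

Calling `gpu_memory_report(f"batch {i}")` inside the training loop would show whether usage really climbs toward the limit before the cuDNN error, or whether it dies with the GPU mostly empty, as described above.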

So, I would say things point to GPU memory management by the new trainer.
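If that diagnosis is right, a workaround worth trying (a sketch under that assumption, not a fix to the trainer itself) is to drop per-batch references promptly and return cached blocks to the driver between batches:

```python
import torch

def run_batches(model, loader, loss_fn):
    """Evaluate batches while releasing GPU references aggressively.

    This does not cure a true leak, but it rules out the caching
    allocator holding memory as the reason the process dies while
    the GPU still looks mostly empty.
    """
    losses = []
    for x, y in loader:
        pred = model(x)
        loss = loss_fn(pred, y)
        losses.append(loss.item())    # keep a float, not the tensor
        del pred, loss                # drop graph references promptly
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached blocks to the driver
    return losses

# toy usage with a stand-in model and an in-memory "loader"
model = torch.nn.Linear(2, 1)
loader = [(torch.randn(8, 2), torch.randn(8, 1)) for _ in range(3)]
out = run_batches(model, loader, torch.nn.functional.mse_loss)
```

`empty_cache()` per batch is slow and normally unnecessary, so it is only a diagnostic; if it makes the crash go away, the trainer is likely holding tensors across batches.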