choderalab / pinot

Probabilistic Inference for NOvel Therapeutics
MIT License

memory leak in `pinot.TrainAndTest`? #23

Closed yuanqing-wang closed 4 years ago

yuanqing-wang commented 4 years ago

It seems that this is taking an unexpectedly large amount of memory: https://github.com/choderalab/pinot/blob/9532b235ae1613ab71d7bcddde79dfacca357abf/pinot/app/experiment.py#L134

maxentile commented 4 years ago

How much memory usage is expected, and how much is happening?

Note that `Train` stores all iterates in memory, in the `states` dictionary, so I'd expect this to use roughly n_epochs × size(model) memory.
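The storage pattern described above can be sketched as follows (a minimal illustration, not pinot's actual `Train` class; the linear model and epoch count are stand-ins):

```python
import copy

import torch

# Stand-in model; pinot's actual models are graph nets, this is just for scale.
model = torch.nn.Linear(128, 128)

# Storing a full parameter snapshot every epoch, as Train's `states`
# dictionary does, grows memory linearly: roughly n_epochs * size(model).
states = {}
for epoch in range(10):
    # ... one training epoch would run here ...
    states[epoch] = copy.deepcopy(model.state_dict())

print(len(states))  # 10 snapshots, each a full copy of the parameters
```

This linear growth is expected and bounded, which is why it alone should not explain a runaway memory footprint.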

yuanqing-wang commented 4 years ago

Yes, the state dictionaries are all saved.

One state dictionary takes around 500 KB. Even storing one per iteration, 2000 iterations would only give us around 1 GB.
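One way to sanity-check the per-snapshot estimate (a sketch; the `Linear` model here is just a stand-in, substitute an actual pinot net to get the real number):

```python
import torch

# Stand-in model; replace with a real pinot net for an actual measurement.
model = torch.nn.Linear(256, 256)

# One state-dict snapshot occupies numel * bytes-per-element for each tensor.
n_bytes = sum(t.numel() * t.element_size() for t in model.state_dict().values())
print(f"{n_bytes / 1e3:.0f} KB per snapshot")
```

Multiplying this by the number of stored iterations gives an upper bound on what the snapshots themselves can account for; anything beyond that must come from somewhere else.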

dnguyen1196 commented 4 years ago

Added context: the following script reproduces this (@yuanqing-wang can confirm):

https://github.com/choderalab/pinot/blob/infrastructure/scripts/heteroscedastic_vs_homoscedastic/run.py

dnguyen1196 commented 4 years ago

My suspicion is that when you call `result = train_and_test.run()`, which invokes `TrainAndTest.run`, you copy not only the value of `result` (a list of MSE/error numbers) but also the entire computation graph. When PyTorch uses a variable, that variable is a node in a potentially huge computation graph, and this takes up a lot of memory. So perhaps you need to explicitly extract the value of the variable `result`:

```python
result = train_and_test.run()
results.append((param_dict, result))
del train_and_test

self.results = results
return self.results
```
yuanqing-wang commented 4 years ago

Thanks! I tried this, but I still see the same issue.

yuanqing-wang commented 4 years ago

@dnguyen1196

You're right! Sorry, I misunderstood your point. The metrics weren't `detach()`ed in the previous implementation, so I was copying giant computation graphs thousands of times.

Fixed here a6a41d4780d0663a3dd6db3dc6a41d3921c95276
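The leak mechanism in one picture: any tensor computed from model parameters carries a `grad_fn`, and storing it keeps every upstream node of its graph alive. A minimal sketch (toy model and metric, not pinot's actual code):

```python
import torch

model = torch.nn.Linear(4, 1)
x, y = torch.randn(32, 4), torch.randn(32, 1)

metrics_leaky, metrics_fixed = [], []
for _ in range(3):
    loss = torch.nn.functional.mse_loss(model(x), y)
    metrics_leaky.append(loss)           # retains the whole computation graph
    metrics_fixed.append(loss.detach())  # keeps only the value
    # loss.item() also works and yields a plain Python float

print(metrics_leaky[0].requires_grad, metrics_fixed[0].requires_grad)
```

Storing the detached tensor (or a plain float) each iteration keeps memory flat instead of growing with the number of iterations.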

maxentile commented 4 years ago

> Because when pytorch uses a variable, that variable is a node, part of a huge computation graph and this takes up a lot of memory. So perhaps, you need to explicitly extract the value of the variable result

Good sleuthing!

Fixed here a6a41d4

Do you ever need to take derivatives of things in metrics.py? Or are these used only for evaluation and report-generation? (The comment at the top of metrics.py says these might also be used for training.)

yuanqing-wang commented 4 years ago

Moved the `detach` step to the report-generation step.
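Moving the detach into report generation keeps the metrics differentiable where they might be used for training, while ensuring nothing stored for reporting holds a graph. A hypothetical sketch of that split (function names are made up, not pinot's API):

```python
import torch

def mse_metric(y_pred, y_true):
    # Left differentiable so it could also serve as a training loss.
    return torch.nn.functional.mse_loss(y_pred, y_true)

def make_report(metric_values):
    # Detach only here, at report time, so stored numbers carry no graph.
    return [float(v.detach()) for v in metric_values]

model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

m = mse_metric(model(x), y)
assert m.requires_grad        # still usable for training
report = make_report([m])     # plain floats, no graph retained
```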