Closed by yuanqing-wang 4 years ago
How much memory usage is expected, and how much is happening?
Note that `Train` stores all iterates in memory, in the `states` dictionary, so I'd expect this to use n_epochs × size(model) memory.
Yes, the state dictionaries are all saved.
One state dictionary takes around 500K. Even if stored at every iteration, 2000 iterations would only give us around 1G.
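As a sanity check, the arithmetic behind this estimate can be written out (the sizes here are the ones quoted in this thread, not measured from pinot):

```python
# Back-of-envelope memory estimate for storing one state dict per iteration.
# The 500 KB figure is the one quoted in this thread, not a measured value.
state_dict_kb = 500   # approximate size of one saved state dict
n_iterations = 2000   # iterations over which state dicts accumulate
total_mb = state_dict_kb * n_iterations / 1000
print(f"~{total_mb:.0f} MB total")  # 500 KB x 2000 is about 1000 MB (~1 GB)
```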
Added context: https://github.com/choderalab/pinot/blob/infrastructure/scripts/heteroscedastic_vs_homoscedastic/run.py reproduces this. @yuanqing-wang can confirm.
My suspicion is that when you call `result = train_and_test.run()`, which invokes `TrainAndTest.run`, you not only copy the value of `result`, which is a list of MSE/error numbers, but also the entire computation graph. When PyTorch uses a variable, that variable becomes a node in a huge computation graph, and this takes up a lot of memory. So perhaps you need to explicitly extract the value of the variable `result`:
```python
result = train_and_test.run()
results.append((param_dict, result))
del train_and_test
self.results = results
return self.results
```
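The suspected leak can be reproduced in isolation. A minimal sketch (tensor shapes and names are illustrative, not pinot's actual model): appending a loss tensor that still carries a `grad_fn` keeps the whole autograd graph alive, while `.item()` or `.detach()` drops it.

```python
import torch

x = torch.randn(8, 4)
w = torch.randn(4, 1, requires_grad=True)

leaky, safe = [], []
for _ in range(3):
    loss = ((x @ w) ** 2).mean()
    leaky.append(loss)        # still attached: retains the computation graph
    safe.append(loss.item())  # plain Python float, no graph attached

print(leaky[0].grad_fn is not None)             # True: the graph is still alive
print(all(isinstance(v, float) for v in safe))  # True: just numbers
```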
Thanks! I tried this, but I still have the same issue.
@dnguyen1196
You're right! Sorry, I misunderstood your point. The metrics weren't `detach()`ed in the previous implementation, so I was copying giant computation graphs thousands of times.
Fixed here a6a41d4780d0663a3dd6db3dc6a41d3921c95276
> When PyTorch uses a variable, that variable becomes a node in a huge computation graph, and this takes up a lot of memory. So perhaps you need to explicitly extract the value of the variable `result`.
Good sleuthing!
Fixed here a6a41d4
Do you ever need to take derivatives of things in `metrics.py`? Or are these used only for evaluation and report generation? (The comment at the top of `metrics.py` says these might also be used for training.)
Moved the `detach` step to the report-generation step.
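A sketch of that arrangement, using hypothetical names (`mse` and `record_for_report` are illustrative, not pinot's actual API): the metric stays differentiable so it can still serve as a training loss, and `detach()` happens only when a value is recorded for the report.

```python
import torch

def mse(y_pred, y_true):
    # left differentiable so it can double as a training loss
    return ((y_pred - y_true) ** 2).mean()

def record_for_report(records, name, value):
    # detach only at report time, so stored values carry no graph
    records.setdefault(name, []).append(value.detach().item())

records = {}
y_true = torch.zeros(16)
y_pred = torch.randn(16, requires_grad=True)

loss = mse(y_pred, y_true)
loss.backward()                  # gradients still flow for training
record_for_report(records, "mse", loss)
print(y_pred.grad is not None)   # True: still usable for training
print(records["mse"])            # a plain float, no graph attached
```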
It seems that this is taking an unexpectedly large amount of memory: https://github.com/choderalab/pinot/blob/9532b235ae1613ab71d7bcddde79dfacca357abf/pinot/app/experiment.py#L134