brain-score / model-tools

Helper functions to extract model activations and translate from Machine Learning to Neuroscience
MIT License

OOM error caused by minimal memory requirements? #68

Closed RylanSchaeffer closed 1 year ago

RylanSchaeffer commented 1 year ago

I'm getting an OOM error claiming that not enough memory is available to allocate 1.19 GiB. I'm running SLURM jobs with ~80 GB of memory.

How can I investigate the cause? Is it possible that previous layers' activations are consuming memory? If so, is there some flag or some mechanism to free that memory?

Traceback (most recent call last):
  File "scripts/compute_eigenspectra_and_fit_encoding_model.py", line 63, in <module>
    activations_extractor=model,
  File "/home/gridsan/rschaeffer/FieteLab-Reg-Eff-Dim/regression_dimensionality/custom_model_tools/eigenspectrum.py", line 44, in fit
    image_transform_name=transform_name)
  File "/home/gridsan/rschaeffer/FieteLab-Reg-Eff-Dim/regdim_venv/lib/python3.7/site-packages/result_caching/__init__.py", line 223, in wrapper
    result = function(**reduced_call_args)
  File "/home/gridsan/rschaeffer/FieteLab-Reg-Eff-Dim/regression_dimensionality/custom_model_tools/eigenspectrum.py", line 141, in _fit
    activations = self._extractor(image_paths, layers=[layer])
  File "/home/gridsan/rschaeffer/FieteLab-Reg-Eff-Dim/regdim_venv/lib/python3.7/site-packages/model_tools/activations/pytorch.py", line 41, in __call__
    return self._extractor(*args, **kwargs)
  File "/home/gridsan/rschaeffer/FieteLab-Reg-Eff-Dim/regdim_venv/lib/python3.7/site-packages/model_tools/activations/core.py", line 43, in __call__
    return self.from_paths(stimuli_paths=stimuli, layers=layers, stimuli_identifier=stimuli_identifier)
  File "/home/gridsan/rschaeffer/FieteLab-Reg-Eff-Dim/regdim_venv/lib/python3.7/site-packages/model_tools/activations/core.py", line 73, in from_paths
    activations = fnc(layers=layers, stimuli_paths=reduced_paths)
  File "/home/gridsan/rschaeffer/FieteLab-Reg-Eff-Dim/regdim_venv/lib/python3.7/site-packages/model_tools/activations/core.py", line 85, in _from_paths
    layer_activations = self._get_activations_batched(stimuli_paths, layers=layers, batch_size=self._batch_size)
  File "/home/gridsan/rschaeffer/FieteLab-Reg-Eff-Dim/regdim_venv/lib/python3.7/site-packages/model_tools/activations/core.py", line 141, in _get_activations_batched
    layer_activations[layer_name] = np.concatenate((layer_activations[layer_name], layer_output))
  File "<__array_function__ internals>", line 6, in concatenate
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 1.19 GiB for an array with shape (1216, 256, 32, 32) and data type float32 
mschrimpf commented 1 year ago

The activations of early convnet layers in particular end up being quite large, e.g. we often have to run models with 200-400GB of memory if we want to investigate these early layers. You could ignore these layers, or increase the available memory.
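To get a feel for why early layers dominate, here is a rough back-of-the-envelope sketch (the layer shapes and image count below are hypothetical illustrations, not taken from this thread): an early conv layer keeps a large spatial map per image, while a late layer is spatially small, so holding activations for a whole stimulus set is far more expensive for early layers.

```python
import numpy as np


def activation_gib(n_images, channels, height, width, itemsize=4):
    """Memory (GiB) needed to hold float32 activations for a full stimulus set."""
    return n_images * channels * height * width * itemsize / 2**30


# Hypothetical shapes: a large early conv map vs. a small late-layer map.
early = activation_gib(10_000, 64, 112, 112)  # large spatial extent
late = activation_gib(10_000, 512, 7, 7)      # small spatial extent
print(f"early: {early:.1f} GiB, late: {late:.1f} GiB")
```

Scaling the image count or spatial resolution up from these toy numbers quickly reaches the hundreds of gigabytes mentioned above.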

RylanSchaeffer commented 1 year ago

What you said makes sense, but I'm not sure I see how it's relevant to this error.

In this particular error, the activations have shape (1216, 256, 32, 32) and thus require only 1.19 GiB of memory. That is much less than the total available memory, unless something else is already consuming it. So is something hogging all the memory (and if so, what is it and why is it present), or is something else going awry?
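As a sanity check (not part of the original exchange), the 1.19 GiB figure in the traceback can be reproduced directly from the failed allocation's shape and dtype:

```python
import numpy as np

# Shape and dtype from the traceback's failed allocation.
shape = (1216, 256, 32, 32)
itemsize = np.dtype(np.float32).itemsize  # 4 bytes

n_bytes = int(np.prod(shape)) * itemsize
print(f"{n_bytes / 2**30:.2f} GiB")  # ≈ 1.19 GiB
```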

mschrimpf commented 1 year ago

I believe it's attempting to allocate an additional 1.19 GiB during the concatenation of the arrays (np.concatenate((layer_activations[layer_name], layer_output))). You could profile how big layer_activations already is at that point; I'm guessing this does not occur on the first batch of images but rather while accumulating across multiple batches. All the activations are needed later to compare against the neural recordings.
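To illustrate the point (a minimal sketch with toy shapes, not the actual model-tools code): each np.concatenate call allocates a fresh array large enough to hold everything accumulated so far plus the new batch, so the failed allocation reflects the cumulative size across batches, not a single batch. Collecting batches in a list and concatenating once at the end avoids the repeated intermediate copies:

```python
import numpy as np

batch_shape = (4, 256, 32, 32)  # toy batch of layer activations

# Pattern from the traceback: grow one array by repeated concatenation.
# Every call copies all previously accumulated activations into a new,
# larger allocation, so peak memory rises with each batch processed.
accumulated = np.empty((0,) + batch_shape[1:], dtype=np.float32)
for _ in range(3):
    batch = np.zeros(batch_shape, dtype=np.float32)
    accumulated = np.concatenate((accumulated, batch))

# Alternative: collect batches in a list, then concatenate once.
batches = [np.zeros(batch_shape, dtype=np.float32) for _ in range(3)]
combined = np.concatenate(batches)

assert accumulated.shape == combined.shape == (12, 256, 32, 32)
```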