automl / Auto-PyTorch

Automatic architecture search and hyperparameter optimization for PyTorch
Apache License 2.0

Still crazy large mem consumption #19

Closed · mlindauer closed this issue 3 years ago

mlindauer commented 4 years ago

Hi,

I tried to run AutoPyTorch again:

from autoPyTorch import AutoNetImageClassification

import numpy as np
import os

autonet_image_classification = AutoNetImageClassification(config_preset="full_cs", result_logger_dir="logs/")

path_to_cifar_csv = os.path.abspath("./datasets/CIFAR10.csv")

autonet_image_classification.fit(X_train=np.array([path_to_cifar_csv]),
                                 Y_train=np.array([0]),
                                 min_budget=300,
                                 max_budget=900,
                                 max_runtime=18000,
                                 default_dataset_download_dir="./datasets",
                                 images_root_folders=["./datasets"],
                                 log_level="info")

However, after nearly 2 h, Auto-PyTorch had used more than 60 GB of RAM:

========== Job Epilogue Start ============
Job Id: 4274574.batch.css.lan
Resources requested by Job: mem=60gb,neednodes=1:ppn=1,nodes=1:ppn=1,walltime=08:00:00
Resources used by Job: cput=01:48:31,mem=77352676kb,vmem=83220752kb,walltime=01:49:07 
Execution host(s): dumbo-n014
Job Exit Status: 271
========== Job Epilogue End ============

Do you have any idea why that's the case? For the first 20 minutes, it used only roughly 7 GB.
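
For anyone trying to reproduce this, the growth over time can be watched with a small helper along these lines (a sketch using psutil, which is an assumption on my side and not something Auto-PyTorch ships):

import os
import psutil

def cumulated_memory_mib():
    # Sum the resident set size of this process and all of its
    # children, similar to the accounting in the job epilogue above.
    main = psutil.Process(os.getpid())
    total = 0
    for p in [main] + main.children(recursive=True):
        try:
            total += p.memory_info().rss
        except psutil.NoSuchProcess:
            pass  # a child exited between listing and querying
    return total / 2**20  # bytes -> MiB

Calling this every few seconds from a separate thread during fit() would show whether the growth is steady or happens in jumps.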

LMZimmer commented 4 years ago

Not off the top of my head. It is definitely not the same issue as before with the Jupyter notebook. I am looking into it.

LMZimmer commented 4 years ago

I couldn't reproduce the issue on Slurm with that code. Of 32 GB of allocated memory, around 3.5 GB were used most of the time, peaking at 5.5 GB. I will add a memory enforcement with pynisher for image data, though.
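
If you want such a cap in the meantime, wrapping the training call with pynisher would look roughly like this; a minimal sketch against pynisher's pre-1.0 enforce_limits API, where fit_one_config is a hypothetical placeholder rather than the actual Auto-PyTorch integration:

import pynisher

def fit_one_config():
    # Hypothetical placeholder for the actual training call.
    return "ok"

# Runs the function in a subprocess that is killed once it exceeds
# 6 GB of RAM or 900 s of wall time.
limited_fit = pynisher.enforce_limits(mem_in_mb=6000, wall_time_in_s=900)(fit_one_config)
result = limited_fit()  # returns None if a limit was hit

if limited_fit.exit_status is pynisher.MemorylimitException:
    print("killed: memory limit exceeded")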

Edit: It could be because I ran on a GPU; was your run CPU-only?

mlindauer commented 4 years ago

I installed it again in a VirtualBox Ubuntu image (to rule out the OS; this setup is most likely quite similar to yours), but I can still reproduce the problem. For example:

Current children cumulated CPU time: 510.84 s
Current children cumulated vsize: 16804940 KiB
Current children cumulated memory: 10458920 KiB

The peak was at

Current children cumulated memory: 12615168 KiB

At least I could not reproduce the 77 GB reported before... weird...

I ran this on CPU (which surprisingly parallelizes the run without me telling it to do so), using Anaconda 4.7.12 (Python 3.7.4).
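
The implicit parallelism presumably comes from the OpenMP/MKL thread pools that PyTorch and NumPy create on CPU; capping them looks roughly like this (generic PyTorch/OpenMP settings, not an Auto-PyTorch option):

import os

# Must be set before torch/numpy are imported; the OpenMP/MKL
# thread pools are sized at library initialization.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import torch
torch.set_num_threads(1)          # intra-op CPU parallelism
torch.set_num_interop_threads(1)  # inter-op CPU parallelism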

franchuterivera commented 3 years ago

Hello, we are closing this issue and will track all memory-related problems in https://github.com/automl/Auto-PyTorch/issues/259.