Open luckyhug opened 2 years ago

Does auto-sklearn support incremental learning or partial_fit? My dataset is too big for RAM (about 230+ GB; I can store it in a list, but there is not enough memory to convert that list to an np array). Is there any advice or are there examples for dealing with a dataset like this?

Thank you very much!
Hi @luckyhug,
No, I don't think there is currently a way to use that much data effectively in auto-sklearn natively. My only suggestion would be to run auto-sklearn on a subsample of the data and use show_models()
to see which models and hyperparameters it selected, then use those configurations in the next step of your pipeline and handle the incremental learning and partial fitting in a custom manner.
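A rough sketch of that two-step workflow, in case it helps. The file names, time/memory limits, and the choice of SGDClassifier below are purely illustrative placeholders, not something auto-sklearn produces for you:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
import autosklearn.classification

# Step 1: run the search on a subsample that fits comfortably in RAM.
# "X_subsample.npy" / "y_subsample.npy" stand in for a random slice of
# the full 230+ GB dataset, small enough to load normally.
X_sub = np.load("X_subsample.npy")
y_sub = np.load("y_subsample.npy")

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,
    memory_limit=8192,
)
automl.fit(X_sub, y_sub)

# Inspect which models / hyperparameters did well on the subsample.
print(automl.leaderboard())
print(automl.show_models())

# Step 2: incremental learning on the full data, handled outside auto-sklearn.
# If the search points towards e.g. a linear model, an estimator with
# partial_fit (here a plain scikit-learn SGDClassifier) can be trained chunk
# by chunk from memory-mapped arrays so the full data never sits in RAM at once.
X_full = np.load("X_full.npy", mmap_mode="r")   # placeholder paths
y_full = np.load("y_full.npy", mmap_mode="r")
classes = np.unique(y_sub)

clf = SGDClassifier()
chunk = 100_000
for start in range(0, X_full.shape[0], chunk):
    clf.partial_fit(X_full[start:start + chunk],
                    y_full[start:start + chunk],
                    classes=classes)
```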
Best, Eddie
@luckyhug did you consider downcasting numerical values, e.g. from float64 to float16, already? This could reduce your memory consumption by a factor of 4.
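Something along these lines, for example (the random array is just a stand-in for the real feature matrix; float16 only keeps roughly 3 significant digits, so it's worth checking your features tolerate the precision loss):

```python
import numpy as np
import pandas as pd

X = np.random.rand(1_000_000, 20)            # stand-in for the real feature matrix
print(X.nbytes / 1e6, "MB as float64")       # 8 bytes per value
X16 = X.astype(np.float16)
print(X16.nbytes / 1e6, "MB as float16")     # 2 bytes per value, ~4x smaller

# pandas can also downcast per column, though only as far as float32:
df = pd.DataFrame(X)
df_small = df.apply(pd.to_numeric, downcast="float")
print(df_small.dtypes.unique())
```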
@jonaslandsgesell good idea! We automatically do that already if the dataset is too large. We also automatically subsample the data if it is still too large to fit in memory, but this means that not all of the original data gets used.
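For reference, that automatic behaviour can also be steered from the estimator constructor; a rough sketch is below. The argument names (memory_limit in MB, dataset_compression with "precision"/"subsample" methods) are from memory of recent auto-sklearn releases, so please double-check them against the docs for the version you have installed:

```python
import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    memory_limit=16384,
    dataset_compression={
        "memory_allocation": 0.2,               # share of memory_limit the data may occupy
        "methods": ["precision", "subsample"],  # downcast first, subsample only if still too big
    },
)
```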