
Improve the way we decide if all data can fit in memory #49


franchuterivera commented 3 years ago

The following attribute decides whether all data fits in memory; if so, we load and pre-process the data before fit:

https://github.com/LMZimmer/Auto-PyTorch_refactor/blob/ac7a9ce35e87a428caca2ac108b362a54d3b8f3a/autoPyTorch/datasets/base_dataset.py#L76

If that is not the case, we load the data in batches through the data loader, as in the sketch below.
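For illustration only (this is not the actual Auto-PyTorch loader), a minimal sketch of batch-wise access through a `torch.utils.data.DataLoader`; `LazyDataset` is a hypothetical dataset that materializes one row at a time instead of holding everything in RAM:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class LazyDataset(Dataset):
    """Hypothetical dataset that fetches one row at a time
    (e.g. from disk or a memmap) rather than keeping all rows in RAM."""

    def __init__(self, num_rows: int, num_features: int):
        self.num_rows = num_rows
        self.num_features = num_features

    def __len__(self) -> int:
        return self.num_rows

    def __getitem__(self, idx: int):
        # In practice this would read row `idx` from storage;
        # random data stands in for it here.
        return torch.randn(self.num_features), torch.tensor(0)


loader = DataLoader(LazyDataset(num_rows=100_000, num_features=20), batch_size=64)
for X_batch, y_batch in loader:
    pass  # only one batch is resident in memory at a time
```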

We can proactively use pandas' `memory_usage` to estimate the memory consumption of the full dataset, for example:
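A minimal sketch of that estimate; the DataFrame here is just a stand-in for the real input data:

```python
import pandas as pd

df = pd.DataFrame({"a": range(1_000_000), "b": [0.5] * 1_000_000})

# deep=True also accounts for object-dtype columns, so the estimate
# is closer to the true footprint than the default shallow count.
dataset_bytes = df.memory_usage(deep=True).sum()
print(f"estimated dataset size: {dataset_bytes / 2**20:.1f} MiB")
```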

To decide whether a dataset fits in the given memory, one can use the memory limit specified in the fit dictionary to work out how much memory we may use (or default to a psutil memory query to get the total memory in the system). The memory burden of fitting a pipeline is a factor of the dataset size. We could create a heuristic for how much memory a pipeline costs, but in practice this is hard to measure. In my opinion, pre-processing the whole data should only be done for a really small dataset, so if a dataset consumes at most, say, half or a third of the total virtual memory, we should turn this on. A sketch of such a heuristic follows below.
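A hedged sketch of what such a check could look like; `fits_in_memory`, `memory_limit_bytes`, and the `fraction` threshold are hypothetical names, and the fallback uses `psutil.virtual_memory().total` as the total-memory query mentioned above:

```python
from typing import Optional

import psutil


def fits_in_memory(dataset_bytes: int,
                   memory_limit_bytes: Optional[int] = None,
                   fraction: float = 0.33) -> bool:
    """Pre-process the full dataset only if it occupies at most
    `fraction` of the available memory budget.

    `memory_limit_bytes` would come from the fit dictionary; if it is
    absent, fall back to the total system memory reported by psutil.
    """
    if memory_limit_bytes is None:
        memory_limit_bytes = psutil.virtual_memory().total
    return dataset_bytes <= fraction * memory_limit_bytes
```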

Whether to do it when the dataset consumes half, a third, or a fourth of memory can be determined empirically. Let me know if you have a better idea!