IIIS-Li-Group / OpenFE

OpenFE: automated feature generation with expert-level performance

Very high latency even on modest datasets #39

Open twistedcubic opened 1 year ago

twistedcubic commented 1 year ago

Hello OpenFE authors,

We would appreciate some help getting the code to finish within a reasonable amount of time. For example, on a dataset with ~7K samples, OpenFE generates 1.3 million candidate transforms and takes many hours to run, even on a high-memory 96-core machine (all cores are used). We have already tried all the suggestions outlined at https://openfe-document.readthedocs.io/en/latest/parameter_tuning.html. On larger datasets, OpenFE often never finishes, or crashes.

In particular, we cannot reproduce the latency results reported in the paper.

We are following examples used in this repository. Was there anything done differently for the paper that we can use to run the code in this repo?

Thanks, Yihe

ZhangTP1996 commented 12 months ago

Without the code or datasets, it is hard to diagnose the cause. In my experience, for a dataset with this few samples and millions of candidate features, OpenFE should terminate within one or two hours on 32 cores. Here are two possible directions:

  1. Do not use all the cores. Using all of them may slow down LightGBM. Please try 48.
  2. Try feature selection to reduce the number of base features (see the sketch below).
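For example, a rough sketch of such base-feature selection (plain LightGBM gain importance, not part of OpenFE's API; `X`, `y`, and `top_k` are placeholders):

```python
# Rough sketch: keep only the top_k base features by LightGBM gain importance
# before passing the data to OpenFE. `X` (pandas DataFrame), `y` (labels) and
# `top_k` are placeholders, not OpenFE parameters.
import pandas as pd
import lightgbm as lgb

top_k = 100  # hypothetical budget for base features

model = lgb.LGBMClassifier(n_estimators=200, importance_type="gain", n_jobs=8)
model.fit(X, y)  # use LGBMRegressor for regression tasks

importance = pd.Series(model.feature_importances_, index=X.columns)
X_reduced = X[importance.nlargest(top_k).index]  # pass X_reduced to OpenFE
```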

twistedcubic commented 12 months ago

Thanks for your reply.

Here is how we are calling OpenFE. We tried a variety of datasets to make sure the problem is not dataset-specific, such as Isolet, Ames housing, and MNIST, and we tried different numbers of cores.
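The exact script is not reproduced here; the call follows roughly the standard pattern from the repository examples (dataset loading and parameter values vary per run):

```python
# Roughly the usage pattern from the repository examples (not the exact script);
# X_train, X_test, y_train and n_jobs vary per dataset and run.
from openfe import OpenFE, transform

ofe = OpenFE()
# candidate feature generation plus the two-stage evaluation
features = ofe.fit(data=X_train, label=y_train, n_jobs=n_jobs)
# append the selected features to the train/test sets
X_train_new, X_test_new = transform(X_train, X_test, features, n_jobs=n_jobs)
```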

The 1.3 million candidate features quoted above were generated by OpenFE for Isolet, which has ~600 base features. Do you have suggestions on how to reduce the number of candidate features generated?

Alternatively, what feature selection method do you recommend? We want OpenFE to perform as well as possible within a reasonable amount of time, i.e. hours rather than tens of hours.

Yihe

MonoHue commented 12 months ago

UPDATE

I succeeded in running OpenFE on a dataset with shape (5000000, 200) in about 15 hours. It consumed about 200 GB of RAM.

I modified it as follows:

  1. reduce stage1_ratio here.
  2. reduce the number of candidate features calculated per process in multiprocessing, like this.
  3. further reduce n_jobs (7 in my case), and increase the n_jobs of the GBM here to compensate for the performance loss.

Hi,

I have a similar problem. I have a dataset with shape (570000, 200), for which OpenFE suggested 220,000 new features. I am running it on a machine with 64 physical cores and 500 GB of RAM, with n_jobs=15. By the time about 110,000 new features have been computed, memory is completely used up. Then most of the memory is freed and the program does not appear to make any further progress.

Here is the CPU and memory usage log:

[screenshot: CPU and memory usage over time]

The program stopped at the point where memory was fully occupied:

[screenshot: process state once memory filled up]

Filtering features is hard in my case; is there any optimization to reduce memory usage?

Thanks

ZhangTP1996 commented 11 months ago

> (quoting MonoHue's comment above)

Apologies for the delayed response. I would suggest downsampling your dataset to 50K samples prior to inputting it into OpenFE. Based on my experience, this method of downsampling is unlikely to significantly impact the overall performance. Your solution is also very practical.
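A minimal sketch of that downsampling step (plain scikit-learn, not part of OpenFE; `X` and `y` are the full data, and the 50K figure follows the suggestion above):

```python
# Minimal sketch: downsample to ~50K rows before passing the data to OpenFE.
# Assumes a pandas DataFrame `X` and a label Series `y`; for regression, drop
# the `stratify` argument.
from sklearn.model_selection import train_test_split

X_small, _, y_small, _ = train_test_split(
    X, y,
    train_size=50_000,   # keep ~50K samples, as suggested above
    stratify=y,          # preserve the label distribution for classification
    random_state=0,
)

# X_small / y_small then go into OpenFE's fit() instead of the full dataset.
```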