Open · twistedcubic opened this issue 1 year ago
Without the code or datasets provided, it is hard to diagnose the reasons. In my previous experience, for a dataset with this few samples and millions of candidate features, OpenFE should terminate within one or two hours on 32 cores. Here are two possible directions:
1. Do not use all the cores. Using all of them may slow down LightGBM; please try 48.
2. Try feature selection to reduce the number of base features.
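For the second direction, any standard selector works; for example, a minimal sketch with scikit-learn (the data and the k=100 threshold are only illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Illustrative data; replace with your real base features and labels.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(7000, 600)), columns=[f"f{i}" for i in range(600)])
y = rng.integers(0, 2, size=7000)

# Keep only the k most informative base features before running OpenFE,
# which shrinks the pool of candidate transformations OpenFE enumerates.
selector = SelectKBest(mutual_info_classif, k=100)
selector.fit(X, y)
X_reduced = X.loc[:, selector.get_support()]
```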
Thanks for your reply.
We tried a variety of datasets (Isolet, Ames housing, MNIST, etc.) to make sure the problem is not dataset specific, and we tried different numbers of cores. Here is how we are calling OpenFE:
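In essence it is the README example; the data below is only a placeholder for each dataset.

```python
import numpy as np
import pandas as pd
from openfe import OpenFE, transform

# Placeholder data standing in for Isolet, Ames housing, MNIST, etc.
rng = np.random.default_rng(0)
train_x = pd.DataFrame(rng.normal(size=(7000, 600)), columns=[f"f{i}" for i in range(600)])
test_x = pd.DataFrame(rng.normal(size=(1000, 600)), columns=[f"f{i}" for i in range(600)])
train_y = pd.DataFrame({"label": rng.integers(0, 2, size=7000)})
n_jobs = 96

ofe = OpenFE()
# Enumerate and screen candidate features; this is the step that takes many hours.
features = ofe.fit(data=train_x, label=train_y, n_jobs=n_jobs)
# Materialize the selected features on the train and test data.
train_x, test_x = transform(train_x, test_x, features, n_jobs=n_jobs)
```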
The 1.3 million candidate features I quoted were generated by OpenFE for Isolet, which has ~600 base features. Do you have suggestions on how to reduce this number of candidate features?
Alternatively, what feature selection method do you recommend? We want OpenFE to perform as well as possible within a reasonable amount of time, i.e. hours, not tens of hours.
Yihe
UPDATE
I succeeded in running OpenFE on a dataset with shape (5000000, 200) in about 15 hours. It consumed about 200GB RAM.
I modified it as follows:
- reduce stage1_ratio here;
- reduce the number of candidate features calculated per process in multiprocessing, like this;
- further reduce n_jobs (7 in my case), and increase the n_jobs of the gbm here to compensate for the performance loss (see the sketch below).
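To illustrate the last point: the total CPU budget is roughly n_jobs (worker processes) times the threads each internal LightGBM model uses, so cutting the number of processes can be compensated by giving each model more threads. A standalone sketch of that tradeoff with plain LightGBM and multiprocessing (this is not OpenFE's actual code; the block-scoring function is only illustrative):

```python
import os
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from lightgbm import LGBMRegressor

TOTAL_CORES = os.cpu_count() or 64
N_WORKERS = 7                                    # fewer processes -> lower peak memory
GBM_THREADS = max(1, TOTAL_CORES // N_WORKERS)   # spend the spare cores inside each model

def score_block(seed: int) -> float:
    """Illustrative stand-in for evaluating one block of candidate features."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(5000, 50))
    y = X[:, 0] + 0.1 * rng.normal(size=5000)
    model = LGBMRegressor(n_estimators=100, n_jobs=GBM_THREADS)
    model.fit(X, y)
    return model.score(X, y)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=N_WORKERS) as pool:
        print(list(pool.map(score_block, range(N_WORKERS))))
```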
Hi,
I have a similar problem: my dataset has shape (570000, 200), and OpenFE suggested 220000 new features. I am running it on a machine with 64 physical cores and 500GB RAM, with n_jobs=15. By the time about 110000 of the new features have been computed, memory is completely used up; then most of the memory is freed and the program does not appear to continue.
Here's the CPU and MEM log:
The program stopped at the point where memory was fully occupied.
Filtering features is hard in my case; is there any optimization to reduce memory usage?
Thanks
Apologies for the delayed response. I would suggest downsampling your dataset to 50K samples prior to inputting it into OpenFE. Based on my experience, this method of downsampling is unlikely to significantly impact the overall performance. Your solution is also very practical.
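A minimal sketch of that workflow (file and column names are placeholders; the OpenFE calls follow the README example):

```python
import pandas as pd
from openfe import OpenFE, transform

# Placeholder file/column names; adapt to your dataset (e.g. 570000 rows x 200 columns).
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
label = train[["target"]]
train = train.drop(columns=["target"])

# Run the feature search on a 50K-row sample only.
sample = train.sample(n=50_000, random_state=0)
sample_label = label.loc[sample.index]

ofe = OpenFE()
features = ofe.fit(data=sample, label=sample_label, n_jobs=15)

# Apply the discovered features to the full train and test data afterwards.
train_aug, test_aug = transform(train, test, features, n_jobs=15)
```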
Hello OpenFE authors,
We would appreciate some help getting the code to finish within a reasonable amount of time. For example, on a dataset with ~7K samples, OpenFE generates 1.3 million candidate transforms and takes many hours to run, even on a high-memory 96-core machine with all cores in use. We have already tried all of the suggestions outlined at https://openfe-document.readthedocs.io/en/latest/parameter_tuning.html. On larger datasets, OpenFE often never finishes or crashes.
For instance, we cannot reproduce the latency results reported in the paper.
We are following the examples used in this repository. Was anything done differently for the paper that we could use when running the code in this repo?
Thanks, Yihe