RGF-team / rgf

Home repository for the Regularized Greedy Forest (RGF) library. It includes the original implementation from the paper and a multithreaded one written in C++, along with various language-specific wrappers.

Support f_ratio? #162

Open fukatani opened 6 years ago

fukatani commented 6 years ago

I found an undocumented parameter, f_ratio, in RGF. It corresponds to LightGBM's feature_fraction and XGBoost's colsample_bytree.

I tried this parameter with the Boston regression example. With a small max_leaf (300), f_ratio=0.9 improved the score from 11.8 to 11.0, but with a large max_leaf (5000), f_ratio=0.95 degraded the score from 10.19810 to 10.34.

So, is there no value in using f_ratio < 1.0?

StrikerRUS commented 6 years ago

Please read my conversation with Rie about it from last year.

Thank you for the explanation! My mistake was thinking that RGF uses subsets of features, that f_ratio wasn't abandoned, and that it isn't equal to one by default.

With best wishes, Nikita Titov

From: Rie Johnson ***@***.com Sent: August 21, 2017, 21:02 To: Titov Nikita Subject: Re: rgf1.2 hidden parameters

Hi Nikita,

Unlike, say, random forests, RGF is deterministic ("R" is for regularized), that is, RGF does not use random numbers.

The random_seed parameter you found would become active only when "f_ratio" is specified. "f_ratio" was meant to be for randomly choosing a subset of features (e.g., with "f_ratio=0.8", only 80% of the features are used each time or something like that), but it didn't do anything good, so "f_ratio" was abandoned.

Best,

Rie

On Mon, Aug 21, 2017 at 11:07 AM, Titov Nikita nekit94-12@hotmail.com wrote: Hi Rie,

OK, I understand your position. I'll forget about them. But there is one parameter that I didn't include in the list in my previous letter and that is present in AzRgf_kw.hpp. I'm talking about random_seed.

It seems not to be implemented. I think it's a very useful parameter in competitive ML. As I understand it, rgf currently uses a hardcoded random seed, because results are reproducible. Please tell me whether it is difficult to turn this parameter on. I found the line p.vInt(kw_random_seed, &random_seed); which is similar to the others used to read user parameters. So I don't understand why I get a warning about an unknown parameter.

With best wishes, Nikita Titov

From: Rie Johnson ***@***.com Sent: August 21, 2017, 16:13 To: Titov Nikita Subject: Re: rgf1.2 hidden parameters

Hi Nikita,

Please ignore those unofficial parameters.

They were not made public for various reasons. Some are not used by RGF but used by other methods implemented for comparison, which are not included in rgf1.2. Some were abandoned, because they were found not to be useful. Although I didn't remove them from the source code, by excluding them from the public interface in the documentation, I saved myself from the burden of testing them for the code release or describing their usage. I'd like to keep it that way.

Best,

Rie

On Fri, Aug 18, 2017 at 5:14 PM, Titov Nikita nekit94-12@hotmail.com wrote: Hello Rie,

Sorry for so many questions, but I have another one. I found out that there are many parameters in the file AzRgf_kw.hpp that are not documented in the guide. While their descriptions are defined in the same file, I can't find their default values. Please correct me if I'm mistaken about their possible ranges and help fill in the '???'.

param                  type   range    default
---------------------------------------------------------------
shrink                 float  >=0      ???
max_depth              int    -1; >0   -1
max_leaf_tree          int    -1; >0   -1
ApproxPenalty          bool   -        false  // Applies only to RGF_Sib and RGF_Opt
max_tree               int    >=0      max(1; max_leaf_forest / 2)
f_ratio                float  0<x<=1   ???
PassiveRoot            bool   -        false
UseInternalNodes       bool   -        false
WidthFirst             bool   -        ???  // Seems to be unused
exit_delta             float  ???      ???
max_delta              float  ???      ???
UseIntercept           bool   -        false
reg_L1                 float  >=0      0
reg_sL1                float  >=0      0  // It overrides L1, right? But in what case?
reg_depth_polynomial   ???    ???      ???  // Seems to be unused
reg_depth_offset       ???    ???      ???  // Seems to be unused
RegularizeRoot         bool   -        ???  // Seems to be unused
min_penalty_ite        int    >0       20

With best wishes, Nikita Titov

fukatani commented 6 years ago

Thanks! OK, I've stopped using f_ratio.

Depending on the problem, f_ratio may work...

StrikerRUS commented 6 years ago

I think you should talk with Rie about it.

P.S. I'm in the process of adding her to RGF-team.

StrikerRUS commented 6 years ago

Bringing @riejohnson here. Hello Rie! Do you have something to add?

riejohnson commented 6 years ago

Hi. f_ratio didn't make it into the official interface of rgf because I thought it wasn't useful -- not consistently, at least. And so basically, it's untested.

It's possible that it works and it's useful in some cases, but looking at the code, there is a potential problem when compiled with Visual C++ and when the number of features is very large. The thing is, "rand()" is used for picking the features, and so there will be a problem if the number of features is larger than RAND_MAX of the compiler -- only the first (RAND_MAX-1) features would be picked in that case. No problem with gnu g++ as its RAND_MAX is very large, but it's small (RAND_MAX=32767) with Visual C++. I'm not sure how likely it is to have more than 32767 features, though.
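The limitation described above can be illustrated with a minimal sketch. It assumes a feature index is obtained as rand() % num_features; the actual selection code in RGF may differ, and the helper name max_reachable_index is hypothetical. Because rand() only returns values in [0, RAND_MAX], no index above RAND_MAX can ever be drawn, however many times rand() is called.

```cpp
#include <algorithm>
#include <cstdlib>

// Hypothetical helper (not part of RGF): the largest feature index that
// `rand() % num_features` can ever produce. rand() returns values in
// [0, RAND_MAX], so indices above RAND_MAX are unreachable.
long long max_reachable_index(long long num_features) {
    const long long rand_max = RAND_MAX;
    return std::min(num_features - 1, rand_max);
}
```

With a small RAND_MAX, any feature whose index exceeds that bound would never be selected under this assumption.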

fukatani commented 6 years ago

@riejohnson Thank you for joining our team! And thank you for the information.

Since it is a parameter adopted by the major decision forest libraries, I think it is worth considering.

And we had better use std::random instead of rand() if we can.

riejohnson commented 6 years ago

Hi. I guess if f_ratio works as it is, it wouldn't hurt to promote it into the official interface with a clear note of the limitation that the number of features must be no greater than RAND_MAX. If f_ratio becomes official, random_seed should become official too.

std::rand() does exactly the same as rand(), at least on Visual Studio and gnu, and so it doesn't seem to me worth changing, does it?

fukatani commented 6 years ago

Yes. To be precise, we need std::mt19937.

riejohnson commented 6 years ago

std::mt19937 in C++11 instead of std::rand()? I see. If you go for it, please don't forget to change srand in AzRgForest.cpp too.
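A minimal sketch of what such a change might look like, assuming a subset of roughly f_ratio * num_features feature indices is drawn each time. The function pick_features and its signature are hypothetical and not taken from the RGF source; only std::mt19937, std::shuffle, and the other standard library calls are real C++11.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper, not RGF's actual code: returns a sorted random subset
// of feature indices holding about f_ratio * num_features entries (at least
// one when num_features > 0).
std::vector<std::size_t> pick_features(std::size_t num_features, double f_ratio,
                                       std::mt19937 &rng) {
    std::vector<std::size_t> indices(num_features);
    std::iota(indices.begin(), indices.end(), std::size_t{0});  // 0, 1, ..., n-1
    if (num_features == 0) return indices;

    // std::shuffle draws from the engine through a distribution, so the
    // reachable index range is not capped by RAND_MAX.
    std::shuffle(indices.begin(), indices.end(), rng);

    std::size_t keep = static_cast<std::size_t>(f_ratio * num_features);
    keep = std::max<std::size_t>(1, std::min(keep, num_features));
    indices.resize(keep);
    std::sort(indices.begin(), indices.end());
    return indices;
}
```

Constructing the engine once from a user-supplied seed, for example std::mt19937 rng(random_seed), would keep runs reproducible, and that one engine would take over the role of both the rand() calls and the srand call.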

StrikerRUS commented 5 years ago

@fukatani Any news?