Open fukatani opened 6 years ago
Please read my conversation with Rie about it from last year.
Thank you for the explanation! My mistake was to think that RGF uses subsets of features and f_ratio isn't abandoned and isn't equal to one by default.
With best wishes, Nikita Titov
From: Rie Johnson ***@***.com Sent: August 21, 2017, 21:02 To: Titov Nikita Subject: Re: rgf1.2 hidden parameters
Hi Nikita,
Unlike, say, random forests, RGF is deterministic ("R" is for regularized), that is, RGF does not use random numbers.
The random_seed parameter you found would become active only when "f_ratio" is specified. "f_ratio" was meant to be for randomly choosing a subset of features (e.g., with "f_ratio=0.8", only 80% of features are used each time or something like that), but it didn't do anything good, so "f_ratio" was abandoned.
Best,
Rie
On Mon, Aug 21, 2017 at 11:07 AM, Titov Nikita nekit94-12@hotmail.com wrote: Hi Rie,
OK, I understand your position. I'll forget about them. But there is one parameter which I didn't include in the list in my previous letter and which is present in AzRgf_kw.hpp. I'm talking about random_seed.
It seems not to be implemented. I think it's a very useful parameter in competitive ML. As I understand it, rgf currently uses a hardcoded random seed, because results are reproducible. Please tell me whether it is difficult to turn this parameter on. I found the line p.vInt(kw_random_seed, &random_seed); which is similar to the others used to read user parameters. So I don't understand why I get a warning about an unknown parameter.
With best wishes, Nikita Titov
From: Rie Johnson ***@***.com Sent: August 21, 2017, 16:13 To: Titov Nikita Subject: Re: rgf1.2 hidden parameters
Hi Nikita,
Please ignore those unofficial parameters.
They were not made public for various reasons. Some are not used by RGF but used by other methods implemented for comparison, which are not included in rgf1.2. Some were abandoned, because they were found not to be useful. Although I didn't remove them from the source code, by excluding them from the public interface in the documentation, I saved myself from the burden of testing them for the code release or describing their usage. I'd like to keep it that way.
Best,
Rie
On Fri, Aug 18, 2017 at 5:14 PM, Titov Nikita nekit94-12@hotmail.com wrote: Hello Rie,
Sorry for so many questions, but I have another one. I found out that there are many parameters in the file AzRgf_kw.hpp which are not documented in the guide file. While their descriptions are defined in the same file, I can't find their default values. Please correct me if I'm mistaken about their possible ranges and help fill in the '???'.
param                  type    range     default
--------------------------------------------------------------------------
shrink                 float   >=0       ???
max_depth              int     -1; >0    -1
max_leaf_tree          int     -1; >0    -1
ApproxPenalty          bool    -         false    // Applies only to RGF_Sib and RGF_Opt
max_tree               int     >=0       max(1; max_leaf_forest / 2)
f_ratio                float   0<x<=1    ???
PassiveRoot            bool    -         false
UseInternalNodes       bool    -         false
WidthFirst             bool    -         ???      // Seems to be unused
exit_delta             float   ???       ???
max_delta              float   ???       ???
UseIntercept           bool    -         false
reg_L1                 float   >=0       0
reg_sL1                float   >=0       0        // It overrides L1, right? But in what case?
reg_depth_polynomial   ???     ???       ???      // Seems to be unused
reg_depth_offset       ???     ???       ???      // Seems to be unused
RegularizeRoot         bool    -         ???      // Seems to be unused
min_penalty_ite        int     >0        20
With best wishes, Nikita Titov
Thanks!
OK, I stopped using f_ratio.
Depending on the problem, f_ratio may work...
I think you should talk with Rie about it.
P.S. I'm in the process of adding her to the RGF team.
Bringing @riejohnson here. Hello Rie! Do you have something to add?
Hi. f_ratio didn't make it into the official interface of rgf because I thought it wasn't useful -- not consistently, at least. And so basically, it's untested.
It's possible that it works and it's useful in some cases, but looking at the code, there is a potential problem when compiled with Visual C++ and when the number of features is very large. The thing is, "rand()" is used for picking the features, and so there will be a problem if the number of features is larger than RAND_MAX of the compiler -- only the first (RAND_MAX-1) features would be picked in that case. No problem with gnu g++ as its RAND_MAX is very large, but it's small (RAND_MAX=32767) with Visual C++. I'm not sure how likely it is to have more than 32767 features, though.
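To make the RAND_MAX limitation concrete, here is a minimal sketch. It assumes an index is derived as `r % num_features` with `r` drawn from `[0, gen_max]`; that mapping is an assumption for illustration, since the actual RGF selection code is not shown in this thread, and `reachable_indices` is a hypothetical helper, not an RGF function. It counts how many distinct feature indices can ever be produced.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical illustration (not RGF code): if a feature index is computed as
// r % num_features with r drawn from [0, gen_max], return how many distinct
// indices can ever come out.
std::size_t reachable_indices(std::size_t gen_max, std::size_t num_features) {
    assert(num_features > 0);
    // Once r can reach num_features - 1, every index is reachable.
    // Otherwise r % num_features == r, so only gen_max + 1 indices appear.
    return (gen_max >= num_features - 1) ? num_features : gen_max + 1;
}
```

Under this assumption, a generator capped at 32767 (the Visual C++ RAND_MAX, 0x7fff) can reach only the first 32768 of 100000 features, while a generator capped at 2147483647 reaches all of them. The exact count in RGF depends on how its code maps rand() to an index.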
@riejohnson Thank you for joining our team! And thank you for your information.
Since it is a parameter adopted by major decision forest libraries, I think that it is worth considering.
And we had better use std::random instead of rand() if we can.
Hi. I guess if f_ratio works as it is, it wouldn't hurt to promote it into the official interface with a clear note of the limitation that the number of features must be no greater than RAND_MAX. If f_ratio becomes official, random_seed should become official too.
std::rand() does exactly the same as rand(), at least on Visual Studio and gnu, and so it doesn't seem to me worth changing, does it?
Yes. Properly, we need std::mt19937.
std::mt19937 in C++11 instead of std::rand()? I see. If you go for it, please don't forget to change srand in AzRgForest.cpp too.
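A minimal sketch of what seeded feature subsampling with std::mt19937 could look like. This is not the RGF implementation: `sample_features` is a hypothetical helper, and the rule "keep floor(f_ratio * num_features) features, at least one" is an assumption about what f_ratio is meant to do.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper (not part of RGF): returns a sorted random subset of
// feature indices containing floor(f_ratio * num_features) entries, at least 1.
std::vector<std::size_t> sample_features(std::size_t num_features,
                                         double f_ratio,
                                         std::mt19937 &rng) {
    assert(num_features > 0);
    assert(f_ratio > 0.0 && f_ratio <= 1.0);

    // Start from all feature indices 0 .. num_features - 1.
    std::vector<std::size_t> indices(num_features);
    std::iota(indices.begin(), indices.end(), std::size_t{0});

    std::size_t keep =
        static_cast<std::size_t>(f_ratio * static_cast<double>(num_features));
    keep = std::max<std::size_t>(keep, 1);

    // std::mt19937 produces 32-bit values, so the shuffle is not limited by
    // RAND_MAX the way rand() is.
    std::shuffle(indices.begin(), indices.end(), rng);
    indices.resize(keep);
    std::sort(indices.begin(), indices.end());
    return indices;
}
```

Constructing the generator from a user-supplied seed, e.g. `std::mt19937 rng(random_seed);`, would play the role that srand plays today. Two generators built from the same seed yield the same subset within one standard library, although std::shuffle may produce different orderings across standard library implementations.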
@fukatani Any news?
I found an undocumented parameter, f_ratio, in RGF. It corresponds to LightGBM's feature_fraction and XGBoost's colsample_bytree. I tried this parameter with the Boston regression example. With a small max_leaf (300), f_ratio=0.9 improves the score from 11.8 to 11.0, but with a large max_leaf (5000), f_ratio=0.95 degraded the score from 10.19810 to 10.34. After all, is there no value in using f_ratio < 1.0?