Closed: ludoro closed this issue 4 years ago
Hi @ludoro,
Glad you are finding it useful! You can read more about it in our paper https://doi.org/10.1016/j.asoc.2019.106050, where Figure 3 shows its effect. In general, you can think of it as a distance between the categorical dimensions when the sampling plan is optimised.
For the example in the documentation (https://mrurq.github.io/LatinHypercubeSampling.jl/stable/man/categorical/), you can think of catWeight=1000 as a large separation between the categorical dimensions, which is similar to making a separate LHC plan for each category. catWeight=0 can be interpreted as no separation between the categorical dimensions, where the category for each point is selected randomly. The risk of setting it to 0 is that all points in one category could become clustered to one side of the design space without any penalty. In general I would suggest using some separation, like catWeight=1, to prevent this from happening.
A small note: in the paper the weight values refer to an LHC scaled from 0 to 1. In this package the LHC uses unscaled integers from 1 to N, where N is the number of samples, so a catWeight=1 is the same as the step distance in each dimension.
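To make the interpretation above concrete, here is a minimal Python sketch of the idea, not the package's actual implementation: an Audze-Eglais-style objective (sum of inverse squared distances over all point pairs) where each categorical dimension adds catWeight to the distance between points in different categories. The function name `ae_objective` and the exact weighting scheme are illustrative assumptions.

```python
import itertools

def ae_objective(points, cats, cat_weight):
    """Audze-Eglais-style objective: sum of 1/d^2 over all point pairs.

    Continuous coordinates are unscaled integers 1..N (as in the package).
    Illustrative assumption: a categorical dimension contributes
    cat_weight**2 to the squared distance whenever two points belong
    to different categories, and nothing when they share a category.
    """
    total = 0.0
    for (p, cp), (q, cq) in itertools.combinations(zip(points, cats), 2):
        # Squared Euclidean distance in the continuous dimensions.
        d2 = sum((a - b) ** 2 for a, b in zip(p, q))
        # Categorical separation: only cross-category pairs are pushed apart.
        if cp != cq:
            d2 += cat_weight ** 2
        total += 1.0 / d2
    return total

# Four points on a 1-D LHC (integer grid 1..4), two categories.
points = [(1,), (2,), (3,), (4,)]
cats = [0, 1, 0, 1]

# With cat_weight = 0 the categories are ignored entirely; with a large
# cat_weight the cross-category pairs contribute almost nothing, so the
# objective is dominated by within-category spacing, as if each category
# had its own LHC plan.
print(ae_objective(points, cats, 0.0))
print(ae_objective(points, cats, 1000.0))
```

This also shows why catWeight=1 is a natural default on the integer grid: it penalises same-category points at the same rate as points one grid step apart in a continuous dimension.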
I see, thanks a lot!
Hey @MrUrq,
Super cool library. I am working on MLJTuning, where we want to have a LatinHypercube hyper-parameter optimization method, so I am using your library there. One small issue I have is the use of catWeight: there are many cases where we have categorical values, but it is not very clear how that parameter works, so at the moment I just always set it to 0. I have not found any reference to it in the two papers you list as references; would you care to shed some light on it?
Thanks a lot!