bd2kccd / causal-cmd

16 stars 8 forks source link

Specifying random seed when sampling in bootstrap (wish) #80

Closed yasu-sh closed 1 year ago

yasu-sh commented 1 year ago

Hi developers,

Thank you for your running and maintainance this project.

I have tried to utilize the specific random seed for bootstrapping since I am requested to get the same result for the reproduction-reliabilty-sensitive case from a colleague. I just put into the RandomUtil class in the code, but it does not affects the result.(every time different results).

Could you add the new function to enable the specific random seed for the sampling in bootstrap? In tetrad-lib and tetrad-gui, test code use many setSeed since it is always mandatory for test. This is just my wishes. It helps for checking reliablility at user-side.

I hope that you tell me the place for putting "setSeed" methods into the somewhere in program. My java skill is limited. but I have made up the dev environment and tried some modification for understanding. I thought parallel search makes threads and each seed. As checking the code, I now understand bootstrap-sampling and searches are separated. runParallel affects only multi-thread serarch.

image

image

DataUtils.java image

yasu-sh commented 1 year ago

@kvb2univpitt if I should make an issue to tetrad's original repository, let me know.

kvb2univpitt commented 1 year ago

@yasu-sh I can take a look at it first. It may be a tetrad's issue.

kvb2univpitt commented 1 year ago

@yasu-sh We decide to create another method with the seed parameter. See https://github.com/cmu-phil/tetrad/issues/1413.

yasu-sh commented 1 year ago

@kvb2univpitt I really appreciate your consideration and implementation!

yasu-sh commented 1 year ago

@kvb2univpitt Thank you for your implementation. I have tried ver. 1.5.0 jar file downloaded from CI site as README's description. The result become much better.

Before random seed setting enabled, the result changed about 50 % probability when I use asia dataset. (bnlearn: 8 variables / 5000samples). Now I get 100% probabilities of the final result is the same. Excellent. As for hailfinder (56 variables 20000 samples), the result becomes sometimes different, even at NO bootstrapping search. image

Let me confirm the small matter.

I have read the code and I understand @kvb2univpitt made different trials to set.seed, it could not be done easily. Not just put the set.seed function in the bootstrapping sampling.

Used code java -Xmx3G -Xss4096k -jar causal-cmd-1.5.0-jar-with-dependencies.jar --data-type discrete --delimiter comma --score bdeu-score --json-graph --maxDegree 6 --algorithm fges --dataset dt.tetrad.csv --prefix c-tetrad --seed 42 --priorEquivalentSampleSize 10 --numberResampling 9 --addOriginalDataset --percentResampleSize 90 --resamplingEnsemble 2 --parallelized --symmetricFirstStep

prior.txt c-tetrad_out.txt dt.tetrad.csv replaced variable and catetory names coverted from original bnlearn code

condition Win10 22H2

java -version openjdk version "11.0.18" 2023-01-17 OpenJDK Runtime Environment Temurin-11.0.18+10 (build 11.0.18+10) OpenJDK 64-Bit Server VM Temurin-11.0.18+10 (build 11.0.18+10, mixed mode)

kvb2univpitt commented 1 year ago

@yasu-sh I am glad adding the seed works out for you. Yes, the seed is currently only applies to bootstrapping in causal-cmd, since running search algorithm is the only thing that it does. @jdramsey (Joe) would like to extends the seed parameter to data simulation and possibly other features in the future. That's why the parameter name is seed instead of bootstrap-seed.

yasu-sh commented 1 year ago

@kvb2univpitt OK! I understand the backbround. I'll see and use v1.5.0 release with trust.