cmu-phil / tetrad

Repository for the Tetrad Project, www.phil.cmu.edu/tetrad.
GNU General Public License v2.0

Generating data sets by different seed number? #50

Closed biotech25 closed 8 years ago

biotech25 commented 8 years ago

When I generate 1000 data sets in the Data box, can I be sure that each of the 1000 data sets will be generated with a different seed? Is there any chance that I would get identical data sets generated from the same seed? This question assumes that my graph has many causal variables and that the sample size is large enough for more than 1000 distinct data sets to be possible.

I tried generating 10 data sets with just 1 causal variable and 1 target, with the sample size set to 2. As one would expect, many of the data sets (7 to 8 of them) were identical. So I was curious whether TETRAD is programmed to assign a different seed to each data set when generating 1000 data sets.
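The duplicates in the small case can be reproduced without any seed problem at all. The sketch below is a plain-Java illustration, not Tetrad code: it assumes binary variables and ignores the causal structure, and the class and method names (`TinyDataSets`, `generate`, `countDistinct`) are made up for this example. With 2 variables and a sample size of 2 there are only 2^4 = 16 possible data sets, so duplicates are forced once more than 16 are generated, even though a single generator is advanced the whole time.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class TinyDataSets {
    // Generates one data set of binary values, flattened into a string such as "0110".
    // Hypothetical helper: variables are drawn independently, which real causal data would not be.
    public static String generate(Random rng, int variables, int sampleSize) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < variables * sampleSize; i++) {
            sb.append(rng.nextInt(2));
        }
        return sb.toString();
    }

    // Generates several data sets from ONE generator and counts how many are distinct.
    public static int countDistinct(long seed, int dataSets, int variables, int sampleSize) {
        Random rng = new Random(seed);
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < dataSets; i++) {
            seen.add(generate(rng, variables, sampleSize));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // At most 16 distinct results are possible here, whatever the seed.
        System.out.println("distinct of 100 tiny data sets: " + countDistinct(1L, 100, 2, 2));
        // With 10 variables and 100 rows there are 2^1000 possibilities.
        System.out.println("distinct of 100 larger data sets: " + countDistinct(1L, 100, 10, 100));
    }
}
```

In other words, repeated data sets in a tiny discrete setting say nothing about whether seeds are being reused; they only reflect how few outcomes are possible.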

Sorry for asking many questions these days, and thank you, Sanghoon

jdramsey commented 8 years ago

That's actually a pretty good question. I like it. Random number generators generally take the current time in milliseconds as the random seed. But if you create many random number generators very quickly, say within a millisecond, they will all use the same random seed. The workaround for this in Tetrad is RandomUtil, which you can look at. It is created only once, with one random seed, and that same random number generator is used from then on. There are only finitely many long values, but it takes a very long time to go through them all, and good random number generators are designed to have very long cycles. Still, the number is finite, so if you kept generating random numbers forever you would eventually find duplicates. With 1000 data sets, though, it's extraordinarily unlikely.
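To make the point about seeds concrete, here is a minimal sketch using only `java.util.Random`. It is not Tetrad's RandomUtil; the class `SeedDemo` and its `shared()` and `draw()` methods are hypothetical names that only imitate the create-once, reuse-everywhere idea described above. Two generators built from the same seed reproduce the same data, while two consecutive draws from one shared generator do not.

```java
import java.util.Arrays;
import java.util.Random;

public class SeedDemo {
    // Hypothetical stand-in for a shared generator: created once, seeded once.
    private static final Random SHARED = new Random(System.currentTimeMillis());

    public static Random shared() {
        return SHARED;
    }

    // Draws n Gaussian values from the given generator.
    public static double[] draw(Random rng, int n) {
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = rng.nextGaussian();
        }
        return out;
    }

    public static void main(String[] args) {
        // Two fresh generators with the same seed produce identical "data sets".
        long seed = 42L;
        double[] a = draw(new Random(seed), 5);
        double[] b = draw(new Random(seed), 5);
        System.out.println("same seed, identical: " + Arrays.equals(a, b));

        // One shared generator keeps advancing, so consecutive draws differ.
        double[] c = draw(shared(), 5);
        double[] d = draw(shared(), 5);
        System.out.println("shared generator, identical: " + Arrays.equals(c, d));
    }
}
```

The design choice is the same one the answer describes: seeding happens exactly once, so creating many data sets quickly cannot cause two of them to start from the same seed.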

If you like, you can choose a different random number generator. We're using the ones from Apache nowadays, and there are many to choose from. Check out RandomUtil.

biotech25 commented 8 years ago

Thank you for your explanation; it was easy to follow. I plan to use the TETRAD Java code later, and then I will study this more and test it. For now, I am using the Tetrad workspace and the .jar.

I thought maybe there should be a constraint or alert that stops users from generating data sets, or lowers the number of data sets they can generate, when the number of causal variables and the sample size are too small. (I am sorry if this idea gives you a headache.)

I appreciate your time. Sanghoon

jdramsey commented 8 years ago

No problem!