I restructured the sensitivity tests (#80) as follows. They now test three levels for each factor:
Reference group: S = 100, N = 1000, cv = 2 (controls the shape of the SAD), sigma = 10 (controls aggregation; this is the Poisson case)
To test the effect of the SAD: cv = 0.5, 1, 1.5
To test the effect of N: N = 400, 600, 800
To test the effect of aggregation: sigma = 0.02, 0.05, 0.1
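For concreteness, the design could be laid out as a small grid like the sketch below (illustrative only; the object and column names are not necessarily those used in mobr_sensitivity.R):

```r
## Illustrative layout of the design, not the actual mobr_sensitivity.R code:
## one Poisson reference community plus three alternative levels per factor.
ref <- list(S = 100, N = 1000, cv = 2, sigma = 10)  # sigma = 10 ~ Poisson case

scenarios <- rbind(
  data.frame(change = "none",        N = 1000,             cv = 2,              sigma = 10),
  data.frame(change = "N",           N = c(400, 600, 800), cv = 2,              sigma = 10),
  data.frame(change = "SAD",         N = 1000,             cv = c(0.5, 1, 1.5), sigma = 10),
  data.frame(change = "aggregation", N = 1000,             cv = 2,              sigma = c(0.02, 0.05, 0.1))
)
```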
@FelixMay and @dmcglinn let me know if these parameters look good to you, or if they should be changed!
mobr_sensitivity.R ran without error on my computer, but ramping up to Niter = 200 would be very time-consuming. @FelixMay, would it be possible to run it on iDiv's cluster (once the script is checked and merged)?
@rueuntal yes, the parameter settings look fine, and yes, I could easily run the analysis on our cluster here. I think I just need the R script and, ideally, a folder that includes all the files the script depends on, of course.
Here are the newest results for the sensitivity analysis (#80): the reference group is S = 100, N = 1000, cv = 2, sigma = 10 (Poisson). The first three columns of the table show which parameter has been changed. The remaining columns show the total # of comparisons for each test, and the # where the value is significantly different from the null. The first row is the case where there is (almost) no change. Rows 2-4 have different N, rows 5-7 have different SADs, and rows 8-10 have different levels of aggregation (Thomas instead of Poisson). 50 iterations took about 20 hours on my laptop. In each iteration, the null models were generated from 200 runs.
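To make the table layout explicit, here is a hypothetical way to tally it, assuming the raw output is a long data frame sens_res with one row per iteration and test, and columns N, cv, sigma, test, and p_value (these names are illustrative, not necessarily the script's actual output):

```r
## Hypothetical summary of the raw per-iteration results into the table above:
## count comparisons and significant results per scenario and per test.
library(dplyr)

summary_tab <- sens_res %>%
  group_by(N, cv, sigma, test) %>%
  summarise(n_comparisons = n(),
            n_significant = sum(p_value < 0.05),
            .groups = "drop")
```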
It looks really promising to me, even for spatial aggregation! Our method is pretty good at separating the effect of spatial aggregation from the Poisson case, which seems to support Brian's belief that our null model is correct.
@dmcglinn @FelixMay @ngotelli what do you folks think?
Hi Xiao, I am too lazy to calculate how close the error rates are to the standard 5%, but this looks really good and encouraging. Just let me know which files you would like me to run on our HPC cluster if this is still needed.
Hi @FelixMay - running the analysis on the cluster would be nice. Currently the results are based on 50 iterations; ideally we'd probably want 500 or maybe 1000.
Let's see what Dan and Nick think. If we all feel that this looks pretty good, I'll add the three scenarios where two factors are changing simultaneously, so that (hopefully!) we can get all test cases done in one run.
Do we need to rewire the code for it to run on the cluster? (I vaguely remember the R package snow.)
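For what it's worth, the snow interface now lives in base R's parallel package, so the only rewiring should be wrapping a single iteration in a function and farming the iterations out to workers. A minimal sketch, where run_one_iteration() is a hypothetical stand-in for whatever one iteration of mobr_sensitivity.R does:

```r
## Minimal sketch, assuming a hypothetical run_one_iteration(i) that performs
## one full iteration of the sensitivity analysis and returns a data frame.
library(parallel)

Niter <- 200
cl <- makeCluster(20)                    # e.g. 20 cores on one cluster node
clusterEvalQ(cl, library(mobr))          # load required packages on each worker
clusterExport(cl, "run_one_iteration")   # ship the function to the workers
clusterSetRNGStream(cl, iseed = 123)     # reproducible parallel RNG streams

res_list <- parLapply(cl, seq_len(Niter), run_one_iteration)
stopCluster(cl)

results <- do.call(rbind, res_list)      # combine per-iteration results
```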
Wow, nice work, @rueuntal! Finally, the aggregation results have cleaned up very nicely! Why do you think it was so hard to get them to behave earlier? In fact, the only error rates that look slightly high now are for the SADs, but they are certainly acceptable. I am looking forward to the two-factor results, but it is actually this first pass that is the most important for a basic test. Well done!
Yes, more iterations would be better, but the results really are not going to change much, and I have published these kinds of benchmark test results with as few as 200 reps. The frequencies tend to be pretty stable, particularly when they are close to 1.0 or 0.0. We can safely begin writing and bank on these same patterns with a larger test run on the cluster.
@ngotelli - My guess for aggregation is that previously we were using a simulation with a specific level of aggregation as the reference group, while now we are using Poisson. As Brian suggested (correctly, based on our results), even if two communities are simulated using the same parameter for the Thomas process, they'd still have different levels of aggregation at some scales if they differ at all in N or SAD, which would then be captured by our test. On the other hand, a Poisson is always a Poisson no matter at which scale we look, and two Poisson communities always have the same level of aggregation (which is zero). Does that sound reasonable?
Don't worry too much about the SAD part; that's probably due to the limited # of iterations. I was concerned as well, so I reran the SAD tests (which was really fast compared to the other two), and the Type I error could range from 0 to about 10%. So hopefully it will stabilize when we bump up the iterations!
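As a rough sanity check on that spread (plain R, nothing from the script): with a true error rate of 5% and only 50 iterations, the observed rate is expected to bounce around quite a bit.

```r
## With a true Type I error rate of 5% and 50 iterations, the central 95% range
## of false-rejection counts is 0 to 6, i.e. observed rates of 0-12%.
qbinom(c(0.025, 0.975), size = 50, prob = 0.05)   # returns 0 and 6
## The exact 95% CI around an observed 2/50 (= 4%) is similarly wide:
binom.test(2, 50)$conf.int                        # roughly 0.005 to 0.14
```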
I'm also a bit concerned about the test for N, though I couldn't find anything wrong in the code. Does it worry you that it's essentially all or none?
@rueuntal - If I remember correctly, Luis Cayuela and I got some similar results for benchmark testing of methods for comparing rarefaction curves, so I think it may be OK. If you have time, you can try exploring data sets that have only slight differences between them.
@ngotelli - thanks, that's reassuring to hear! I tried a number of values where N is closer to the reference (850, 900, 950 vs 1000). The detection rate is no longer 100% when N = 900, and it's essentially zero when N = 950 (which looks like a good thing to me - we wouldn't expect these factors to have a detectable effect on S when the values are this close).
Below are the most up-to-date test results, which include scenarios where two or three factors change simultaneously. Our test is still holding up great! If you are all as happy as I am, we are probably ready to move on to the next step (writing & packaging up the code).
@rueuntal - Yes, these results look great. Error rates are low, power is pretty high, and there are no strong interactions among the terms that are causing unexpected errors. Certainly we are ready to analyze the test cases and write this up. Great work, everyone!
I'm closing this thread so that we can focus our discussion on the remaining steps to clean up and document the code now that our null models and bugs have been worked out. Great job @rueuntal and thanks for the help @ngotelli!