Closed andreaswallberg closed 2 months ago
Dear Andreas,
Model-fitting can indeed be tricky with DFE approaches, and some datasets are just not well fitted by the implemented models. I do not know of any miracle solution, but here are some things that I usually try, as well as some recommendations:
Hope this helps!
Julien.
Many thanks for the excellent reply. It seems a fair bit of common sense has to go into evaluating the results, then :-)
I will bump up the number of starts for all runs.
By model averaging, do you mean averaging across the seemingly well-behaved models, with or without a particular weight for each model?
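For what it's worth, one common weighting scheme for this kind of averaging is Akaike weights, which turn the per-model AIC scores into relative weights. A minimal sketch, assuming you have one AIC and one alpha estimate per model (the numbers below are made up, not real grapes output):

```python
import math

def akaike_weights(aics):
    """Convert a list of AIC scores into Akaike weights.

    Each weight is exp(-delta_i / 2) normalized over all models,
    where delta_i = AIC_i - min(AIC).
    """
    best = min(aics)
    rel = [math.exp(-0.5 * (a - best)) for a in aics]
    total = sum(rel)
    return [r / total for r in rel]

# Hypothetical per-model AIC scores and alpha estimates
aics = [1002.1, 1003.5, 1010.8]
alphas = [0.31, 0.27, -0.45]

weights = akaike_weights(aics)
alpha_avg = sum(w * a for w, a in zip(weights, alphas))
```

With this scheme, poorly fitting models (large Δ-AIC) get vanishingly small weights, so a badly behaved model contributes little to the averaged alpha even if it is not excluded outright.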
Dear @lgueguen & Co,
I am testing grapes on a set of seven species. I have divergence data and unfolded SNP allele frequency spectra across 4,000-6,000 genes per species, with 7 to 20 diploid individuals per focal species. The spectra mostly look as expected, with a large number of derived variants at low frequencies and a small bump at high frequencies. In many cases I also see a bump of derived variants near 50% frequency, but this could be due to rounding errors, as adjacent frequency classes appear somewhat depleted.
Some of these datasets behave just fine: the Neutral alpha is very reasonable, as are most of the model-based estimates, including those from the "best" models selected by AIC. In other cases, however, I observe errors of this kind during the run:
ERROR! ParameterException: ConstraintException: Parameter::setValue(9.21034)[ -9.21034; 9.21034] (posGmean)
Starting points where this happens appear to terminate early without a finished model likelihood, so the reported ML value is sometimes selected from only one or two completed runs.
Should models that sometimes terminate this way be excluded for that species? Is there any way I can parameterize the runs differently to avoid these issues?
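To make the last question concrete: one guard I have considered is to discard failed starts and require a minimum number of successfully completed optimizations before trusting the ML value at all. A sketch under the assumption that each start yields either a log-likelihood or None on early termination (this record layout is my own, not grapes output):

```python
def best_likelihood(starts, min_finished=3):
    """Return the best (maximum) log-likelihood among finished starts,
    or None if fewer than min_finished starts completed."""
    finished = [ll for ll in starts if ll is not None]
    if len(finished) < min_finished:
        return None  # too few successful starts to trust the ML value
    return max(finished)

# Hypothetical results from 6 starting points; None = early termination
starts = [-1523.4, None, -1520.9, None, -1521.3, -1520.9]
```

Here `best_likelihood(starts)` would return -1520.9, while a species where only one or two starts finished would be flagged rather than reported.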
Some species behave quite well, with many models returning values close to, though sometimes slightly above or below, the Neutral alpha (I have manually added the AIC column here; it is not part of the program output):
However, others behave very oddly, with negative alpha estimates:
The runs that return negative alphas, in particular, also seem to produce a lot of messages:
If one goes only by AIC, models with completely bonkers alpha statistics would sometimes be preferred.
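For reference, the AIC column I added is just computed from each model's log-likelihood and number of free parameters, and the Δ-AIC between models is what drives that preference. A minimal sketch (the numbers are invented):

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical log-likelihoods for two DFE models with 4 and 6 parameters
aic_a = aic(-1520.9, 4)  # 3049.8
aic_b = aic(-1519.2, 6)  # 3050.4
```

By this criterion model A would be (marginally) preferred, regardless of whether its alpha estimate is biologically sensible, which is exactly the problem above.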
Can I tweak the analyses to produce more robust results with respect to these issues? Will a folded spectrum generally be more reliable? Is the methodology optimized for rather small sample sizes? (It seems the species with smaller sample sizes never return negative alphas.)
Any suggestions?
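On the folded-spectrum question above: folding just sums complementary frequency classes, which removes ancestral-state (polarization) errors such as the suspicious bump of high-frequency derived variants. A minimal sketch, assuming `counts[i-1]` holds the number of SNPs with derived-allele count i = 1 .. n-1 in a sample of n chromosomes:

```python
def fold_sfs(counts):
    """Fold an unfolded SFS.

    counts: list of length n-1, where counts[i-1] is the number of SNPs
    with derived-allele count i in a sample of n chromosomes.
    Returns the folded SFS of length floor(n/2).
    """
    n = len(counts) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        j = n - i
        if i == j:  # middle class when n is even: do not double-count
            folded.append(counts[i - 1])
        else:
            folded.append(counts[i - 1] + counts[j - 1])
    return folded

# Hypothetical unfolded SFS for n = 6 chromosomes,
# with an excess in the highest derived-frequency class
sfs = [120, 55, 30, 18, 40]
fold_sfs(sfs)  # [160, 73, 30]
```

The cost, of course, is that folding discards the information that methods exploiting the unfolded spectrum use, so it trades power for robustness to mispolarization.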