BioPP / grapes

(Greetings_here_is_yet_another) Rate of Adaptive Protein Evolution Software
GNU General Public License v3.0
9 stars 1 forks source link

Some datasets and models give ParameterException errors or negative alpha values #8

Closed andreaswallberg closed 2 months ago

andreaswallberg commented 2 years ago

Dear @lgueguen & Co,

I am testing grapes on a set of seven species and have divergence data and SNP unfolded allele frequency spectra across 4,000-6,000 genes per species, with 7 to 20 diploid individuals per focal species. The spectra mostly look like expected with a large amount of derived variants at low frequencies and a little bump at high frequencies. In many cases, I also have a bump of derived variants at near 50% in frequency but this could be rounding errors as I see some depletion in the spectra in adjacent frequency classes.

Some of these datasets behave just fine and the Neutral alpha is very reasonable, as is most of the model-based ones including the "best" models selected by AIC. In other cases however, I observe errors during the run of this kind:

ERROR! ParameterException: ConstraintException: Parameter::setValue(9.21034)[ -9.21034; 9.21034] (posGmean)

Those starting points where this happens appear to terminate early with no finished model likelihood, leading to an ML value that has sometimes been selected from only one or two runs.

Should models that sometimes terminate this way be excluded for that species? Is there any way I can parameterize the runs differently to avoid these issues?

Some species behave quite well, with many models returning values close to but sometimes higher or lower than the Neutral alpha (I have a manually added AIC column here compared to the program output):

model | #param | AIC | lnL | alpha
-- | -- | -- | -- | --
Neutral | 17 | 1281.1732 | -623.5866 | 0.3277
GammaZero | 18 | 641.6858 | -302.8429 | 0.442
GammaExpo | 20 | 450.8098 | -205.4049 | 0.3157
GammaGamma | 21 | 465.2652 | -211.6326 | 0.292
DisplGamma | 19 | 643.6858 | -302.8429 | 0.442
ScaledBeta | 19 | 519.0958 | -240.5479 | 0.2447
FGMBesselK | 20 | 1890.8004 | -925.4002 | 0.5944
Saturated | 28 | 304.8302 | -124.4151 | NA

However, others behave very odd, with negative alpha:

model | #param |   | AICc | lnL | alpha
-- | -- | -- | -- | -- | --
Neutral | 43 |   | 2015.22333846154 | -1035.3809 | 0.2052
GammaZero | 44 |   | 3106.36773333333 | -1580.5172 | 0.3729
GammaExpo | 46 |   | 1425.73195172414 | -739.4177 | -0.0826
GammaGamma | 47 |   | 1270.7828 | -661.5914 | -0.3284
DisplGamma | 45 |   | 2636.65505714286 | -1345.2561 | 0.4775
ScaledBeta | 45 |   | 1172.85685714286 | -613.357 | -0.5038
FGMBesselK | 46 |   | 6076.62995172414 | -3064.8667 | 0.7054
Saturated | 80 |   | 645.112714285714 | -343.4135 | NA

It seems like those runs that return negative alphas in particular, also produce a lot of messages:

!!! Second order derivative is negative for parameter posGmean(0.972955). Moving in the other direction.
!!! Second order derivative is negative for parameter syno_orientation_error(-1.09861). Moving in the other direction.
!!! Function at new point is greater than at current point: 100348>8653.47. Applying Felsenstein-Churchill correction: 0
!!! Function at new point is greater than at current point: 17748.6>8653.47. Applying Felsenstein-Churchill correction: 1
!!! Second order derivative is negative for parameter posGshape(-7.3345). Moving in the other direction.
!!! Function at new point is greater than at current point: 7054.69>5574.52. Applying Felsenstein-Churchill correction: 0
!!! Second order derivative is negative for parameter posGmean(0.914797). Moving in the other direction.
!!! Function at new point is greater than at current point: 5789.82>4931.11. Applying Felsenstein-Churchill correction: 0
!!! Second order derivative is negative for parameter posGshape(-7.33364). Moving in the other direction.
!!! Second order derivative is negative for parameter syno_orientation_error(1.32771). Moving in the other direction.
!!! Second order derivative is negative for parameter posGmean(0.938922). Moving in the other direction.
!!! Function at new point is greater than at current point: 4675.6>4158.85. Applying Felsenstein-Churchill correction: 0
!!! Second order derivative is negative for parameter posGshape(-7.0803). Moving in the other direction.
!!! Second order derivative is negative for parameter posGmean(0.785971). Moving in the other direction.
!!! Function at new point is greater than at current point: 3777.78>3634.32. Applying Felsenstein-Churchill correction: 0
!!! Second order derivative is negative for parameter posGshape(-6.80442). Moving in the other direction.
!!! Function at new point is greater than at current point: 15799.1>3367.39. Applying Felsenstein-Churchill correction: 0
!!! Function at new point is greater than at current point: 13615.8>3367.39. Applying Felsenstein-Churchill correction: 1
!!! Function at new point is greater than at current point: 12232>3367.39. Applying Felsenstein-Churchill correction: 2
!!! Felsenstein-Churchill correction applied too many times.
Use conjugate gradients optimization.

If one goes only by AIC, models that can have completely bonkers alpha statistics would sometimes be preferred.

Can I tweak the analyses to produce more robust results with regards to these types of issues? Will a folded spectrum usually be more reliable? Is the methodology optimized for rather small sample sizes (it seems to be the case that the species with smaller sample sizes never return negative alpha)?

Any suggestions?

jydu commented 2 years ago

Dear Andreas,

Model-fitting can indeed be tricky with DFE approaches, and some datasets are just not well fitted by the implemented models. I do not know of any miracle solution, but here are some things that I usually try, as well as some recommendations:

Hope this helps!

Julien.

andreaswallberg commented 2 years ago

Many thanks for the excellent reply. It seems like a bit of common sense has to go into the evaluation of the results then indeed :-)

I will bump up the number of starts for all runs.

By model averaging, are you recommending averaging across the seemingly well-behaved models with or without any particular weight to each model?