NickleDave / songdkl

automated quantitation of vocal learning in the songbird
BSD 3-Clause "New" or "Revised" License

Understand differences returned by GMM across sklearn versions #46

Open NickleDave opened 2 years ago

NickleDave commented 2 years ago

As discussed in #29.

NickleDave commented 2 years ago

Some input from @dgmets:

With regard to the differences in how the GMMs are fit between the older and newer versions of sklearn: the easiest illustration is the tutorial on GMM fitting, which exists for each of the two methods.

The tutorial for the original method is here: https://scikit-learn.sourceforge.net/dev/auto_examples/mixture/plot_gmm_selection.html, and the same tutorial using the new method is here: https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html#sphx-glr-auto-examples-mixture-plot-gmm-selection-py
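
For anyone comparing the two side by side: around scikit-learn 0.18 the mixture module was rewritten, replacing the old GMM class with GaussianMixture, and the EM implementation and its defaults (e.g., how covariances are regularized) changed along the way. A minimal sketch of the two call patterns on toy data; the old call only works under sklearn <= 0.19, so it is commented out:

```python
import numpy as np

# toy data standing in for syllable features; not songdkl output
X = np.random.default_rng(0).normal(size=(500, 2))

# sklearn <= 0.18: the old class (deprecated in 0.18, removed later)
# from sklearn.mixture import GMM
# gmm = GMM(n_components=3, covariance_type='full').fit(X)
# print(gmm.bic(X))

# current sklearn: the replacement class
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3, covariance_type='full').fit(X)
print(gmm.bic(X))
```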

There are a few things to note. First, the variances of the components (shown as the filled ovals) differ between the two methods; in some cases they are larger, and in others smaller. Second, depending on the covariance estimation approach (e.g., spherical, tied, full), the number of components estimated by the minimum BIC can differ between the two methods. This suggests that the differences you have been seeing in the estimated number of syllables, and the differences in self-self similarity, may depend on differences in how each method estimates the variance components of the GMMs. I haven't run through any tests using the new version of the fitting algorithm to look at how sampling (e.g., the number of syllables in the data set) impacts the estimates.
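
To make the comparison concrete, this is roughly the model-selection loop both tutorials run: fit over a grid of covariance types and component counts, and keep the minimum-BIC fit. A sketch using the current GaussianMixture API on toy data (the cluster means and grid ranges here are made up for illustration); re-running the same loop under the old GMM class is what surfaces the discrepancies described above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# toy data: two well-separated clusters standing in for syllable features
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(5.0, 1.0, size=(200, 2))])

lowest_bic = np.inf
best = None
for cov_type in ("spherical", "tied", "diag", "full"):
    for n_components in range(1, 7):
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type=cov_type,
                              random_state=0).fit(X)
        bic = gmm.bic(X)
        if bic < lowest_bic:
            lowest_bic = bic
            best = (cov_type, n_components)

print(best)  # the (covariance_type, n_components) pair with minimum BIC
```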

NickleDave commented 1 year ago

I realized that I failed to capture how I generated the conda env at the command line.

A better way to do this would probably be to hand-write an environment.yml, or, even better (I think?), to use something like conda-lock.

But for now:

mamba create -n songdkl-pcb-new python=2.7 "mahotas<=1.4.9" matplotlib "scikit-learn<=0.18" scipy -c conda-forge -c defaults
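
For reference, a hand-written environment.yml equivalent to that command might look like the sketch below (the pins just mirror the command above; I haven't verified it solves):

```yaml
name: songdkl-pcb-new
channels:
  - conda-forge
  - defaults
dependencies:
  - python=2.7
  - mahotas<=1.4.9
  - matplotlib
  - scikit-learn<=0.18
  - scipy
```

conda-lock could then consume a file like this to produce fully pinned, per-platform lockfiles.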