dsy109 / mixtools

Tools for Analyzing Finite Mixture Models

Mixture model converges to wrong result (randomly) #7

Open hberger opened 1 year ago

hberger commented 1 year ago

Hi, I have been trying to model an approximately Gaussian mixture (image data: background + foreground intensities) using normalmixEM with 2 or 3 components.

While this generally works, I quite often get results that make no sense given the input data distribution. In particular, the input distribution is clearly bimodal with a roughly 30/70 split in proportions, but the fitted components match neither the modes, nor does their sum fit the overall distribution well.

For certain data sets this happens about 1 in 5 runs for k=2 and about 1 in 100-200 runs for k=3. I can reproduce the behavior with specific random seeds for both k=2 and k=3. It does not seem to depend much on the data itself: removing data points, subsampling from 100k to 1k points, or trimming the long right tail changes nothing. I also tried varying the initial lambda values, without much effect.

Tested in mixtools 1.2.0 on R 4.1.2 (openblas) and mixtools 2.0 on R 4.2.2 (standard R BLAS), both on Linux.

To reproduce (example data have been attached below):

library(mixtools)

#d = readRDS("100k_vector_for_normalmixEM.rds")
# 1k subsample from 100k data
d = readRDS("1k_vector_for_normalmixEM.rds")

hist(d, 100)

# k=2; problem occurs about 1 every 5 runs without setting random seed
set.seed(1238)
mixmdl = normalmixEM(d, k=2)
summary(mixmdl)
plot(mixmdl, which=2)

# k=3; much less frequent without setting random seed
set.seed(10319)
mixmdl = normalmixEM(d, k=3)
summary(mixmdl)
plot(mixmdl, which=2)
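For readers without the attached RDS files, a similar data set can be simulated. The parameters below are taken from the 'good' 100k run reported further down (30/70 split, means around 0.0069 and 0.0151); the seed and sample size are arbitrary, so this is only a stand-in, not the original data.

```r
# Synthetic stand-in for the attached data (assumption: component
# parameters copied from the 'good' 100k run reported in this issue).
set.seed(42)
n <- 1000
z <- rbinom(n, 1, 0.71)  # ~30/70 component split
d <- ifelse(z == 1,
            rnorm(n, mean = 0.0151, sd = 0.0037),
            rnorm(n, mean = 0.0069, sd = 0.0012))
hist(d, 100)
```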

Results from the 100k data set (1k looks similar) for k=2:

summary of normalmixEM object:
          comp 1     comp 2
lambda 0.8223187 0.17768126
mu     0.0119736 0.01614824
sigma  0.0050708 0.00156856
loglik at estimate:  391868.6 

[image: density plot of the mis-fitted k=2 mixture]

Output from a 'good' run (100k, k=2):

summary of normalmixEM object:
           comp 1     comp 2
lambda 0.29123448 0.70876552
mu     0.00687830 0.01511378
sigma  0.00119344 0.00370195
loglik at estimate:  402918.3 
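Note that the bad run's log-likelihood (391868.6) is clearly below the good run's (402918.3), which suggests EM is converging to a local maximum from an unlucky random start. One simple workaround is to refit from several random starts and keep the best solution. `best_normalmixEM` below is a hypothetical helper, not part of mixtools, and assumes `d` is loaded as in the reproduction script.

```r
library(mixtools)

# Hypothetical helper (not part of mixtools): refit from several random
# starts and keep the solution with the highest log-likelihood.
best_normalmixEM <- function(x, k = 2, n_restarts = 10, ...) {
  fits <- lapply(seq_len(n_restarts), function(i) normalmixEM(x, k = k, ...))
  fits[[which.max(vapply(fits, function(f) f$loglik, numeric(1)))]]
}

mixmdl <- best_normalmixEM(d, k = 2)
summary(mixmdl)
```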

example_data.zip

hberger commented 1 year ago

Just an additional comment: I can fix the problem by using custom initial values derived from k-means on the data:

km <- kmeans(d, 2)
mu_km <- km$centers[, 1]            # cluster means as starting mu
sd_km <- tapply(d, km$cluster, sd)  # per-cluster sd as starting sigma
mixmdl = normalmixEM(d, k=2, fast=TRUE, mu = mu_km, sigma = sd_km)

This gives stable estimates of the correct population parameters.
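The k-means initialization above can be extended to also supply the mixing proportions, so that all three starting values come from the same clustering. This is a sketch under the same assumptions as the snippet above (`d` loaded from the attached data); note that `tapply` and `table` return results in cluster order 1, 2, matching the rows of `km$centers`, so the starting values stay aligned per component.

```r
# Sketch (assumes d is loaded as above): derive all starting values,
# including the mixing proportions lambda, from the same k-means fit.
km <- kmeans(d, 2)
mu_km     <- as.vector(km$centers)              # cluster means (order 1, 2)
sd_km     <- tapply(d, km$cluster, sd)          # same cluster order as centers
lambda_km <- as.vector(table(km$cluster)) / length(d)
mixmdl <- normalmixEM(d, k = 2, lambda = lambda_km, mu = mu_km, sigma = sd_km)
```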