Goodness of Fit test suggests model overfit

casparbein commented 2 years ago

Hello Daniel,

I have used the dadi pipeline for many different projects in the past two years, it simplifies the workflow considerably. Thank you for developing and maintaining such a useful tool!

In a last step to finish up one of my analyses, I performed GOF for a rather simple 2 populations assymetric migration model. I have 24 haplotypes in each population, and in order to estimate the best model, I downprojected them to [10,16]. Now, when I run the GOF, my empirical likelihood sits outside of the simulated/fitted likelihoods histogram, but on the right site, meaning that none of the simulations and subsequent fittings reached a likelihood as "good" as the empirical. I read on the user page that this phenomenon ("This prevents odd behavior, such as when the empirical data fit better than the simulated data.") was observed before an update that I indeed did not use for the GOF. However, after downloading and running the updated scripts, I still get an overfit.

Do you have any idea if this is related to my code (see below)? Or if this happens sometimes when one model just fits exceptionally well ?

dd = dadi.Misc.make_data_dict_vcf("2pops.vcf", "popfile_txt")

# create folded SFS
fs = dadi.Spectrum.from_data_dict(dd, ["pop1", "pop2"], projections=[10, 16], polarized=False)

## max projections
## Here, since the model was fit to a downprojected SFS, I used the same numbers as in projections above.
max_proj = [10,16]

#Convert this dictionary into folded AFS object based on
#MAXIMUM projection sizes
#[polarized = False] creates folded spectrum object
max_fs = dadi.Spectrum.from_data_dict(dd, pop_ids=["pop1", "pop2"], projections = max_proj, polarized = False)

## grid size
pts = [50,60,70]

## model
asym_mig_func = Models_2D.asym_mig

##best fitting parameters
emp_params = [12.8641 ,1.6077,1.5714, 2.0899, 5.3618] ## asym_mig

## folded
fs_folded = True

## optimize empirical
fs_for_sims = Optimize_Functions_GOF.Get_Empirical(fs, max_fs, pts, "Empirical", "asym_mig", asym_mig_func, emp_params, [10,16], fs_folded=fs_folded)

## simulations

#number of simulations
sims = 100

#**************
#number of parameters
pnum = 5

#**************
#number of rounds
rounds = 4

#number of reps
reps = [15,10,5,5]

#number of maximum iterations per round
maxiters = [5,5,10,10]

#parameter fold
folds = [3,2,1,1]

## labels
p_labels = "nu1, nu2, m12, m21, T"

Optimize_Functions_GOF.Perform_Sims(sims, fs_for_sims, pts, "asym_mig", asym_mig_func, rounds, pnum, fs_folded=fs_folded,
 reps=reps, maxiters=maxiters, folds=folds, param_labels=p_labels)

Thank you in advance!

Cheers, Bernhard

dportik commented 2 years ago

Hi Bernhard, I am glad you find the workflows useful!

It looks like you might not have used the actual maximum projection sizes. If you have 12 individuals (=24 alleles) in each group, you would actually want to use:

## max projections
## Here, since the model was fit to a downprojected SFS, I used the same numbers as in projections above.
max_proj = [24,24]

Using the same values ([10,16]) to generate your fs and max_fs would actually cause the "odd" behavior that occurred before the bug fix. Please try this using the correct maximum sizes and let me know if the problem persists.

casparbein commented 2 years ago

Thanks for your quick reply! I think I found the problem. I manually updated the scripts by copying from your github page and had forgotten to update the Perform_Sims function, resulting in odd GOF histograms. After updating all scripts fully, it now seems to work!

Cheers, Bernhard

dportik / dadi_pipeline

Goodness of Fit test suggests model overfit #16