ctlab / GADMA

Genetic Algorithm for Demographic Model Analysis
Other
47 stars 14 forks source link

Error when trying to plot from best_logLL_model_moments_code.py #5

Closed Mills33 closed 4 years ago

Mills33 commented 4 years ago

Using python3 and the command python3 best_logLL_model_moments_code.py I get the following error:

'File " best_logLL_model_moments_code.py", line 7 def generated_model((nu21, nu31, nu41, nu51, t1, t2, t3, t4), ns): ^ SyntaxError: invalid syntax

File contents:

#current best params = [4286.26466398839, 428626.4663984032, 2925.898563085724, 30281.262262553322, 60.27017335837271, 11270.983729593085, 4578.233459954076, 402.4881470236583, 62.209760218362085]
import matplotlib
matplotlib.use("Agg")
import moments
import numpy as np

def generated_model((nu21, nu31, nu41, nu51, t1, t2, t3, t4), ns):
    theta1 = 1
    sts = moments.LinearSystem_1D.steady_state_1D(sum(ns), theta=theta1)
    fs = moments.Spectrum(sts)

    T = t1
    after = [nu21]
    growth_funcs = [lambda t: after[0]]
    list_growth_funcs = lambda t: [ f(t) for f in growth_funcs]
    fs.integrate(Npop=list_growth_funcs, tf=T, dt_fac=0.1, theta=theta1)

    before = after
    T = t2
    after = [nu31]
    growth_funcs = [lambda t: before[0] * (after[0] / before[0]) ** (t / T)]
    list_growth_funcs = lambda t: [ f(t) for f in growth_funcs]
    fs.integrate(Npop=list_growth_funcs, tf=T, dt_fac=0.1, theta=theta1)

    before = after
    T = t3
    after = [nu41]
    growth_funcs = [lambda t: before[0] + (after[0] - before[0]) * (t / T)]
    list_growth_funcs = lambda t: [ f(t) for f in growth_funcs]
    fs.integrate(Npop=list_growth_funcs, tf=T, dt_fac=0.1, theta=theta1)

    before = after
    T = t4
    after = [nu51]
    growth_funcs = [lambda t: before[0] * (after[0] / before[0]) ** (t / T)]
    list_growth_funcs = lambda t: [ f(t) for f in growth_funcs]
    fs.integrate(Npop=list_growth_funcs, tf=T, dt_fac=0.1, theta=theta1)

    return fs
data = moments.Spectrum.from_file('/Users/ryanc/Documents/Pink_pigeon/Population_genetics/Demography/GADMA/data/output_MK25_25prune/dadi/pop1-75.sfs')

popt = [99.99999999989834, 0.6826220013122478, 7.064720598558761, 0.014061234684068955, 2.6295585114675073, 1.068117304658926, 0.09390184194765547, 0.014513746839066071]
ns = [75]
model = generated_model(popt, ns)
N_A = 4286.264664
ll_model = moments.Inference.ll(N_A * model, data)
print('Model log likelihood (LL(model, data)): {0}'.format(ll_model))
print('Drawing model to model_from_GADMA.png')
model = moments.ModelPlot.generate_model(generated_model, popt, ns)
moments.ModelPlot.plot_model(model, 
    save_file='model_from_GADMA.png',
    fig_title='Demographic model from GADMA',
    pop_labels=['pop1'],
    nref=None,
    draw_scale=False,
    gen_time=5.6,
    gen_time_units='Years',
    reverse_timeline=True)
noscode commented 4 years ago

Solved in commit 57316aa.

samarth8392 commented 4 years ago

Hi, I was running into the same issue as the OP, but then I saw this issue and updated the software. But now, it's giving me a different error.

Traceback (most recent call last): File "gadma.R3.m4.moments.py", line 37, in model = generated_model(popt, ns) File "gadma.R3.m4.moments.py", line 17, in generated_model fs.integrate(Npop=list_growth_funcs, tf=T, dt_fac=0.1, theta=theta1) File "/home/mathur20/.conda/envs/cent7/5.3.1-py37/gadma/lib/python3.7/site-packages/moments-1.0.2-py3.7.egg/moments/Spectrum_mod.py", line 1588, in integrate File "/home/mathur20/.conda/envs/cent7/5.3.1-py37/gadma/lib/python3.7/site-packages/moments-1.0.2-py3.7.egg/moments/Integration_nomig.py", line 627, in integrate_neutral File "gadma.R3.m4.moments.py", line 16, in list_growth_funcs = lambda t: [ f(t) for f in growth_funcs] File "gadma.R3.m4.moments.py", line 16, in list_growth_funcs = lambda t: [ f(t) for f in growth_funcs] File "gadma.R3.m4.moments.py", line 15, in growth_funcs = [lambda t: before[0] + (after[0] - before[0]) * (t / T)] NameError: free variable 'before' referenced before assignment in enclosing scope

I think the error is because the code file generated has a growth function: growth_funcs = [lambda t: before[0] + (after[0] - before[0]) * (t / T)]

but the before = after code comes after.

The moments code is:

`#current best params = [8.354928346378932, 0.88575956698093, 0.23007835953494235, 0.4805958920229654, 0.3502832658458346, 0.06304853182036993, 19.50065332970754, 7.965459940348304]
import matplotlib
matplotlib.use("Agg")
import moments
import numpy as np

def generated_model(params, ns):
    nu21, s0, nu31, nu32, t1, t2, m2_12, m2_21 = params
    theta1 = 0.2838653
    sts = moments.LinearSystem_1D.steady_state_1D(sum(ns), theta=theta1)
    fs = moments.Spectrum(sts)

    T = t1
    after = [nu21]
    growth_funcs = [lambda t: before[0] + (after[0] - before[0]) * (t / T)]
    list_growth_funcs = lambda t: [ f(t) for f in growth_funcs]
    fs.integrate(Npop=list_growth_funcs, tf=T, dt_fac=0.1, theta=theta1)

    before = after
    fs = moments.Manips.split_1D_to_2D(fs, ns[0], sum(ns[1:]))

    before.append((1 - s0) * before[-1])
    before[-2] *= s0
    T = t2
    after = [nu31, nu32]
    growth_funcs = [lambda t: before[0] * (after[0] / before[0]) ** (t / T), lambda t: after[1]]
    list_growth_funcs = lambda t: [ f(t) for f in growth_funcs]
    m = np.array([[0, m2_12],[m2_21, 0]])
    fs.integrate(Npop=list_growth_funcs, tf=T, m=m, dt_fac=0.1, theta=theta1)

    return fs
data = moments.Spectrum.from_file('/scratch/snyder/m/mathur20/MQU/ch3/demography/gadma/only_syn/2pop/sfs/dadi/TX-AZ.sfs')
data = data.project([25, 25])

popt = [8.354928346378932, 0.88575956698093, 0.23007835953494235, 0.4805958920229654, 0.3502832658458346, 0.06304853182036993, 19.50065332970754, 7.965459940348304]
ns = [25, 25]
model = generated_model(popt, ns)
N_A = 115793.640440
ll_model = moments.Inference.ll(N_A * model, data)
print('Model log likelihood (LL(model, data)): {0}'.format(ll_model))
theta0 = 0.2838653
print('Drawing model to model_from_GADMA.png')
model = moments.ModelPlot.generate_model(generated_model, popt, ns)
moments.ModelPlot.plot_model(model, 
    save_file='model_from_GADMA.png',
    fig_title='Demographic model from GADMA',
    pop_labels=['TX', 'AZ'],
    nref=int(theta / theta0),
    draw_scale=True,
    gen_time=0.001,
    gen_time_units='Thousand years',
    reverse_timeline=True)
try:
    import gadma
    print ("Calculation of CLAIC (time consuming, enter Ctrl+C to stop)")
    all_boot = gadma.Inference.load_bootstrap_data_from_dir("/scratch/snyder/m/mathur20/MQU/ch3/demography/gadma/final/bootsfs")
    for eps in [1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8]:
        try:
            claic_score = gadma.Inference.get_claic_score(func_ex, all_boot, popt, data, pts=None, eps=eps)
        except Exception as e:
            print("Error for eps = {0:.1e} : ".format(eps) + str(e))
        claic_score = None
    print("Model Composite Likelihood AIC score (CLAIC(p0, eps={0:.1e})): {1}".format(eps, claic_score))
except ImportError:
    print("Install GADMA to calculate CLAIC score")`

Please let me know if this is a bug or something that I can just edit on my end in the code to make it work.

Thank you for your time and writing such a brilliant software :)

noscode commented 4 years ago

Dear @samarth8392, thank you for your interest to GADMA!

It is bug in code generation of GADMA. You are right about the error. I will check this out and try to fix generation, but now I suggest to fix it by yourself adding before = [1.0] right before the first block:

        ...
        fs = moments.Spectrum(sts)

        before = [1.0]
        T = t1
    after = [nu21]
        ...

I hope that will help!

Best regards, Ekaterina

samarth8392 commented 4 years ago

Hi Ekaterina, Thank you for the quick response. I added the before statement and tried running it again, but now the error is:

Calculation of CLAIC (time consuming, enter Ctrl+C to stop)
If you use the Godambe methods in your published research, please cite Coffman et al. (2016) in addition to the main dadi paper Gutenkunst et al. (2009).
AJ Coffman, P Hsieh, S Gravel, RN Gutenkunst "Computationally efficient composite likelihood statistics for demographic inference" Molecular Biology and Evolution 33:591-593 (2016)
Traceback (most recent call last):
  File "final_model6_moments_code.py", line 58, in <module>
    claic_score = gadma.Inference.get_claic_score(generated_model, all_boot, popt, data, pts=None, eps=1e-15)
  File "/home/mathur20/.conda/envs/cent7/5.3.1-py37/gadma/lib/python3.7/site-packages/gadma-1.0.1-py3.7.egg/gadma/Inference.py", line 117, in get_claic_score
    return 2 * get_claic_component(func_ex, all_boot, p0, data, pts=pts, eps=eps) - 2 * ll_model
  File "/home/mathur20/.conda/envs/cent7/5.3.1-py37/gadma/lib/python3.7/site-packages/gadma-1.0.1-py3.7.egg/gadma/Inference.py", line 86, in get_claic_component
    H_inv = np.linalg.inv(H)
  File "<__array_function__ internals>", line 6, in inv
  File "/home/mathur20/.local/lib/python3.7/site-packages/numpy/linalg/linalg.py", line 551, in inv
    ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
  File "/home/mathur20/.local/lib/python3.7/site-packages/numpy/linalg/linalg.py", line 97, in _raise_linalgerror_singular
    raise LinAlgError("Singular matrix")
numpy.linalg.LinAlgError: Singular matrix

I am not sure if I understand "singular matrix error"

My moments script is:

#current best params = [16.01065952404785, 0.7305758264654852, 0.2462289184792431, 0.46875443298942726, 0.2984985370324451, 0.12532423399443587, 19.521916397305933, 6.177478938766556]
import matplotlib
matplotlib.use("Agg")
import moments
import numpy as np

def generated_model(params, ns):
    nu21, s0, nu31, nu32, t1, t2, m2_12, m2_21 = params
    theta1 = 0.2838653
    sts = moments.LinearSystem_1D.steady_state_1D(sum(ns), theta=theta1)
    fs = moments.Spectrum(sts)

    before = [1.0]
    T = t1
    after = [nu21]
    growth_funcs = [lambda t: after[0]]
    list_growth_funcs = lambda t: [ f(t) for f in growth_funcs]
    fs.integrate(Npop=list_growth_funcs, tf=T, dt_fac=0.1, theta=theta1)

    before = after
    fs = moments.Manips.split_1D_to_2D(fs, ns[0], sum(ns[1:]))

    before.append((1 - s0) * before[-1])
    before[-2] *= s0
    T = t2
    after = [nu31, nu32]
    growth_funcs = [lambda t: after[0], lambda t: before[1] * (after[1] / before[1]) ** (t / T)]
    list_growth_funcs = lambda t: [ f(t) for f in growth_funcs]
    m = np.array([[0, m2_12],[m2_21, 0]])
    fs.integrate(Npop=list_growth_funcs, tf=T, m=m, dt_fac=0.1, theta=theta1)

    return fs
data = moments.Spectrum.from_file('/scratch/snyder/m/mathur20/MQU/ch3/demography/gadma/only_syn/2pop/sfs/dadi/TX-AZ.sfs')
data = data.project([25, 25])

popt = [16.01065952404785, 0.7305758264654852, 0.2462289184792431, 0.46875443298942726, 0.2984985370324451, 0.12532423399443587, 19.521916397305933, 6.177478938766556]
ns = [25, 25]
model = generated_model(popt, ns)
N_A = 114402.034028
ll_model = moments.Inference.ll(N_A * model, data)
print('Model log likelihood (LL(model, data)): {0}'.format(ll_model))
theta0 = 0.2838653
#print('Drawing model to model_from_GADMA.png')
#model = moments.ModelPlot.generate_model(generated_model, popt, ns)
#moments.ModelPlot.plot_model(model, 
#   save_file='model_from_GADMA.png',
#   fig_title='Demographic model from GADMA',
#   pop_labels=['TX', 'AZ'],
#   nref=int(theta / theta0),
#   draw_scale=True,
#   gen_time=0.001,
#   gen_time_units='Thousand years',
#   reverse_timeline=True)
try:
    import gadma
    print ("Calculation of CLAIC (time consuming, enter Ctrl+C to stop)")
    all_boot = gadma.Inference.load_bootstrap_data_from_dir("/scratch/snyder/m/mathur20/MQU/ch3/demography/gadma/final/bootsfs")
    claic_score = gadma.Inference.get_claic_score(generated_model, all_boot, popt, data, pts=None, eps=1e-15)
    print("Model Composite Likelihood AIC score (CLAIC(p0, eps={0:.1e})): {1}".format(eps, claic_score))
    #for eps in [1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8]:
    #   try:
    #       claic_score = gadma.Inference.get_claic_score(func_ex, all_boot, popt, data, pts=None, eps=eps)
    #   except Exception as e:
    #       print("Error for eps = {0:.1e} : ".format(eps) + str(e))
    #   claic_score = None
    #   print("Model Composite Likelihood AIC score (CLAIC(p0, eps={0:.1e})): {1}".format(eps, claic_score))
except ImportError:
    print("Install GADMA to calculate CLAIC score")
noscode commented 4 years ago

Dear @samarth8392,

It is numerical error of CLAIC evaluation with moments. I hope that changing eps argument will solve this issue. You could just change eps=1e-15 in function gadma.Inference.get_claic_score to some other value. Or you can loop over several values of eps and try them all.

For the second variant you should uncomment commented code in the file you have attached:

        #for eps in [1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8]:
    #   try:
    #       claic_score = gadma.Inference.get_claic_score(func_ex, all_boot, popt, data, pts=None, eps=eps)
    #   except Exception as e:
    #       print("Error for eps = {0:.1e} : ".format(eps) + str(e))
    #   claic_score = None
    #   print("Model Composite Likelihood AIC score (CLAIC(p0, eps={0:.1e})): {1}".format(eps, claic_score))

and comment out or remove two lines above this block.

The smaller eps is the better value of CLAIC is but this numerical issues sometimes spoil everything. Have you asked GADMA to evaluate CLAIC during the run? If so it usually prints eps that was used for the calculation, was it 1e-15? Please keep me updated.

Ekaterina

samarth8392 commented 4 years ago

Hi Ekaterina, I chose a single value because I wasn't sure how long it would take, but apparently I chose the worst value. I ran the code in a for loop and I now got an output

Model log likelihood (LL(model, data)): -3058.901747281811
Calculation of CLAIC (time consuming, enter Ctrl+C to stop)
If you use the Godambe methods in your published research, please cite Coffman et al. (2016) in addition to the main dadi paper Gutenkunst et al. (2009).
AJ Coffman, P Hsieh, S Gravel, RN Gutenkunst "Computationally efficient composite likelihood statistics for demographic inference" Molecular Biology and Evolution 33:591-593 (2016)
Model Composite Likelihood AIC score (CLAIC(p0, eps=1.0e-02)): 607922.7364217453
Model Composite Likelihood AIC score (CLAIC(p0, eps=1.0e-03)): 607688.7841463592
Model Composite Likelihood AIC score (CLAIC(p0, eps=1.0e-04)): 607684.8027957648
Model Composite Likelihood AIC score (CLAIC(p0, eps=1.0e-05)): 607612.6622176758
Model Composite Likelihood AIC score (CLAIC(p0, eps=1.0e-06)): 586690.9280410497
Model Composite Likelihood AIC score (CLAIC(p0, eps=1.0e-07)): 353854.05379018356
Model Composite Likelihood AIC score (CLAIC(p0, eps=1.0e-08)): -15222.039192122913

I am amazed how fast it was. I am not sure about negative values, but I am glad I have an output. My next goal is to get confidence intervals using gadma-run_ls_on_boot_data. Any thoughts about running it?

Again, I am so thankful to this software and your time.

Cheers, Samarth

noscode commented 4 years ago

Dear Samarth,

Cool! I think the reliable value is the following: Model Composite Likelihood AIC score (CLAIC(p0, eps=1.0e-05)): 607612.6622176758

To run gadma-run_ls_on_boot_data comment out all except the generated_model or copy it in the separate file. Also the parameters file for those scripts are a bit different to those from GADMA - they should be valid python code with the following variables: p0, lower_bound, upper_bound, par_labels, pts (only for dadi, not your case).

So it should be something like that:

...
par_labels = ['nu21', 's0', 'nu31', 'nu32', 't1', 't2', 'm2_12', 'm2_21']
...

p0 is the initial point - the result vector of GADMA (popt from your file). You could also check the usage of scripts by calling it with --help flag. I am not sure but may be lower and upper bound could be automatically generated from par_labels.

Best regards, Ekaterina

samarth8392 commented 4 years ago

Hi Ekaterina, Thanks for the advice. I ran thegadma-run_ls_on_boot_data There were a few bugs in the source code that I edited:

# ERRORS:
#File "/home/mathur20/.conda/envs/cent7/5.3.1-py37/gadma/lib/python3.7/site-packages/gadma-1.0.1-py3.7.egg/gadma/run_ls_on_boot_data.py", line 87
#   print str_after_bracket

#File "/home/mathur20/.conda/envs/cent7/5.3.1-py37/gadma/lib/python3.7/site-packages/gadma-1.0.1-py3.7.egg/gadma/run_ls_on_boot_data.py", line 158
#    except Exception, e:

#File "/home/mathur20/.conda/envs/cent7/5.3.1-py37/gadma/lib/python3.7/site-packages/gadma-1.0.1-py3.7.egg/gadma/run_ls_on_boot_data.py", line 237
#    except Exception, e:  

But after I run, I found all my values for bootstrap to be exactly the same. i.e. the CI were also point estimates without any range.

Do you know why would that be the case? I am wondering if I need to change anything? I have 28, and 31 samples for the two pop and I projected them in 25x25 sizes. Also, I am wondering if it is because I am using the same SNP data set. You mentioned in the manual that for CLAIC and CI, we need unlinked markers. If by using the same data for initial modelling and CI, I am making a mistake, let me know. The only workaround I think I can do is by thinning my SNPs to be far enough that they become practically unlinked.

Let me know your thoughts.

Cheers, Samarth

noscode commented 4 years ago

Dear Samarth,

I have updated the version of run_ls_on_boot_data script. I am not sure about your bootstrap data. Have you perform bootstrap on your SNP's? So every bootstrapped data is different to the original spectrum for inference, correct? Also have you downsized each of the bootstrapped data before running the script? Because script does not know about what size data has and uses just maximum size. At last maybe change of optimization for gadma-run_ls_on_boot_data will solve the situation that all runs have the same result values. It is strange behavior indeed, I hope we will find out what is wrong.

Best regards, Ekaterina

samarth8392 commented 4 years ago

Hi Ekaterina, Sorry for not responding earlier. I realized my bootstrapping was incorrect. You guessed it correctly, I was giving it the same SFS (and expecting different output). I downsampled my SNPs (using vcf thinning in vcftools) and reran it, and it gave me a CI range. Again, thank you for your immediate responses. Cannot thank you enough.

Cheers, Samarth