ashander / ftprime

Forward-time simulation of the msprime data structure (for development)
2 stars 1 forks source link

Complexity of examples #52

Open ashander opened 6 years ago

ashander commented 6 years ago

Ideally we'll have a few stand-alone example scripts (eg as contributed by @hyanwong in #48 ) but these will:

  1. not be too complex involving lots of library-ish code and new objects or functions to interpret
  2. not depend on ad hoc statistical code outside of msprime
ashander commented 6 years ago

To answer question in https://github.com/ashander/ftprime/pull/48#issuecomment-340484513 @hyanwong it'd be nice to have examples that are relatively simple to read. I realize that your example grew out of an ongoing project though.

The current state is great (thanks for finding lots of bugs/issues!) but adding a lot more code complexity to the PR /branch would make things hard to understand as an ftprime example.

hyanwong commented 6 years ago

Yes, I agree about making them easy to read / simpler & not relying on external code. I feel a bit guilty about bloating the code. However, I do think it is worth having examples that can be called either as a python routine in their own right or as a main() routine using argparse. This needn't add much to the complexity if it is done right, I suspect. FWIW, I think I've done it in too hacky a manner in selective_sweep.py. Also, as an aside, I think it's also worth using import logging and logging.basicConfig(...) rather than your own code for log files (easier to turn this on and off)

ashander commented 6 years ago

Cool. Those are good comments and I agree separating code from main() is nice style. It's all a balancing act.

hyanwong commented 6 years ago

Code to allow SFS-like testing of example files is at https://github.com/hyanwong/ftprime/commit/40d01ae84484dc894e4508de2e8d43759328ce64

Since it updates the SFS incrementally, you will need to pass it a function to calculate various statistics iteratively. I use a hacked together script which could probably do with some tidying, something like (untested)

from SFS import incrementalSFS

def H(n):
    """Returns an approximate value of n-th harmonic number.

       http://en.wikipedia.org/wiki/Harmonic_number
    """
    # Euler-Mascheroni constant
    gamma = 0.57721566490153286060651209008240243104215933593992
    return gamma + math.log(n) + 0.5/n - 1./(12*n**2) + 1./(120*n**4)

def Theta_W(sfs, n):
    """
    Calculate the (mutationless) Watterson estimator
    """
    return sum(sfs.values()) / H(n)

def Theta_pi(sfs, n):
    """
    Calculate the (mutationless) nucleotide diversity
    2 sum(i(n−i)*ξi) / n / (n-1)
    """
    return 2 * sum([i*(n-i)*sfs[i] for i in sfs.keys()]) / n / (n-1)

def Theta_H(sfs, n):
    """
    Calculate the theta estimate used in Fay & Wu's H
    2 * sum(i^2 ξi) / n / (n-1)
    """
    return 2 * sum([(i**2) * sfs[i] for i in sfs.keys()]) / n / (n-1)

def save_estimates(start, end, sfs, n):
    watt = Theta_W(sfs, n)
    ndiv = Theta_pi(sfs, n)
    tH = Theta_H(sfs, n)
    T_W.append((watt, end-start))
    T_pi.append((ndiv, end-start))
    T_H.append((tH, end-start))
    D.append(((end+start)/2, ndiv - watt))
    FayWuH.append(((end+start)/2, ndiv - tH))

ts = msprime.load("sweepfileXXX.hdf5")
#globals (yuck)
T_W, T_pi, T_H, D, FayWuH = [], [], [], [], []
incrementalSFS(ts, save_estimates)

#plot D here using something like 
#import matplotlib
#plt.plot(*zip(*D))
#plt.set_xlim(0,ts.get_sequence_length())