ashander / ftprime

Forward-time simulation of the msprime data structure (for development)
2 stars 1 forks source link

Selective sweep example #48

Closed hyanwong closed 6 years ago

hyanwong commented 6 years ago

Selective sweep example. Still needs some testing

petrelharp commented 6 years ago

nice, glad to have it.

ashander commented 6 years ago

Thanks!

  1. It would be cool to have this scenario/example be also added to the examples.py script. We could envision several example selection/mutation sceanrios being possible through one interface python examples.py --scenario <SCENARIO NAME>. What do you think?

  2. As you mentioned elsewhere, some uses of the script cause an error:

 $ python3 ./selective_sweep.py -of 0.99 100 -of 0.5 -d 123
simuPOP Version 1.1.8.3 : Copyright (c) 2004-2016 Bo Peng
Revision 4553 (Feb 11 2017) for Python 3.5.3 (64bit, 1thread)
Random Number Generator is set to mt19937 with random seed 0x7014829b339a476f.
This is the standard mutant allele version with 256 maximum allelic states.
For more information, please visit http://simupop.sourceforge.net,
or email simupop-list@lists.sourceforge.net (subscription required).
Options:
Namespace(chrom_length=1000, dominance_coefficient=0.5, logfile='-', max_generations=1000000, neut_mut_rate=1e-07, nsamples=100, output_frequency=[['0.99', '100'], ['0.5']], popsize=5000, recomb_rate=1e-07, seed=123, selection_coefficient=0.1, simplify_interval=500, treefile_prefix='sweepfile')
13:30:35 10/26/17 PDT
----------
Gen:  0. Focal allele at freq 0.000000. Neutral sites #seg/#fixed: 1 / 0
Reintroduced target variant at generation 0
Reintroduced target variant at generation 12
Reintroduced target variant at generation 16
Reintroduced target variant at generation 17
Reintroduced target variant at generation 27
Reintroduced target variant at generation 29
Gen: 50. Focal allele at freq 0.000400. Neutral sites #seg/#fixed: 5 / 0
Reintroduced target variant at generation 54
Reintroduced target variant at generation 55
Reintroduced target variant at generation 58
Gen: 100. Focal allele at freq 0.015900. Neutral sites #seg/#fixed: 12 / 0
Gen: 150. Focal allele at freq 0.418900. Neutral sites #seg/#fixed: 12 / 0
Loaded into tree sequence!
13:30:53 10/26/17 PDT
----------
Written out samples to sweepfile0.5.hdf5
13:30:53 10/26/17 PDT
----------
Traceback (most recent call last):
  File "/home/jaime/lib/ftprime/manuscript/sims/src/ftprime/ftprime/recomb_collector.py", line 138, in collect_recombs
    children=(child_chrom,))
  File "/home/jaime/lib/ftprime/manuscript/sims/src/ftprime/ftprime/argrecorder.py", line 223, in add_record
    ".add_individual().")
ValueError: Parent 1657941's birth time has not been recorded with .add_individual().
Traceback (most recent call last):
  File "./selective_sweep.py", line 240, in <module>
    gen = args.max_generations
  File "/home/jaime/miniconda3/envs/ftprime-dev/lib/python3.5/site-packages/simuPOP/__init__.py", line 406, in evolve_pop
    gen = simu.evolve(initOps, preOps, matingScheme, postOps, finalOps, gen)
RuntimeError: Failed to send output to a function.
ashander commented 6 years ago

I'll take a closer look later. For now I'll just note for now that it seems to be the combination of multiple -of that cause the issue above, not the values passed. For example, this also causes the error python3 ./selective_sweep.py -of 0.05 -of 0.1 -d 123

My guess is that in outputing the tree sequence we change the state of the tables in the RecombCollector instance rc in a way that breaks it for adding further individuals

hyanwong commented 6 years ago

The multiple -of values cause the msprime files to be output at intermediate times, so I guess there is something in the code that means that the following commands (lines 122-124) can't be run in the middle of the simulation without triggering the bug (or more precisely, the simulation can't carry on after these lines have been called).

diploid_samples = random.sample(pop.indInfo("ind_id"), args.nsamples)
rc.simplify(diploid_samples)
ts = rc.args.tree_sequence()
hyanwong commented 6 years ago

Also note that I left class GammaDistributedFitness in there in case it was useful to anyone else, but it is unused, and hence can probably be deleted (especially as it is present in the other example file)

ashander commented 6 years ago

Nice. Yes that block this is the issue!

The block below will strip out all individuals from the ts who are not named as samples

diploid_samples = random.sample(pop.indInfo("ind_id"), args.nsamples)
rc.simplify(diploid_samples)
ts = rc.args.tree_sequence()

Change the first line to

diploid_samples = pop.indInfo("ind_id")

Then the tree will still have all individuals being tracked.

ashander commented 6 years ago

Actually, you can just delete the first two lines of the block. Outputting the ts will simplify it to extant individuals anyway

hyanwong commented 6 years ago

Was just about to note that was the problem, thanks!

However, I'll probably want to run this on a very large population and only take sample subsets. Is there a way to output a subset of the individuals in msprime format without nixing the simulation as a whole?

hyanwong commented 6 years ago

I guess I could take a copy of the ts before simplification. I'm not sure whether ts.copy() works, though.

hyanwong commented 6 years ago

Hang on, I'm being dense here. I think I can just do ts.simplify(subsample) and it will return the simplified copy, while leaving the original intact.

ashander commented 6 years ago

I'm not sure that's the case, but check.

you may need to make a deep copy.

@jeromekelleher can confirm, but I think the current only way to make a deep copy of a ts is through the tables api: https://github.com/jeromekelleher/msprime/blob/ec55bc53645cc2233ed5ca0ce681a596cc361996/msprime/tables.py

On Thu, Oct 26, 2017 at 2:02 PM, Yan Wong notifications@github.com wrote:

Hang on, I'm being dense here. I think I can just do ts.simplify(subsample) and it will return the simplified copy, while leaving the original intact.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ashander/ftprime/pull/48#issuecomment-339800111, or mute the thread https://github.com/notifications/unsubscribe-auth/AAfLOBUPatmEKFtImpTJigR7FaSkafXOks5swPN9gaJpZM4QIJ7B .

hyanwong commented 6 years ago

@ashander does the rc.args.tree_sequence() common return a tree sequence containing references to internal ftprime stuff then - e.g. if I modify it, will it mess up the rc tracking?

hyanwong commented 6 years ago

I've just tested, and it seems to work fine if I carry out simplify on the tree sequence, rather than on the RecombCollector. But there is a slight issue in that if I randomly select 2k tips from the treeseq, they will not be from k individuals.

To get around this, is there any way to specify a set of individuals in the population and map them to tips in the resulting msprime ts after calling rc.args.tree_sequence()? If so, I think can nail this.

ashander commented 6 years ago

@hyanwong you were right -- in msprime ts.simplify uses the tables api to make a deep copy so doesn't modify ts, in rc.args.tree_sequence we use ts.simplify so you'll be good with applying this patch

             diploid_samples = random.sample(pop.indInfo("ind_id"), args.nsamples)
-            rc.simplify(diploid_samples)
-            ts = rc.args.tree_sequence()
+            ts = rc.args.tree_sequence(diploid_samples)
hyanwong commented 6 years ago

Ah, that's an even better fix. So rc.args.tree_sequence() simplifies down using individual IDs (for which there are N), not msprime tip ids (of which there are 2N)?

petrelharp commented 6 years ago

The first step in rc.simplify is to convert samples (which should be diploid IDs) to the underlying haploid IDs: https://github.com/ashander/ftprime/blob/master/ftprime/recomb_collector.py#L163 , thus obtaining a tree sequence with 2N tips.

(fixed link)

hyanwong commented 6 years ago

@petrelharp thanks. I think perhaps that the call rc.args.tree_sequence(ids) uses haploid ids, rather than diploid ones, though. So I can't do rc.args.tree_sequence(pop.indInfo("ind_id"), args.nsamples) as @ashander suggested (I have tried and seem to get errors: File "/usr/local/lib/python3.5/site-packages/ftprime/argrecorder.py", line 187, in check_ids raise ValueError("Input ID " + str(u) + " not recorded.") ValueError: Input ID 112356.0 not recorded.). So I guess I need an external way to do the conversion, or perhaps be able to specify in rc.args.tree_sequence() whether the IDs are diploid or haploid ones?

petrelharp commented 6 years ago

As for returning the tree sequence with and without simplifying down to a set of samples, compare argrecorder.tree_sequence to argrecorder.simplify -- we don't have the first function written into recombcollector, but it'd be trivial to add (just rc.args.tree_sequence()).

petrelharp commented 6 years ago

Here you go:

class RecombCollector(object)

    # ...

    def tree_sequence(self, samples):
        """
        Returns a tree sequence, that retains only information relevant
        to the diploid individuals listed in `samples`.
        :param list samples: A list of diploid input individual IDs.
        """
        haploid_ids = [self.i2c(i,p) for i in samples for p in (0,1)]
        return self.args.tree_sequence(haploid_ids)
petrelharp commented 6 years ago

untested, obviously.

petrelharp commented 6 years ago

Do you agree, @ashander ?

ashander commented 6 years ago

@hyanwong sorry, my snippet above was misleading. only simplify, as @petrelharp mentioned, converts to diploids.

so you want (which you all have probably already figured out)

             diploid_samples = random.sample(pop.indInfo("ind_id"), args.nsamples)
-            rc.simplify(diploid_samples)
-            ts = rc.args.tree_sequence()
+            haploid_ids = [rc.i2c(i,p) for i in diploid_samples for p in (0,1)]
+            ts = rc.args.tree_sequence(haploid_ids)

edited to remove think-o Peter spotted

petrelharp commented 6 years ago

You meant haploid_ids on that last line there.

ashander commented 6 years ago

You meant haploid_ids on that last line there.

Yes good catch. Man the non-realtime nature of github comments is a bit frustrating

hyanwong commented 6 years ago

It feels dirty re-editing GH comments after hitting the button :) Especially someone else's! ;)

hyanwong commented 6 years ago

@petrelharp you need a return statement in your version too :)

Testing via monkey patching now. Will let you know.

petrelharp commented 6 years ago

wow, I didn't realize you could do that. (that I could edit your comment)

Jaime adds

this seems even more spooky and tbh a bad UI decision. anyone can edit anyone else's comment!?

hyanwong commented 6 years ago

Yes, a bit spooky.

Monkey patching shows that @petrelharp 's code (with a return) works, thanks for your help.

I've just pushed this into my code, but I guess you might want to take lines 16-25 and put them in the main codebase, @ashander? In which case I can remove them from the example file.

hyanwong commented 6 years ago

Sorry - just realised I could make the change to the main file (not example) in the PR. I'll do that.

ashander commented 6 years ago

sorry @hyanwong I just did it and pushed

hyanwong commented 6 years ago

OK, will revert - you're too quick :)

ashander commented 6 years ago

Was trying to keep up ;) anyway many thanks for the discussion and contribution

ashander commented 6 years ago

OK. One other thing. Way up at the top I said

  1. It would be cool to have this scenario/example be also added to the examples.py script. We could envision several example selection/mutation sceanrios being possible through one interface python examples.py --scenario .

What do you think @petrelharp and @hyanwong ? As I type this I'm taking back my earlier thinking --- it's probably good to just have a several scripts of different examples. There will be some duplication but they'll serve as nice, separate examples of how to do things rather than one behemoth script

hyanwong commented 6 years ago

FWIW, I find separate scripts are easier to read. On those lines, I guess I can take out the GammaDistributedFitness class from my example, which again, makes it clearer.

hyanwong commented 6 years ago

At the risk of proliferating scripts, you could have a run-examples script that simply calls each separate example script in turn, a bit like unit testing.

hyanwong commented 6 years ago

Hmm, having difficulty getting my ftprime installation to pick up the new tree_sequence method. Hang on a second before accepting my PR.

AttributeError: 'RecombCollector' object has no attribute 'tree_sequence'

ashander commented 6 years ago

@hyanwong a tip (which you may know) instead of merging master as you just did, you can incorporate the new master and revert all at once by:

git rebase -i master

will drop you into vim or whatever $EDITOR is, where you can delete these two lines and leave the remaining ones

pick 3585889 Monkey patch in a `tree_sequence` method on a RecombCollector
pick 1ce4d0b Add tree_sequence functionality directly into RecombCollector

finally you do

git push -f origin selective-sweep-example
hyanwong commented 6 years ago

@ashander thanks. I'm always a bit wary with git, so I try to stick to what I know works for me. It seems rather too easy to cock up with new commands that I have never tried before.

ashander commented 6 years ago

Hmm, having difficulty getting my ftprime installation to pick up the new tree_sequence method.

You should reinstall using pip install --upgrade -e . (run in the project root). This will make sure change to code in main module takes effect immediately

hyanwong commented 6 years ago

Duh, I was installing using pip not pip3. So yes, my code does now work with the updated master.

I haven't actually checked whether it is doing the right thing biologically yet! But at least it doesn't crash. The number of segregating / fixed sites seems rather low, but I guess this is reporting the sites that have been generated during the simulation, rather than any sites which were imposed during the init_ts setup. However, it looks to me as if the initial tree sequence setup using msprime.simulate does not inject mutations. Is that correct?

In which case I think I should probably incorporate mutations into msprime.simulate at the same rate as specified in the forwards simulation part, otherwise the sequences will be less diverse than expected. Does this seem justifiable, @ashander?

hyanwong commented 6 years ago

I've just added mutation & recombination into the initial msprime.simulate() setup call, but I see that afterwards the code uses

# initially, population is monogenic
init_geno=[sim.InitGenotype(freq=1.0)]

Which negates the point really! Is there a way to tell simuPOP to start with the genotypes generated by msprime? E.g. can I just drop the init_geno call and will it work magically?

ashander commented 6 years ago

You're correct that the initial_ts has no mutations. Also the forward simulated mutations are not being put on the trees yet (sorry, forgot to mention that earlier). We plan to link them up but they aren't yet. Removing the line that mutates them from the forward sim will make it faster too

Currently, you can add neutral mutations to the whole tree during the stage when you write the trees. (After simplifying them.) That way the mutations aren't in the trees that RecombCollector is tracking and so won't be added more than once to a given section of the tree.

As for

Which negates the point really! Is there a way to tell simuPOP to start with the genotypes generated by msprime? E.g. can I just drop the init_geno call and will it work magically?

I don't think anything will work magically. But not simulating any neutral muts in forward time should avoid this issue. (And you're already managing the initial freq of the selected allele, yes?)

On Thu, Oct 26, 2017 at 3:10 PM, Yan Wong notifications@github.com wrote:

I've just added mutation & recombination into the initial msprime.simulate() setup call, but I see that afterwards the code uses

initially, population is monogenic

init_geno=[sim.InitGenotype(freq=1.0)]

Which negates the point really! Is there a way to tell simuPOP to start with the genotypes generated by msprime? E.g. can I just drop the init_geno call and will it work magically?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ashander/ftprime/pull/48#issuecomment-339815784, or mute the thread https://github.com/notifications/unsubscribe-auth/AAfLOL2zBv7n8_bEu8Um_f8ikqDNbnO0ks5swQNSgaJpZM4QIJ7B .

hyanwong commented 6 years ago

Oh, of course. I'm being stupid. No point in adding neutral mutations beforehand, when they can simply be added once we know the trees (since the prob of mutation will be proportional to branch length, regardless of selection).

I guess that means there isn't much point ever using the SNPMutator functionality?

So how do I add mutations on afterwards? I guess this is an msprime thing that I should know, but perhaps you can say off the top of your head?

hyanwong commented 6 years ago

p.s. I guess it's still worth specifying the recombination rate in the initial msprime.simulate command, isn't it? Some of the intact haplotype lengths present in the init stages may still end up relevant in the final output files. At the moment I'm doing

recombination_map = msprime.RecombinationMap.uniform_map(length, args.recomb_rate, length)

init_ts = msprime.simulate(2*len(first_gen),  recombination_map=recombination_map)
ashander commented 6 years ago
  1. Here's an example function that will add mutations to an existing tree using tables api. BUT it's from before the Edgesets->Edges transition so you'll need to modify it:
def add_mutations(ts, mut_seed, mut_rate):                                                                                                  
    rng = msprime.RandomGenerator(mut_seed)                                                                                                 
    nodes = msprime.NodeTable()                                                                                                             
    edgesets = msprime.EdgesetTable()                                                                                                       
    sites = msprime.SiteTable()                                                                                                             
    mutations = msprime.MutationTable()                                                                                                     
    ts.dump_tables(nodes=nodes, edgesets=edgesets, sites=sites)                                                                             
    mutgen = msprime.MutationGenerator(rng, mut_rate)                                                                                       
    mutgen.generate(nodes, edgesets, sites, mutations)                                                                                      

    return msprime.load_tables(                                                                                                             
        nodes=nodes, edgesets=edgesets, sites=sites, mutations=mutations)          
  1. On recombination: seems wise to have a consistent recomb rate on the initial and fwd simulated ts
hyanwong commented 6 years ago

Thanks for (1). One thorny issue, although it doesn't matter for my use-case, is that if we add mutations every time we output at TS, we will add a different set of mutations every time. So although the trees will be taken from successive sampling of the same population, the mutations will not be. I can't quite see how to get around this. I'm not sure if it is valid to simply add to the set of mutations in later generations.

For (2), I agree completely. It might be worth adjusting the other example files so that they incorporate the recombination rate specified for the forwards simulation into the msprime.simulate() setup stage too? I guess at the moment it must be using the default recombination rate, which will almost certainly differ from the user-specified one. The use of the msprime.RecombinationMap.uniform_map call is taken from some of Jerome's code where we have finite sites for recombination, so it is probably the right way to go.

codecov[bot] commented 6 years ago

Codecov Report

Merging #48 into master will decrease coverage by 18.39%. The diff coverage is 33.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #48      +/-   ##
==========================================
- Coverage   93.11%   74.72%   -18.4%     
==========================================
  Files           4        4              
  Lines         247      273      +26     
==========================================
- Hits          230      204      -26     
- Misses         17       69      +52
Impacted Files Coverage Δ
ftprime/recomb_collector.py 86.95% <33.33%> (-7.64%) :arrow_down:
ftprime/benchmarker.py 45.83% <0%> (-50.01%) :arrow_down:
ftprime/argrecorder.py 73.59% <0%> (-18.25%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 7b985a2...1c72d75. Read the comment docs.

jeromekelleher commented 6 years ago

Phew, this was quite a journey! I think the questions re the msprime API were resolved though, right? Tree sequences are immutable, so simplify always returns a new tree sequence.

hyanwong commented 6 years ago

Thanks @jeromekelleher. Yes, ts mutability was answered. A remaining (minor) question is if it is possible to add mutations after the simulation to a sampled TS from a large population, then continue evolving the population forward in time, make another sample, and add mutations to that while retaining the original mutations on the appropriate branches.

I suspect this is very difficult, and if we want to retain mutations like this, it would be far easier to track them forward in time, rather than figure out how to partially retain stuff. But perhaps I'm missing something.

jeromekelleher commented 6 years ago

Thanks @jeromekelleher. Yes, ts mutability was answered. A remaining (minor) question is if it is possible to add mutations after the simulation to a sampled TS from a large population, then continue evolving the population forward in time, make another sample, and add mutations to that while retaining the original mutations on the appropriate branches.

There's no reason we can't do this, but there will need to be plumbing in ftprime to allow us add in the mutations and keep track of them. The simplify machinery will keep any mutations that are defined around perfectly happily.

This seems like quite a complicated way of doing things though... Why not run the whole simulation forwards time, throw on the mutations, and then afterwards look at the slices of the tree sequence that you are interested in? So, if you wanted to see what the tree sequence looked like at time t, just build a new tree sequence where you throw away all edges that are younger than t. (This is easy with the tables API; I can talk you through it.)

You should do this on the unsimplified tree sequence though, or it'll get complicated trying to figure out who your samples should be. This means that you get the raw data back from ftprime that contains everything that every happened. Throw mutations down on this, and then do surgery figuring out what slices you want of it. You can them simplify the slices you're interested in, and these will be nice small tree sequences. The full history will be big, but I think you're only interested in fairly modest examples anyway, so it should be OK.