Closed hyanwong closed 6 years ago
nice, glad to have it.
Thanks!
It would be cool to have this scenario/example be also added to the examples.py
script. We could envision several example selection/mutation sceanrios being possible through one interface python examples.py --scenario <SCENARIO NAME>
. What do you think?
As you mentioned elsewhere, some uses of the script cause an error:
$ python3 ./selective_sweep.py -of 0.99 100 -of 0.5 -d 123
simuPOP Version 1.1.8.3 : Copyright (c) 2004-2016 Bo Peng
Revision 4553 (Feb 11 2017) for Python 3.5.3 (64bit, 1thread)
Random Number Generator is set to mt19937 with random seed 0x7014829b339a476f.
This is the standard mutant allele version with 256 maximum allelic states.
For more information, please visit http://simupop.sourceforge.net,
or email simupop-list@lists.sourceforge.net (subscription required).
Options:
Namespace(chrom_length=1000, dominance_coefficient=0.5, logfile='-', max_generations=1000000, neut_mut_rate=1e-07, nsamples=100, output_frequency=[['0.99', '100'], ['0.5']], popsize=5000, recomb_rate=1e-07, seed=123, selection_coefficient=0.1, simplify_interval=500, treefile_prefix='sweepfile')
13:30:35 10/26/17 PDT
----------
Gen: 0. Focal allele at freq 0.000000. Neutral sites #seg/#fixed: 1 / 0
Reintroduced target variant at generation 0
Reintroduced target variant at generation 12
Reintroduced target variant at generation 16
Reintroduced target variant at generation 17
Reintroduced target variant at generation 27
Reintroduced target variant at generation 29
Gen: 50. Focal allele at freq 0.000400. Neutral sites #seg/#fixed: 5 / 0
Reintroduced target variant at generation 54
Reintroduced target variant at generation 55
Reintroduced target variant at generation 58
Gen: 100. Focal allele at freq 0.015900. Neutral sites #seg/#fixed: 12 / 0
Gen: 150. Focal allele at freq 0.418900. Neutral sites #seg/#fixed: 12 / 0
Loaded into tree sequence!
13:30:53 10/26/17 PDT
----------
Written out samples to sweepfile0.5.hdf5
13:30:53 10/26/17 PDT
----------
Traceback (most recent call last):
File "/home/jaime/lib/ftprime/manuscript/sims/src/ftprime/ftprime/recomb_collector.py", line 138, in collect_recombs
children=(child_chrom,))
File "/home/jaime/lib/ftprime/manuscript/sims/src/ftprime/ftprime/argrecorder.py", line 223, in add_record
".add_individual().")
ValueError: Parent 1657941's birth time has not been recorded with .add_individual().
Traceback (most recent call last):
File "./selective_sweep.py", line 240, in <module>
gen = args.max_generations
File "/home/jaime/miniconda3/envs/ftprime-dev/lib/python3.5/site-packages/simuPOP/__init__.py", line 406, in evolve_pop
gen = simu.evolve(initOps, preOps, matingScheme, postOps, finalOps, gen)
RuntimeError: Failed to send output to a function.
I'll take a closer look later. For now I'll just note for now that it seems to be the combination of multiple -of
that cause the issue above, not the values passed. For example, this also causes the error python3 ./selective_sweep.py -of 0.05 -of 0.1 -d 123
My guess is that in outputing the tree sequence we change the state of the tables in the RecombCollector
instance rc
in a way that breaks it for adding further individuals
The multiple -of
values cause the msprime files to be output at intermediate times, so I guess there is something in the code that means that the following commands (lines 122-124) can't be run in the middle of the simulation without triggering the bug (or more precisely, the simulation can't carry on after these lines have been called).
diploid_samples = random.sample(pop.indInfo("ind_id"), args.nsamples)
rc.simplify(diploid_samples)
ts = rc.args.tree_sequence()
Also note that I left class GammaDistributedFitness
in there in case it was useful to anyone else, but it is unused, and hence can probably be deleted (especially as it is present in the other example file)
Nice. Yes that block this is the issue!
The block below will strip out all individuals from the ts who are not named as samples
diploid_samples = random.sample(pop.indInfo("ind_id"), args.nsamples)
rc.simplify(diploid_samples)
ts = rc.args.tree_sequence()
Change the first line to
diploid_samples = pop.indInfo("ind_id")
Then the tree will still have all individuals being tracked.
Actually, you can just delete the first two lines of the block. Outputting the ts will simplify it to extant individuals anyway
Was just about to note that was the problem, thanks!
However, I'll probably want to run this on a very large population and only take sample subsets. Is there a way to output a subset of the individuals in msprime format without nixing the simulation as a whole?
I guess I could take a copy of the ts before simplification. I'm not sure whether ts.copy() works, though.
Hang on, I'm being dense here. I think I can just do ts.simplify(subsample) and it will return the simplified copy, while leaving the original intact.
I'm not sure that's the case, but check.
you may need to make a deep copy.
@jeromekelleher can confirm, but I think the current only way to make a deep copy of a ts is through the tables api: https://github.com/jeromekelleher/msprime/blob/ec55bc53645cc2233ed5ca0ce681a596cc361996/msprime/tables.py
On Thu, Oct 26, 2017 at 2:02 PM, Yan Wong notifications@github.com wrote:
Hang on, I'm being dense here. I think I can just do ts.simplify(subsample) and it will return the simplified copy, while leaving the original intact.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ashander/ftprime/pull/48#issuecomment-339800111, or mute the thread https://github.com/notifications/unsubscribe-auth/AAfLOBUPatmEKFtImpTJigR7FaSkafXOks5swPN9gaJpZM4QIJ7B .
@ashander does the rc.args.tree_sequence()
common return a tree sequence containing references to internal ftprime stuff then - e.g. if I modify it, will it mess up the rc tracking?
I've just tested, and it seems to work fine if I carry out simplify
on the tree sequence, rather than on the RecombCollector. But there is a slight issue in that if I randomly select 2k tips from the treeseq, they will not be from k individuals.
To get around this, is there any way to specify a set of individuals in the population and map them to tips in the resulting msprime ts after calling rc.args.tree_sequence()
? If so, I think can nail this.
@hyanwong you were right -- in msprime ts.simplify
uses the tables api to make a deep copy so doesn't modify ts
, in rc.args.tree_sequence
we use ts.simplify
so you'll be good with applying this patch
diploid_samples = random.sample(pop.indInfo("ind_id"), args.nsamples)
- rc.simplify(diploid_samples)
- ts = rc.args.tree_sequence()
+ ts = rc.args.tree_sequence(diploid_samples)
Ah, that's an even better fix. So rc.args.tree_sequence()
simplifies down using individual IDs (for which there are N), not msprime tip ids (of which there are 2N)?
The first step in rc.simplify
is to convert samples
(which should be diploid IDs) to the underlying haploid IDs: https://github.com/ashander/ftprime/blob/master/ftprime/recomb_collector.py#L163 , thus obtaining a tree sequence with 2N tips.
(fixed link)
@petrelharp thanks. I think perhaps that the call rc.args.tree_sequence(ids)
uses haploid ids, rather than diploid ones, though. So I can't do rc.args.tree_sequence(pop.indInfo("ind_id"), args.nsamples)
as @ashander suggested (I have tried and seem to get errors: File "/usr/local/lib/python3.5/site-packages/ftprime/argrecorder.py", line 187, in check_ids raise ValueError("Input ID " + str(u) + " not recorded.") ValueError: Input ID 112356.0 not recorded.
). So I guess I need an external way to do the conversion, or perhaps be able to specify in rc.args.tree_sequence()
whether the IDs are diploid or haploid ones?
As for returning the tree sequence with and without simplifying down to a set of samples, compare argrecorder.tree_sequence to argrecorder.simplify -- we don't have the first function written into recombcollector, but it'd be trivial to add (just rc.args.tree_sequence()
).
Here you go:
class RecombCollector(object)
# ...
def tree_sequence(self, samples):
"""
Returns a tree sequence, that retains only information relevant
to the diploid individuals listed in `samples`.
:param list samples: A list of diploid input individual IDs.
"""
haploid_ids = [self.i2c(i,p) for i in samples for p in (0,1)]
return self.args.tree_sequence(haploid_ids)
untested, obviously.
Do you agree, @ashander ?
@hyanwong sorry, my snippet above was misleading. only simplify, as @petrelharp mentioned, converts to diploids.
so you want (which you all have probably already figured out)
diploid_samples = random.sample(pop.indInfo("ind_id"), args.nsamples)
- rc.simplify(diploid_samples)
- ts = rc.args.tree_sequence()
+ haploid_ids = [rc.i2c(i,p) for i in diploid_samples for p in (0,1)]
+ ts = rc.args.tree_sequence(haploid_ids)
edited to remove think-o Peter spotted
You meant haploid_ids
on that last line there.
You meant
haploid_ids
on that last line there.
Yes good catch. Man the non-realtime nature of github comments is a bit frustrating
It feels dirty re-editing GH comments after hitting the button :) Especially someone else's! ;)
@petrelharp you need a return statement in your version too :)
Testing via monkey patching now. Will let you know.
wow, I didn't realize you could do that. (that I could edit your comment)
Jaime adds
this seems even more spooky and tbh a bad UI decision. anyone can edit anyone else's comment!?
Yes, a bit spooky.
Monkey patching shows that @petrelharp 's code (with a return) works, thanks for your help.
I've just pushed this into my code, but I guess you might want to take lines 16-25 and put them in the main codebase, @ashander? In which case I can remove them from the example file.
Sorry - just realised I could make the change to the main file (not example) in the PR. I'll do that.
sorry @hyanwong I just did it and pushed
OK, will revert - you're too quick :)
Was trying to keep up ;) anyway many thanks for the discussion and contribution
OK. One other thing. Way up at the top I said
- It would be cool to have this scenario/example be also added to the examples.py script. We could envision several example selection/mutation sceanrios being possible through one interface python examples.py --scenario
.
What do you think @petrelharp and @hyanwong ? As I type this I'm taking back my earlier thinking --- it's probably good to just have a several scripts of different examples. There will be some duplication but they'll serve as nice, separate examples of how to do things rather than one behemoth script
FWIW, I find separate scripts are easier to read. On those lines, I guess I can take out the GammaDistributedFitness
class from my example, which again, makes it clearer.
At the risk of proliferating scripts, you could have a run-examples
script that simply calls each separate example script in turn, a bit like unit testing.
Hmm, having difficulty getting my ftprime installation to pick up the new tree_sequence method. Hang on a second before accepting my PR.
AttributeError: 'RecombCollector' object has no attribute 'tree_sequence'
@hyanwong a tip (which you may know) instead of merging master as you just did, you can incorporate the new master and revert all at once by:
git rebase -i master
will drop you into vim or whatever $EDITOR
is, where you can delete these two lines and leave the remaining ones
pick 3585889 Monkey patch in a `tree_sequence` method on a RecombCollector
pick 1ce4d0b Add tree_sequence functionality directly into RecombCollector
finally you do
git push -f origin selective-sweep-example
@ashander thanks. I'm always a bit wary with git, so I try to stick to what I know works for me. It seems rather too easy to cock up with new commands that I have never tried before.
Hmm, having difficulty getting my ftprime installation to pick up the new tree_sequence method.
You should reinstall using pip install --upgrade -e .
(run in the project root). This will make sure change to code in main module takes effect immediately
Duh, I was installing using pip
not pip3
. So yes, my code does now work with the updated master.
I haven't actually checked whether it is doing the right thing biologically yet! But at least it doesn't crash. The number of segregating / fixed sites seems rather low, but I guess this is reporting the sites that have been generated during the simulation, rather than any sites which were imposed during the init_ts
setup. However, it looks to me as if the initial tree sequence setup using msprime.simulate
does not inject mutations. Is that correct?
In which case I think I should probably incorporate mutations into msprime.simulate
at the same rate as specified in the forwards simulation part, otherwise the sequences will be less diverse than expected. Does this seem justifiable, @ashander?
I've just added mutation & recombination into the initial msprime.simulate() setup call, but I see that afterwards the code uses
# initially, population is monogenic
init_geno=[sim.InitGenotype(freq=1.0)]
Which negates the point really! Is there a way to tell simuPOP to start with the genotypes generated by msprime? E.g. can I just drop the init_geno
call and will it work magically?
You're correct that the initial_ts has no mutations. Also the forward simulated mutations are not being put on the trees yet (sorry, forgot to mention that earlier). We plan to link them up but they aren't yet. Removing the line that mutates them from the forward sim will make it faster too
Currently, you can add neutral mutations to the whole tree during the stage
when you write the trees. (After simplifying them.) That way the mutations
aren't in the trees that RecombCollector
is tracking and so won't be
added more than once to a given section of the tree.
As for
Which negates the point really! Is there a way to tell simuPOP to start with the genotypes generated by msprime? E.g. can I just drop the init_geno call and will it work magically?
I don't think anything will work magically. But not simulating any neutral muts in forward time should avoid this issue. (And you're already managing the initial freq of the selected allele, yes?)
On Thu, Oct 26, 2017 at 3:10 PM, Yan Wong notifications@github.com wrote:
I've just added mutation & recombination into the initial msprime.simulate() setup call, but I see that afterwards the code uses
initially, population is monogenic
init_geno=[sim.InitGenotype(freq=1.0)]
Which negates the point really! Is there a way to tell simuPOP to start with the genotypes generated by msprime? E.g. can I just drop the init_geno call and will it work magically?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ashander/ftprime/pull/48#issuecomment-339815784, or mute the thread https://github.com/notifications/unsubscribe-auth/AAfLOL2zBv7n8_bEu8Um_f8ikqDNbnO0ks5swQNSgaJpZM4QIJ7B .
Oh, of course. I'm being stupid. No point in adding neutral mutations beforehand, when they can simply be added once we know the trees (since the prob of mutation will be proportional to branch length, regardless of selection).
I guess that means there isn't much point ever using the SNPMutator functionality?
So how do I add mutations on afterwards? I guess this is an msprime thing that I should know, but perhaps you can say off the top of your head?
p.s. I guess it's still worth specifying the recombination rate in the initial msprime.simulate
command, isn't it? Some of the intact haplotype lengths present in the init stages may still end up relevant in the final output files. At the moment I'm doing
recombination_map = msprime.RecombinationMap.uniform_map(length, args.recomb_rate, length)
init_ts = msprime.simulate(2*len(first_gen), recombination_map=recombination_map)
def add_mutations(ts, mut_seed, mut_rate):
rng = msprime.RandomGenerator(mut_seed)
nodes = msprime.NodeTable()
edgesets = msprime.EdgesetTable()
sites = msprime.SiteTable()
mutations = msprime.MutationTable()
ts.dump_tables(nodes=nodes, edgesets=edgesets, sites=sites)
mutgen = msprime.MutationGenerator(rng, mut_rate)
mutgen.generate(nodes, edgesets, sites, mutations)
return msprime.load_tables(
nodes=nodes, edgesets=edgesets, sites=sites, mutations=mutations)
Thanks for (1). One thorny issue, although it doesn't matter for my use-case, is that if we add mutations every time we output at TS, we will add a different set of mutations every time. So although the trees will be taken from successive sampling of the same population, the mutations will not be. I can't quite see how to get around this. I'm not sure if it is valid to simply add to the set of mutations in later generations.
For (2), I agree completely. It might be worth adjusting the other example files so that they incorporate the recombination rate specified for the forwards simulation into the msprime.simulate()
setup stage too? I guess at the moment it must be using the default recombination rate, which will almost certainly differ from the user-specified one. The use of the msprime.RecombinationMap.uniform_map
call is taken from some of Jerome's code where we have finite sites for recombination, so it is probably the right way to go.
Merging #48 into master will decrease coverage by
18.39%
. The diff coverage is33.33%
.
@@ Coverage Diff @@
## master #48 +/- ##
==========================================
- Coverage 93.11% 74.72% -18.4%
==========================================
Files 4 4
Lines 247 273 +26
==========================================
- Hits 230 204 -26
- Misses 17 69 +52
Impacted Files | Coverage Δ | |
---|---|---|
ftprime/recomb_collector.py | 86.95% <33.33%> (-7.64%) |
:arrow_down: |
ftprime/benchmarker.py | 45.83% <0%> (-50.01%) |
:arrow_down: |
ftprime/argrecorder.py | 73.59% <0%> (-18.25%) |
:arrow_down: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 7b985a2...1c72d75. Read the comment docs.
Phew, this was quite a journey! I think the questions re the msprime API were resolved though, right? Tree sequences are immutable, so simplify always returns a new tree sequence.
Thanks @jeromekelleher. Yes, ts mutability was answered. A remaining (minor) question is if it is possible to add mutations after the simulation to a sampled TS from a large population, then continue evolving the population forward in time, make another sample, and add mutations to that while retaining the original mutations on the appropriate branches.
I suspect this is very difficult, and if we want to retain mutations like this, it would be far easier to track them forward in time, rather than figure out how to partially retain stuff. But perhaps I'm missing something.
Thanks @jeromekelleher. Yes, ts mutability was answered. A remaining (minor) question is if it is possible to add mutations after the simulation to a sampled TS from a large population, then continue evolving the population forward in time, make another sample, and add mutations to that while retaining the original mutations on the appropriate branches.
There's no reason we can't do this, but there will need to be plumbing in ftprime to allow us add in the mutations and keep track of them. The simplify machinery will keep any mutations that are defined around perfectly happily.
This seems like quite a complicated way of doing things though... Why not run the whole simulation forwards time, throw on the mutations, and then afterwards look at the slices of the tree sequence that you are interested in? So, if you wanted to see what the tree sequence looked like at time t, just build a new tree sequence where you throw away all edges that are younger than t. (This is easy with the tables API; I can talk you through it.)
You should do this on the unsimplified tree sequence though, or it'll get complicated trying to figure out who your samples should be. This means that you get the raw data back from ftprime that contains everything that every happened. Throw mutations down on this, and then do surgery figuring out what slices you want of it. You can them simplify the slices you're interested in, and these will be nice small tree sequences. The full history will be big, but I think you're only interested in fairly modest examples anyway, so it should be OK.
Selective sweep example. Still needs some testing