BoPeng / simuPOP

A general-purpose forward-time population genetics simulation environment.
http://bopeng.github.io/simuPOP/
GNU General Public License v2.0
29 stars 12 forks source link

Faster alternative to `DiscardIf` to reduce calls to RNG? #31

Closed ashander closed 7 years ago

ashander commented 7 years ago

I'm seeking to discard individuals based on random draws. One method to do this is DiscardIf but it is called for every individual, every generation. Writing a naive function to do the discards will be slow, then as it calls a random number generator many times.

Are there other ways to accomplish this that avoid calling a function once for every individual? Or, do we we just have to be clever in constructing a function to pass to DiscardIf to reduce calls to the random number generator?

The first example below demonstrates that DiscardIf is called every time. The second is slightly more clever and only generates random numbers once per generation.

naive

import simuPOP as sim
import numpy as np

class Discarder():
    '''Discard individuals based on 'mark' info; keep track of calls'''
    def __init__(self, survival):
        self._s = survival
        self._calls = 0

    def __call__(self, mark):
        self._calls += 1
        if mark == 0:
            return False
        elif np.random.random() > self._s[int(mark) - 1]:
            return True
        else:
            raise Exception

    def __str__(self):
        return str(self._calls)

pop = sim.Population(1000, infoFields='mark')
sim.initInfo(pop, lambda: np.random.randint(1, 20), infoFields='mark')
survival = np.random.beta(1, 1, 20)
discarder = Discarder(survival)
pop.evolve(postOps=[sim.DiscardIf(discarder)], gen=1)
print(discarder)
# 1000

pre-computed random numbers

class FastDiscarder(Discarder):
    '''Discard individuals based on 'mark' info; keep track of calls to numpy'''
    def __init__(self, survival):
        self._s = survival
        self._calls = 0
        self._total_calls = 0
        self._info = None

    def __call__(self, pop):
        self._total_calls += 1
        try:
            return next(self._info)
        except TypeError:
            self._info = self.compute_stuff(pop)
            return next(self._info)
        except StopIteration:
            self._info = None

    def compute_stuff(self, pop):
        self._calls += 1
        return (i.mark != 0 and p > self._s[int(i.mark) - 1]
                for i, p in zip(pop.allIndividuals(),
                                np.random.random(size=pop.popSize())))

    def __str__(self):
        return str(self._calls)

pop = sim.Population(1000, infoFields='mark')
sim.initInfo(pop, lambda: np.random.randint(1, 20), infoFields='mark')
survival = np.random.beta(1, 1, 20)
discarder = FastDiscarder(survival)
pop.evolve(postOps=[sim.DiscardIf(discarder)], gen=1)
print(discarder)
# 1
print(discarder._total_calls)
# 1000
BoPeng commented 7 years ago

From what I can see

class Discarder():
    '''Discard individuals based on 'mark' info; keep track of calls'''
    def __init__(self, survival):
        self._s = survival
        self._calls = 0

    def __call__(self, mark):
        self._calls += 1
        ## this can be done in batch (e.g. remove virtual subpopulation, or with fitness 0 etc
        if mark == 0:
            return False
       ## this looks like a fitness function. 
        elif np.random.random() > self._s[int(mark) - 1]:
            return True
        else:
            raise Exception

    def __str__(self):
        return str(self._calls)

It looks to me that you can assign fitness to individuals using something like

InfoExec('fitness=_s[mark]-1')

(_s need to be available as population variable), or

pop.setIndInfo([_s[ind.mark]-1 for ind in pop.individusls()], 'fitness')

(wrapped inside a PyOperator) and individuals would be selected according to fitness values. No random number generator is needed.

The latter method could be made more efficient if you can define VSPs by mark, and use something like

pop.setIndInfo(some_value, vsp_id)

Disclaimer: Just ideas, check reference manual for exact syntax.

ashander commented 7 years ago

Thanks! You're right that this is fitness -- it's survival so there will be viability selection on the value of mark. Batch removal of some values would help, and using simuPOP's selection/fitness features would be great but my end goal is for this survival (variation in viability) to also affect the population size.

Am I correct that there's no way to do this automatically with 'fitness'? (That's ok, I could use the values of mark to determine population size separately. But DiscardIf has the nice, realistic feature of non-survivors actually being removed from the population.)

BoPeng commented 7 years ago

So it sounds like you are looking for a non-during-mating selection operator? Basically removing individually with a probability of fitness? This was discussed quite a few years ago. The counter argument was that fitness in population genetics is "relative fitness" and is not probabilities. An implementation would either assume the fitness as probabilities, or scale fitness to probabilities, both are not good.

Now, if we assume that there is an information field with probabilities, it is indeed nice to be able to remove individuals according to the survival probabilities (thinking of ecological models that wipe out populations directly, not through effect on mating). Let me think of a good interface for this feature.

BoPeng commented 7 years ago

Would it be good enough if DiscardIf() accepts a removal probability (1 - survival probability) in addition to True or False? This looks like a very natural extension to its function, namely True (1) for discard, False (0) for keep, and value between (0-1) to remove with probability. This would allow you to pass an expression (e.g. 1-_s[mark]) as the probability for removal, or a function that returns a probability.

ashander commented 7 years ago

So it sounds like you are looking for a non-during-mating selection operator? Basically removing individually with a probability of fitness? This was discussed quite a few years ago. The counter argument was that fitness in population genetics is "relative fitness" and is not probabilities. An implementation would either assume the fitness as probabilities, or scale fitness to probabilities, both are not good.

Exactly. The counter argument makes sense for pop gen, but as you point out non-mating selection would be useful in many ecological models.

Would it be good enough if DiscardIf() accepts a removal probability (1 - survival probability) in addition to True or False? This looks like a very natural extension to its function, namely True (1) for discard, False (0) for keep, and value between (0-1) to remove with probability. This would allow you to pass an expression (e.g. 1-_s[mark]) as the probability for removal, or a function that returns a probability.

This would work. I agree it would be a natural extension to the current interface. It would also avoid always having to manage random number generation and conditionals at the python level for discarding.

BoPeng commented 7 years ago

Please test and let me know if it works as expected. The DiscardIf(0.1, subPops=vsp) usage seems to be particularly interesting in removing individuals in a VSP at random.

ashander commented 7 years ago

Will do, thanks!

ashander commented 7 years ago

Seems to work great with some informal testing. (But I haven't tried the VSP specific stuff).

Note, though, that at least for a simple case without VSPs the precomputing method is faster. There's a benchmark below. (Note I've had to redefine the way FastDiscarder works relative to above.)

For 1000 gens

naive: 5.222766880004201
fast discard: 1.107649122015573
new feature: 4.0499952919781
class FastDiscarder(Discarder):
    '''Discard individuals based on 'mark' info; keep track of calls to numpy'''
    def __init__(self, survival):
        self._s = survival
        self._calls = 0
        self._total_calls = 0
        self._info = iter(())

    def __call__(self, pop):
        self._total_calls += 1
        try:
            return next(self._info)
        except StopIteration:
            self._info = self.compute_stuff(pop)
            return next(self._info)

    def compute_stuff(self, pop):
        self._calls += 1
        return (i.mark != 0 and p > self._s[int(i.mark) - 1]
                for i, p in zip(pop.allIndividuals(),
                                np.random.random(size=pop.popSize())))

    def __str__(self):
        return str(self._calls)

benchmark

def naive():
    pop = sim.Population(1000, infoFields='mark')
    sim.initInfo(pop, lambda: np.random.randint(1, 20), infoFields='mark')
    survival = np.random.beta(1, 1, 20)
    discarder = Discarder(survival)
    # pop.evolve(postOps=[sim.DiscardIf(discarder)], gen=1)
    # print(discarder)
    # 1000
    return pop, discarder

def smarter():
    pop = sim.Population(10, infoFields='mark')
    sim.initInfo(pop, lambda: np.random.randint(1, 20), infoFields='mark')
    survival = np.random.beta(1, 1, 20)
    discarder = FastDiscarder(survival)
    # pop.evolve(postOps=[sim.DiscardIf(discarder)], gen=10)
    # sim.dump(pop)
    # print(discarder)
    # 1
    # print(discarder._total_calls)
    # 1000
    return pop, discarder

def new_way():
    pop = sim.Population(1000, infoFields='mark')
    sim.initInfo(pop, lambda: np.random.beta(1, 1), infoFields='mark')
    # with new interface, marks can act directly as survival
    # pop.evolve(postOps=[sim.DiscardIf(cond='mark')], gen=10)
    return pop

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("pop.evolve(postOps=[sim.DiscardIf(d)], gen=1000)",
                        number=10,
                        setup="from __main__ import naive; import simuPOP as sim; pop, d = naive()"))
    print(timeit.timeit("pop.evolve(postOps=[sim.DiscardIf(d)], gen=1000)",
                        number=10,
                        setup="from __main__ import smarter; import simuPOP as sim; pop, d = smarter()"))
    print(timeit.timeit("pop.evolve(postOps=[sim.DiscardIf(cond='mark')], gen=1000)",
                        number=10,
                        setup="from __main__ import new_way; import simuPOP as sim; pop = new_way()"))
BoPeng commented 7 years ago

Well, this is not very surprising because the cond='mark' method is not just a field, it is an evaluation of an expression (you can do every complex evaluation over there) and internally simuPOP needs to prepare a dictionary for each individual and evaluate the expression for each individual.

This can of course be optimized as I have done that for some other cases, just not for DiscardIf. I have submitted a patch for the case when a field name is specified, it should significantly boost the performance for your case.

BoPeng commented 7 years ago

For the record, https://github.com/BoPeng/simuPOP/commit/7419ad9d7e17631bf5ab05bae4f0a36d70ad1d53

ashander commented 7 years ago

Ah, thanks! I think I'll be able to use this directly. Even if end up needing something more complex, this will be a very helpful example if I try to submit a patch to optimize it.

As for the speed, I knew I must be missing something. So, do I have this right:

Absent some optimization at the c++ level as you've done here an expression will be setup and evaluated at the python level once for each individual in each generation (for the discarder or, say, InfoEval) or once for each generation for the whole population (for PyEval). (Based on docs for expression and statements )

BoPeng commented 7 years ago

Yes, you are absolutely right. These expression evaluations can be costly because a dictionary needs to be setup for the evaluation of each expression at the Python level. This is usually acceptable if it happens at each generation (PyEval), but can be slow for evaluating for each individual (InfoEval and InfoExec and all those hybrid during-mating operators).

The patch is very simple and optimizes the most frequently used case of name, but in theory most expressions (e.g. name + 0.1, name1 + name2, as long as no Python objects are involved) can be optimized by parsing the expressions and evaluate them at the C/C++ level. It is only a matter of how much time you have to spend on these things. :-)