AureumChaos / LEAP

A general purpose Library for Evolutionary Algorithms in Python.
Academic Free License v3.0

Possible birth counter race condition #244

Closed (markcoletti closed this issue 1 year ago)

markcoletti commented 1 year ago

Where I explore the possibility that there is a race condition in calculating the birth counter in the ASEA.

markcoletti commented 1 year ago

Moved this to the backlog as it's not a priority. Though there may well be a race condition involving DistributedIndividual's birth_id, I suspect it's somewhat mitigated by how we currently use Dask in both the asynchronous and synchronous implementations. However, I will be occasionally assessing run results for evidence of such race conditions, as possibly evidenced by duplicate or missing birth IDs.

lukepmccombs commented 1 year ago

What was the observed behavior of the birth counter issue? I've found two things of note:

If you count the evaluated individuals, you will get init_pop_size + max_births + nonviable. This is because the initial population isn't counted toward the births, which is intentional per the commit message, and nonviable individuals are counted back down (the birth counter is decremented for them).

There was a bug where viable individuals were counted twice, which was probably left over from when you looked into it before.

markcoletti commented 1 year ago

Essentially, I had observed duplicate or skipped IDs and reasoned that this was likely due to a race condition.

Actually, the initial population should have birth IDs automatically assigned when created. It's just that when we fall into the offspring creation loop, we start counting from where we left off with the last ID from the initial population. And then there's the bookkeeping for whether we want non-viable individuals to count toward the birth budget.

lukepmccombs commented 1 year ago

I don't see any duplicate birth IDs, but I think I know what might be causing the skipped IDs. If you use the crossover operator but only pool 1 individual at a time in the pipeline, the other individual is discarded. But it was still "birthed", and the ID counter was incremented for it. A quick test of changing the pool size to 2 resulted in the highest observed ID at the end of the evolution process being 1 less than the total number observed (i.e., the 0-indexed max, as you would get with itertools.count).

markcoletti commented 1 year ago

The other individual during crossover is not discarded, but is rather sent the next time that operator is pulsed. That's why it has two different yield statements. This is actually a significant difference between LEAP and DEAP: DEAP does discard one of the offspring, which introduces a source of genetic drift.
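
For illustration, here's a minimal sketch of that two-yield pattern with plain Python generators (hypothetical names, not LEAP's actual implementation):

def two_yield_crossover(next_individual):
    # Each pass through the loop consumes two parents and yields both
    # resulting children, one per pulse of the operator.
    while True:
        parent1 = next(next_individual)
        parent2 = next(next_individual)
        yield parent1  # first pulse delivers the first child
        yield parent2  # the next pulse delivers its sibling

gen = two_yield_crossover(iter(["a", "b"]))
print(next(gen), next(gen))  # a b -- one instance delivers both children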

lukepmccombs commented 1 year ago

That is the case for the by-generational pipeline, but I don't believe it is so for the steady state pipeline. Each time the algorithm goes to produce offspring, the pipeline is called again, reinstantiating the crossover operator's local frame. If you produce multiple at once, they will all come from the same instance and receive the successive children, but not so between calls.
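
Reusing the toy two_yield_crossover sketch from above (again, not LEAP's actual steady_state internals): if the generator function is called afresh for each offspring, the sibling buffered in the old frame is never pulled:

parents = ["a", "b", "c", "d"]
for i in range(2):
    # A fresh generator per pipeline call, one child pulled per call
    gen = two_yield_crossover(iter(parents[2 * i:]))
    print(next(gen))  # prints "a", then "c"; siblings "b" and "d" are lost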

markcoletti commented 1 year ago

I don't think the crossover operator is re-instantiated between pipeline pulses; the operators are already instantiated when passed in via the offspring_pipeline argument to steady_state. You could probably cons up a quick test to observe what actually happens.

lukepmccombs commented 1 year ago

I made a test script with a stripped down version of uniform_crossover. The number of instantiations increases equally with number of births.

from toolz import curry
from leap_ec.ops import iteriter_op
from typing import Iterator
from distributed import Client

from leap_ec import ops
from leap_ec.representation import Representation
from leap_ec.binary_rep.problems import MaxOnes
from leap_ec.binary_rep.initializers import create_binary_sequence
from leap_ec.distrib.individual import DistributedIndividual

from leap_ec.distrib import asynchronous

MAX_BIRTHS = 1000

instantiation_count = 0

@curry
@iteriter_op
def dummy_crossover(next_individual: Iterator,
                      p_swap: float = 0.2, p_xover: float = 1.0) -> Iterator:
    """
    Structurally identical to uniform_crossover, except all crossover logic was removed.
    Function identity (i.e. params) was kept so curry behaves the same.
    """
    global instantiation_count
    instantiation_count += 1

    while True:
        parent1 = next(next_individual)
        parent2 = next(next_individual)
        yield parent1
        yield parent2

if __name__ == "__main__":
    pipeline = [
        ops.random_selection,
        dummy_crossover,
        ops.pool(size=1)
    ]

    with Client(n_workers=20) as client:
        final_pop = asynchronous.steady_state(
            client, max_births=MAX_BIRTHS, init_pop_size=5, pop_size=3,

            representation=Representation(
                initialize=create_binary_sequence(4),
                individual_cls=DistributedIndividual
            ),

            problem=MaxOnes(),
            offspring_pipeline=pipeline
        )

    print(f"Max Births: {MAX_BIRTHS}")
    print(f"Instantiation Count: {instantiation_count}")
    # Max Births: 1000
    # Instantiation Count: 1000

This is with my local branch that removes the double count on valid individuals.

markcoletti commented 1 year ago

Thanks for the test snippet! That's disappointing. (And fortunate that I don't generally use crossover!). We can discuss on Slack the possible workarounds. This may also have ramifications for other pipeline operators, though I can't think offhand of likely candidates.

lukepmccombs commented 1 year ago

Here is an example higher-order function that could fix it. It keeps children in a local deque before they are yielded. The way yield works prevents it from making excess children (execution only continues when the next child is requested), and on future calls any leftover children are yielded first.

from collections import deque
from typing import Iterator

from leap_ec.ops import iteriter_op


def dummy_crossover():
    next_children = deque()

    @iteriter_op
    def _crossover_op(next_individual) -> Iterator:
        while True:
            # Yield any children left over from a previous call first
            while next_children:
                yield next_children.popleft()
            parent1 = next(next_individual)
            parent2 = next(next_individual)
            # Buffer both children so neither is lost if collection
            # stops between the first and second yield
            next_children.extend((parent1, parent2))

    return _crossover_op
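
As a quick usage sketch (plain strings standing in for individuals, itertools.islice playing the role of a pool of size 1, and assuming iteriter_op passes ordinary iterators through):

from itertools import islice

op = dummy_crossover()
# First call: only one child is collected; its sibling stays in the deque
print(list(islice(op(iter(["a", "b"])), 1)))  # ['a']
# Second call: the leftover sibling is yielded before new parents are drawn
print(list(islice(op(iter(["c", "d"])), 1)))  # ['b']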
SigmaX commented 1 year ago

@lukepmccombs Oh, yikes. So, am I understanding that we may have an issue where our code assumes that toolz.curry() acts like a closure (returning a single instance, so that if it's called multiple times we can yield multiple values), but in reality the operator ends up being re-instantiated every time it is executed?

I just reviewed your PR, but if it's aiming to solve the observation here I'm not sure why the deque is needed. What about a closure that behaves the way the @curry was intended to?

def dummy_crossover():
    # (iteriter_op and Iterator imported as in the earlier snippet)

    @iteriter_op
    def _crossover_op(next_individual) -> Iterator:
        while True:
            parent1 = next(next_individual)
            parent2 = next(next_individual)
            yield parent1
            yield parent2

    return _crossover_op
lukepmccombs commented 1 year ago

The closure is needed to persist next_children's contents between calls. The individuals have to pass through the deque so that they can be caught whenever the generator is abandoned; if we yield the children directly, without them having come from the deque, then if collection ends between the first and second child, the second is lost.
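
A toy demonstration of that failure mode with plain generators (illustrative only):

from itertools import islice

def raw_yield_crossover(next_individual):
    while True:
        parent1 = next(next_individual)
        parent2 = next(next_individual)
        yield parent1  # if collection stops here...
        yield parent2  # ...this child was created but never delivered

op = raw_yield_crossover(iter(["a", "b"]))
print(list(islice(op, 1)))  # ['a'] -- 'b' is lost when the frame is dropped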

Here is another version of the function that may be clearer. (A deque isn't particularly useful here; I thought it worked differently than it does. It is technically still appropriate, but so few items are in the queue at a time that its merits don't show.)

def dummy_crossover():
    # (iteriter_op and Iterator imported as in the earlier snippets)
    second_child = None

    @iteriter_op
    def _crossover_op(next_individual) -> Iterator:
        nonlocal second_child
        if second_child is not None:
            # Return a child left over from a previous run.
            # The swap with None must happen before the yield so it's
            # certain to execute.
            ret_child, second_child = second_child, None
            yield ret_child

        while True:
            parent1 = next(next_individual)
            parent2 = next(next_individual)

            # The second child must be stored before yielding the first child
            # (parent1). Generators only execute code when asked for the next
            # value, so the pair is only created if the first child is about
            # to be yielded; that's why the first child doesn't also need to
            # be stored in the closure.
            second_child = parent2
            yield parent1

            # Remove the second child from the closure and yield it if the
            # generator is asked for it
            ret_child, second_child = second_child, None
            yield ret_child

    return _crossover_op
SigmaX commented 1 year ago

Copying my comment from #273 for summary purposes:

The observation is that when our crossover operators are used specifically in a steady-state population model, we generate a single offspring at a time in each evolutionary step, and so we lose the functionality of generating a pair of complementary offspring for each pair of parents.

So the original operators work as designed: one instance will correctly generate two complementary offspring when you call it twice. The problem is that we only call the pipeline once in each step of a steady-state algorithm, so the second offspring gets created but never yielded (thus inflating our birth count).

lukepmccombs commented 1 year ago

The issue is present in all of our pipelines, just most prevalent in steady state. It occurs whenever the pipeline is made to collect an odd number of individuals, which is almost always 1 in the case of steady state. An individual is birthed by the crossover operator but discarded before it is consumed by a pipeline operator further down. For instance, if you were pooling a population of size 5, like in the neural network cartpole example, 1 in 6 births would be discarded: filling the pool takes three parent pairs, which produce six children, and the sixth dies with the generator's frame.

The operators are called once per pipeline execution: that is, once per evaluation (and subsequent offspring birthing) for steady state, and once per generation for the generational EA. It is possible to bypass this issue in either case by just using an even-numbered pool size, but in the case of steady state that results in an explosion of individuals in queue.

markcoletti commented 1 year ago

This issue is resolved. Closing.