Closed SigmaX closed 1 year ago
A couple ways I could go here:

1. Add a `CooperativeEvaluate.grouped()` function that works like `CooperativeEvaluate.__call__()`, but on chunks instead of single individuals.
2. Refactor coevolution to use a special `Problem` wrapper instead of its own evaluation operator. This way grouped evaluation would work the same as anywhere else—by choosing the standard evaluation operator. Perhaps the cleaner option?
To summarize:

- Coevolution currently works by using a special `ops.CooperativeEvaluate()` operator. It's an `iteriter_op` (takes an iterator, returns an iterator; works on one individual at a time).
- Grouped evaluation currently works by using a special `ops.grouped_evaluate()` operator. It's a `listlist_op`; i.e., it takes a list and returns a list—which is what allows it to operate on multiple individuals simultaneously (akin to, say, truncation selection).
- Grouped evaluation also relies on the `Problem` offering an `evaluate_multiple()` function (the default just calls `evaluate()` in a for loop, but we write special ones to send groups of individuals to a GPU for parallel evaluation).
- We can't really do grouped evaluation with an `iteriter_op`. So we definitely need a new operator here.
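To make the division of labor concrete, here is a toy sketch (illustrative classes only, not LEAP's actual API) of how a default `evaluate_multiple()` just loops over `evaluate()`, while a subclass can override it for batched evaluation:

```python
class ToyProblem:
    """Toy stand-in for a Problem class (illustration only, not LEAP's API)."""

    def evaluate(self, phenome):
        # Score a single phenome (here: just the sum of its values).
        return sum(phenome)

    def evaluate_multiple(self, phenomes):
        # Default behavior: evaluate sequentially, one phenome at a time.
        return [self.evaluate(p) for p in phenomes]


class BatchedToyProblem(ToyProblem):
    """A subclass that overrides evaluate_multiple() to process the whole
    batch at once (the real motivation being, e.g., a single GPU call)."""

    def evaluate_multiple(self, phenomes):
        # One batched operation instead of a Python-level loop.
        return [sum(p) for p in phenomes]
```

The two classes return identical fitnesses; the override only changes *how* the batch is processed, which is exactly the hook grouped evaluation exploits.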
Idea: how about a straightforward `listlist_op` version of `CooperativeEvaluate`? It would be natural to have it call `grouped_evaluate()` as a subroutine, allowing grouped evaluation logic to be enabled by custom `Problem` implementations. This would be approach (1) mentioned in my previous comment.
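A rough sketch of what such a `listlist_op`-style operator could look like. The helper names (`choose_collaborators`, `combine`) are hypothetical placeholders, not real LEAP functions:

```python
def grouped_cooperative_evaluate(population, problem,
                                 choose_collaborators, combine):
    """Sketch of a list-in/list-out cooperative evaluation operator.

    For each partial individual, pick collaborators from the population,
    combine them into full solutions, then hand the entire batch to the
    problem's evaluate_multiple() so grouped (e.g. GPU) evaluation applies.
    """
    combined = [combine(ind, choose_collaborators(ind, population))
                for ind in population]
    # A single batched call, rather than one evaluate() per individual
    fitnesses = problem.evaluate_multiple(combined)
    for ind, fitness in zip(population, fitnesses):
        ind.fitness = fitness
    return population
```

Because the whole population is visible at once, the operator can assemble all the collaborations before making a single `evaluate_multiple()` call.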
My alternative idea (2) was a `Problem` wrapper—say, `CooperativeProblem`. To make `Problem` responsible for coevolutionary logic, we would need to give it an interface such that you can hand it a partial individual, tell it which subpopulation that partial solution belongs to, and also give it access to the current population as a whole so it can go find collaborators to construct full solutions.

I think the way to do this, while respecting the `Problem` interface (which takes just a `phenome` as input to its `evaluate()` method, no other arguments), is to tell the `CooperativeProblem` at construction time which subpopulation it will receive partial solutions from.
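As a sketch (hypothetical constructor and naive collaborator choice; the real class would differ in detail), the construction-time wiring might look like:

```python
class CooperativeProblemSketch:
    """Illustrative sketch of the idea above: a Problem wrapper configured
    at construction time with the subpopulation it serves, plus a context
    through which the EA exposes all current subpopulations."""

    def __init__(self, wrapped_score, subpop_index, context):
        self.wrapped_score = wrapped_score  # scores a *full* solution
        self.subpop_index = subpop_index    # subpop we receive partial solutions from
        self.context = context              # shared state holding all subpopulations

    def evaluate(self, partial_solution):
        # Collect one collaborator from each *other* subpopulation
        # (naively the first individual; real logic would select properly).
        subpops = self.context['subpopulations']
        collaborators = [subpop[0] for i, subpop in enumerate(subpops)
                         if i != self.subpop_index]
        # Build a full solution and score it with the wrapped function.
        full = list(partial_solution)
        for collab in collaborators:
            full.extend(collab)
        return self.wrapped_score(full)
```

Note that `evaluate()` still takes only a phenome, as the `Problem` interface requires; everything else is supplied at construction time.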
This suggests an arguably elegant (or at least intuitive) view of cooperative coevolution: we have several subpopulations, each of which has its own fitness function (a `CooperativeProblem` instance), configured specifically for that subproblem. It just happens that these fitness functions are functions of the other populations. Otherwise, it behaves much like, say, a heterogeneous island model (sans migration).
To my surprise, I actually like idea (2). To implement it... let's see...

- Our island model examples (such as the `multitask_island_model.py` example) use a "customs stamp" function during migration to update an individual's reference to its `Problem` when they join a new deme. Here, instead, each subpopulation would be configured with its own `CooperativeProblem` at initialization time. `multi_population_ea` already supports this.
- `CooperativeProblem` will need to know how to access the other subpopulations while the EA is running. We already solved this for `CooperativeEvaluate`, though (which has the same need!), by having `multi_population_ea` place a reference to the full population in the `context` object. So the logic here really is no different.
- Circling back: besides an elegant view of coevolution as "multiple sub-populations with their own (interdependent) fitness functions," what does this buy us? Grouped evaluation can be supported via an `evaluate_multiple()` method on `CooperativeProblem`. So it's enabled by `ops.grouped_evaluate()`, just like any other grouped evaluation application.

tl;dr:
(1) is definitely simpler and meets my immediate need.
(2) is not that complicated, and has an arguable elegance about it. Hmm.
Complication:

I started implementing (2). It's mostly a straightforward refactor, converting our existing `CooperativeEvaluate` operator into a new `CooperativeProblem` class that contains the same logic.

But a `Problem` takes a phenome as input. In coevolution, we typically want to combine genomes. (This has me realizing that one might want to do either: combine at the genotypic level, or at the phenotypic level.)

The problem is that, if we only support one, genotypic recombination is the most important and standard choice. But I'm not sure this is possible with our `Problem`, since it doesn't take a genome.
A fitness function that takes `(partial_genome, population)` pairs is far more intuitive.

Options:

- Require an `IdentityDecoder`, but that can't be used with, say, our genetic programming or neural network representations.

Picking this back up after a detour in #191.
Third way followed: #191 refactors the `Problem` interface to take an `Individual` instead of just its phenome. This gives me the flexibility to implement a coevolutionary `Problem` that can combine individuals however I want.
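With the #191-style interface, a sketch of genotypic combination becomes straightforward (hypothetical class and hook names; only the idea of `evaluate()` receiving an `Individual` comes from #191):

```python
class GenomeCombiningProblemSketch:
    """Sketch: since evaluate() now receives an individual rather than a
    bare phenome, we can combine at the *genotypic* level before scoring."""

    def __init__(self, score_full_genome, get_collaborators):
        self.score_full_genome = score_full_genome  # scores a combined genome
        self.get_collaborators = get_collaborators  # hypothetical collaborator hook

    def evaluate(self, individual):
        # Concatenate this individual's genome with its collaborators' genomes.
        full_genome = list(individual.genome)
        for collab in self.get_collaborators(individual):
            full_genome.extend(collab.genome)
        return self.score_full_genome(full_genome)
```

The same structure would also allow phenotypic combination: decode each collaborator first and combine phenomes instead of genomes.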
Implementation complete and tests/example are passing.
I just want to make sure the resulting algorithm behaves the same as the old one before merging and closing this issue.
Collecting some data for a regression test:
```bash
for i in $(seq 0 99); do
    echo ${i};
    python ../examples/advanced/coevolution_via_fitness_functions.py > coevolution_via_problem_run${i}.csv;
done
```
And the old version:
```bash
for i in $(seq 0 99); do
    echo ${i};
    python ../examples/advanced/coevolution.py > coevolution_via_operator_run${i}.csv;
done
```
Interestingly, the `Problem`-based implementation appears to run much faster than the `CooperativeEvaluate` operator implementation. I'm not sure why that is.

Behavior checks out: the new coevolution behaves like the old one in terms of mean fitness in each subpopulation.
Script I used to analyze the data:
```bash
%%bash
mkdir -p preprocessed/
for f in *.csv; do
    cat ${f} \
        | sed -E 's/\[|\]//g' \
        | sed 's/subpop_bsf/subpop_0, subpop_1, subpop_2, subpop_3/g' \
        > preprocessed/${f}
done
```
```python
from glob import glob
import re

from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns

plt.style.use('ggplot')


##### Load the data

def get_runs(version: str):
    """Load all of the files for our single-task runs into a single dataframe."""

    def load_file(f):
        """Load a single file into a dataframe."""
        df = pd.read_csv(f, skipinitialspace=True, comment='#')
        # Get the job id from the file name
        job_finds = re.findall('_run([0-9]*).csv', f)
        assert(len(job_finds) == 1)
        job = job_finds[0]
        df['job'] = job
        # Record which implementation produced this run
        df['version'] = version
        return df

    # One file per *run* (containing all tasks)
    pattern = f"preprocessed/coevolution_via_{version}_run*.csv"
    files = glob(pattern)
    assert(len(files) > 0), f"No files found for pattern '{pattern}'."

    dfs = [ load_file(f) for f in files ]
    df = pd.concat(dfs)
    #assert(len(df) == 100*2001), f"Got {len(df)} rows total, but expected {100*2000}."
    #assert(len(df.job.unique()) == 100)
    assert(len(df.generation.unique()) == 2001), f"Expected {2001} different generations, but got {len(df.generation.unique())}: {df.generation.unique()}."
    return df.reset_index(drop=True)

# Example
#df = get_runs('problem')
#df

# Wide to long
df = pd.concat([get_runs('problem'), get_runs('operator')]).reset_index()
df = pd.melt(df, id_vars=['job', 'generation', 'version'],
             value_vars=['subpop_0', 'subpop_1', 'subpop_2', 'subpop_3'])

# Plot
plt.figure(figsize=(12, 8))
sns.lineplot(data=df[df.generation < 50],
             x='generation',
             y='value',
             hue='version',
             style='variable')
#plt.ylim(10, 20)
plt.yscale('log')
```
Reopening because Kexin encountered two issues:

1. The `CooperativeProblem` class seems to inherit the `evaluate_multiple` function from the `Problem` class, which evaluates a group of individuals sequentially rather than in parallel.
2. When I tried to add a log stream to the `coevolution_via_fitness_functions.py` example, I encountered the error below. It seems like the `_log_trial` function is expecting `all_collaborators` to be a list of `Individual` objects, but the function `_choose_collaborators` returns a list of genomes, which causes a mismatch.
```
File "/home/kexin/LEAP/leap_ec/problem.py", line 472, in evaluate
    self._log_trial(
File "/home/kexin/LEAP/leap_ec/problem.py", line 516, in _log_trial
    'genome' : collab.genome,
AttributeError: 'numpy.ndarray' object has no attribute 'genome'
```
Looking at (1): should be easy to fix. In Kexin's application, `CooperativeProblem.wrapped_problem` is an instance of `ExternalProcessProblem` (which interfaces with CARLsim). I just need to write a `CooperativeProblem.evaluate_multiple()` function that collects combined phenomes for all individuals in a subpopulation at once (using the same logic as `CooperativeProblem.evaluate()`).
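In sketch form (hypothetical class and hook names; the real logic would mirror `CooperativeProblem.evaluate()`), the override would batch the combined phenomes and make a single call to the wrapped problem:

```python
class GroupedCooperativeSketch:
    """Sketch of the planned evaluate_multiple() override: build the
    combined phenome for every individual in the subpopulation, then make
    one batched call so the wrapped problem can evaluate in parallel."""

    def __init__(self, wrapped_problem, combine_with_collaborators):
        self.wrapped_problem = wrapped_problem
        # Hypothetical hook: maps one partial phenome to a full phenome
        self.combine_with_collaborators = combine_with_collaborators

    def evaluate_multiple(self, phenomes):
        combined = [self.combine_with_collaborators(p) for p in phenomes]
        # One batched call instead of a Python-level loop over evaluate()
        return self.wrapped_problem.evaluate_multiple(combined)
```

This keeps the sequential default available as a fallback while letting a batched backend (like the CARLsim external process) receive the whole subpopulation at once.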
@kexinchenn identified another bug: it seems that individuals are not correctly being assigned fitness values.

In both `examples/advanced/coevolution.py` and `examples/advanced/coevolution_via_fitness_functions.py`, if I instrument the `ops.random_selection` operator to print out the fitnesses of collaborators at the moment they are selected, they all have the initial arbitrary fitness value of -100:
```
Chose individual [1 0 0] -100, fitness: -100
Chose individual [1 0 1 1] -100, fitness: -100
Chose individual [0 0 0 1 0] -100, fitness: -100
Chose individual [1 1 0] -100, fitness: -100
Chose individual [1 1 1 1] -100, fitness: -100
Chose individual [0 0 0 0 1] -100, fitness: -100
Chose individual [0 1 0] -100, fitness: -100
Chose individual [1 0 1 1] -100, fitness: -100
Chose individual [0 1 0 1 0] -100, fitness: -100
Chose individual [1 1 1] -100, fitness: -100
Chose individual [1 0 1 0] -100, fitness: -100
Chose individual [0 0 1 1 1] -100, fitness: -100
Chose individual [1 0 0] -100, fitness: -100
Chose individual [0 0 1 0] -100, fitness: -100
Chose individual [1 1 1 1 1] -100, fitness: -100
Chose individual [1 0 0] -100, fitness: -100
Chose individual [0 1 0 0] -100, fitness: -100
Chose individual [0 1 0 1 0] -100, fitness: -100
Chose individual [0 1 0] -100, fitness: -100
Chose individual [1 1 0 0] -100, fitness: -100
Chose individual [1 0 1 1 0] -100, fitness: -100
Chose individual [1 1 1] -100, fitness: -100
Chose individual [0 0 1 1] -100, fitness: -100
Chose individual [0 1 0 1 0] -100, fitness: -100
14, [13, 13.666666666666666, 15, 14.333333333333334]
```
The last line is the generation boundary, and we do see normal fitness values there.

This suggests that fitnesses for combined individuals are perhaps being calculated correctly, but fitnesses for partial solutions within each subpopulation are not being assigned...
Debugging.
I'm seeing fitness values in the subpopulation updated correctly at the end of each generation (governed by line 308 in the following, which is in the main loop of `multi_population_ea()`):
But when we drill down into `CooperativeEvaluate`, at the moment where it looks at `context` to grab a reference to the subpopulations, the fitness values are all `-100` again:

So it seems that something is happening that resets the fitnesses (or the references to the subpops?) between the generation boundary and when we run the `CooperativeEvaluate` operator...
So, still looking at `coevolution.py`, the pipeline is

```python
# Operator pipeline
shared_pipeline=[
    ops.tournament_selection,
    ops.clone,
    mutate_bitflip(expected_num_mutations=1),
    ops.CooperativeEvaluate(
        num_trials=3,
        collaborator_selector=ops.random_selection,
        log_stream=log_stream),
    ops.pool(size=pop_size)
]
```
The only smoke I can find is that `clone()` always resets fitness. But it sets things to `None`, not `-100`, and it affects only the subpopulation currently being processed (which is fine!).

It seems as if the collaborator selection operator in `CooperativeEvaluate` is being bound to the original, initial population instead of the updated population from `context`. I can't yet see where that initial population could be being copied and kept, however.
Debugging crumb: the population found in `context` has all `-100` fitnesses even at the time the `tournament_selection` operator is executed.
Found it. Stupid bug in `multi_population_ea`. We have a `pops` variable that is supposed to point to the same thing as `context`, but when evaluating the initial population we overwrite the reference—from that point on, the two references point to different lists.

The fix is simple (just reorder the lines of code).
Writing a unit test to protect against regression will take some thought.
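Stripped of LEAP specifics, the bug is a plain Python aliasing mistake; a minimal reproduction of the pattern (and of the line-reordering fix) looks like this:

```python
def buggy():
    """Publish the reference first, then rebind it: aliasing is lost."""
    context = {}
    pops = [['a'], ['b']]
    context['pops'] = pops                      # both names share one list
    pops = [p + ['evaluated'] for p in pops]    # rebinds pops to a NEW list
    return pops, context['pops']                # now two different lists


def fixed():
    """Rebind first, then publish: context sees the list we keep using."""
    context = {}
    pops = [['a'], ['b']]
    pops = [p + ['evaluated'] for p in pops]
    context['pops'] = pops
    return pops, context['pops']
```

In the buggy version, everything written through `pops` after the rebinding is invisible through `context`, which matches the stale `-100` fitnesses observed above.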
The grouped evaluation mechanism implemented in #123 works great, but coevolution (`CooperativeEvaluate`) doesn't make use of it. I need coevolution + grouped evaluation for an application involving populations that are evaluated in parallel on a GPU.

Tweak it so it does, or rather, can when requested.