DrChainsaw / NaiveGAflux.jl

Evolve Flux networks from scratch!

Simple example of a regression model #33

Open azev77 opened 4 years ago

azev77 commented 4 years ago

Hi and thank you for this package! I'm having difficulty learning how to apply your package to a regression model with a continuous Y (outcome variable).

Is it possible to show a very simple example of how to use your model on the Boston housing data? For reference I have parsimonious Flux code here.

DrChainsaw commented 4 years ago

Thanks a lot for the interest shown!

I'll try to make an example for the housing data later today or tonight.

There are probably a few missing components for regression (most notably a fitness function), but it should not be too much of a hassle to put something together (at least for someone who knows the package).
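
Roughly, the missing fitness part boils down to subtyping AbstractFitness and implementing NaiveGAflux.fitness for it so that a lower validation MSE maps to a higher fitness. Something along these lines (an untested sketch; RegressionFitness and the 1e-3 floor are just placeholders, the full example below fleshes this out):

using NaiveGAflux, Flux, Statistics

# Sketch: fitness of a model f is the inverse of its mean MSE over a validation iterator
struct RegressionFitness{T} <: AbstractFitness
    dataset::T # iterator over (x, y) batches
end

function NaiveGAflux.fitness(s::RegressionFitness, f)
    mse = mean(Flux.mse(f(x), y) for (x, y) in s.dataset)
    return 1 / max(mse, 1e-3) # floor to avoid blowing up on near-zero MSE
end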

DrChainsaw commented 4 years ago

Hi,

This example is perhaps a bit overworked and could probably have been made simpler.

I guess this library is a bit of overkill for the Boston housing dataset, and the models returned will most likely be overfit.


module FitBostonHousing

using NaiveGAflux
using MLDatasets: BostonHousing
using Random
using Statistics

function fit(x = BostonHousing.features(), y = BostonHousing.targets(); seed=1, popsize=128, nepochs=200)
    Random.seed!(seed)
    train,val= iterators(x,y)
    nfeatures = size(x,1)

    fitnessmetric = mse_vs_size(val)

    population = initialpopulation(popsize, nfeatures, fitnessmetric)

    evolution = evolution_strategy(popsize)

    for epoch in 1:nepochs
        @info "Begin epoch $epoch"

        for (i, candidate) in enumerate(population)
            Flux.train!(candidate, train)
            @info "\tFitness candidate $i : $(fitness(candidate))"
        end

        population = evolve!(evolution, population)
    end

    return population

end

function initialpopulation(popsize, inputsize, fitnessmetric)
    # Simple search space of 1-4 Dense layers with loguniform sizes
    layerspace = VertexSpace(DenseSpace(2 .^ (1:6), [relu, elu, selu]))
    initial_hidden = RepeatArchSpace(layerspace, 0:3)
    # Output layer has fixed size and is shielded from mutation
    outlayer = VertexSpace(Shielded(), DenseSpace(1, identity))
    initial_space = ListArchSpace(initial_hidden, outlayer)
    return  [create_candidate(initial_space, inputsize, fitnessmetric) for _ in 1:popsize]
end

function create_candidate(searchspace, inputsize, fitnessmetric)
    input = inputvertex("input", inputsize, FluxDense())
    model = CompGraph(input, searchspace(input))
    opt = Descent(0.01) # Due to small batch size
    return CacheCandidate(CandidateModel(model, opt, Flux.mse, NanGuard(fitnessmetric)))
end

function evolution_strategy(popsize, nelites=2)
    elite = EliteSelection(nelites)

    mutate = mutation_strategy()
    evolve = SusSelection(popsize - nelites, mutate)

    combine = CombinedEvolution(elite, evolve)
    return ResetAfterEvolution(combine)
end

function mutation_strategy()
    acts = [relu, elu, selu]

    # Shorthand for applying a mutation to each vertex with probability p
    mp(p, m) = VertexMutation(MutationProbability(m, p))

    # Mutate number of neurons +/- 10% with 10% probability
    mut_size = mp(0.1, NeuronSelectMutation(NoutMutation(-0.1, 0.1)))
    # Add a layer similar to the ones in the initial search space with 5% probability
    add_layer = mp(0.05, AddVertexMutation(VertexSpace(DenseSpace(2 .^ (1:6), acts))))
    # Remove a layer with 5% probability
    rem_layer = mp(0.05, RemoveVertexMutation())
    # Mutate the activation function with 5% probability
    mut_actfun = mp(0.05, ActivationFunctionMutation(acts))

    return EvolveCandidates(evolvemodel(MutationList(rem_layer, PostMutation(mut_size, NeuronSelect()), mut_actfun, add_layer)))
end

function iterators(x,y, split = [0.9, 0.1], seed = 0)
    s1,s2= cumsum(floor.(Int, split ./ sum(split) .* size(x,2)))

    xs = standardize(x)

    trainitr = iterator(slice((xs,y), 1:s1)...)
    valitr = iterator(slice((xs,y), s1:s2)...)

    return trainitr,valitr
end

standardize(x) = (x .- mean(x)) ./ std(x)

slice(data::Tuple, inds) = slice.(data, [inds])
slice(data::Vector, inds) = data[inds]
slice(data::Matrix, inds) = data[:, inds]

iterator(x,y) = zip(BatchIterator(x, 8), BatchIterator(y, 8))

struct MseFitness{T} <: AbstractFitness
    dataset::T
end
function NaiveGAflux.fitness(s::MseFitness, f)
    acc,cnt = 0, 0
    for (x,y) in s.dataset
        mse = Flux.mse(f(x), y)
        acc += sum(mse)
        cnt += length(mse)
    end
    # Small mse => high fitness
    return cnt / max(acc, 0.1)
end

# Fitness function. Compare MSE on the data provided by itr for each model.
# If MSE (rounded to three digits) is equal between two candidates, take the one with fewer parameters
function mse_vs_size(itr)
    msedigits = 3
    msefitness = MseFitness(itr)
    truncacc = MapFitness(x -> round(100*x, digits=msedigits), msefitness)

    size = SizeFitness()
    sizefit = MapFitness(x -> min(10.0^-(msedigits+1), 1 / 10000x), size) # 10000x because models are typically very small

    return AggFitness(+, truncacc, sizefit)
end

end

To run:

julia> include("FitBostonHousing.jl")

julia> models = FitBostonHousing.fit(seed=0);
azev77 commented 4 years ago

Thanks! I'm gonna have to study your code. I wanna make sure I understand your package a bit better:

1. Is this intended to be a self-tuning package for Flux (like Auto Keras, Auto Pytorch, ...)?
2. Is FluxOptTools only about finding a minimum, whereas your package is about HP tuning?

DrChainsaw commented 4 years ago

No problem, just ask if something is unclear about the code. I have the intention to wrap some of the more useful compositions in convenience methods once I have a better idea of what they are :)

Btw, I have not tested whether it produces useful models or not, just that it runs and that the fitness seems to converge.

As for questions:

  1. It's something like those w.r.t. what it does, but I don't think I have the manpower/hours to make it equivalent in terms of performance. For now it is more of a sandbox for people who want to play with evolutionary neural architecture search.

One motivation I had for making this package was that I kept reading everywhere that the architecture search space is so large that it makes no sense to try to search it, but I never saw anyone try. It is ofc extremely large if you don't constrain it, but otoh one of the most appealing things about neural networks is that they are not super sensitive to the architecture hyperparameters as long as you are in the right ballpark. I guess one can say I made it to satisfy my curiosity on that subject. Just as a disclaimer: this package does not really allow for a completely unconstrained search space, but I think it allows for much larger degrees of freedom than I have seen in any of the python packages (e.g. Auto Keras).

It is ofc fully possible to reimplement some paper showing good results with a certain search space using this package as a dependency, but it is not something I'm pursuing right now.

As for 2: I don't know much about FluxOptTools but from looking at the readme I get the same impression as you: It is about using a more sophisticated type of optimizer for training the parameters of the model, not for finding the model architecture.

azev77 commented 4 years ago

@DrChainsaw I tried running your FitBostonHousing script but it doesn't seem to work anymore. At least a few things have changed including ListArchSpace(initial_hidden, outlayer)...

DrChainsaw commented 4 years ago

Yeah, I have made a few breaking changes since then.

Here is an updated version which works with version 0.5.1.

At the expense of brevity I added a few more operations. I again want to restate for the record that this is most likely not the best way to fit the Boston housing dataset; it is just a demonstration of a fair portion of the knobs in NaiveGAflux :)

Let me know if it happens to work well for you and I can probably add a more generic version to the package itself.

module FitBostonHousing

using NaiveGAflux
using MLDatasets: BostonHousing
using Random
using Statistics

function fit(x = BostonHousing.features(), y = BostonHousing.targets(); seed=1, popsize=128, nepochs=200)
    Random.seed!(seed)
    train,val= iterators(x,y)
    nfeatures = size(x,1)

    fitnessmetric = mse_vs_size(val)

    population = initialpopulation(popsize, nfeatures, fitnessmetric)

    evolution = evolution_strategy(popsize)

    for epoch in 1:nepochs
        @info "Begin epoch $epoch"

        for (i, candidate) in enumerate(population)
            Flux.train!(candidate, train)
            #trainerror = sqrt(1 / fitness(MseFitness(train), NaiveGAflux.graph(candidate)))
            # Measure validation error separately as candidate fitness has other components
            valerror = sqrt(1 / fitness(MseFitness(val), NaiveGAflux.graph(candidate)))
            @info "\tcandidate $i val error: $valerror"
        end

        if epoch != nepochs
            population = evolve!(evolution, population)
        end
    end

    return NaiveGAflux.graph.(population)

end

function initialpopulation(popsize, inputsize, fitnessmetric)
    # Simple search space of 1-4 Dense layers with loguniform sizes
    layerspace = VertexSpace(DenseSpace(2 .^ (1:6), [relu, elu, selu]))
    initial_hidden = RepeatArchSpace(layerspace, 0:3)
    # Output layer has fixed size and is shielded from mutation
    outlayer = VertexSpace(Shielded(), DenseSpace(1, identity))
    initial_space = ArchSpaceChain(initial_hidden, outlayer)
    return  [create_candidate(initial_space, inputsize, fitnessmetric) for _ in 1:popsize]
end

function create_candidate(searchspace, inputsize, fitnessmetric)
    input = inputvertex("input", inputsize, FluxDense())
    model = CompGraph(input, searchspace(input))
    opt = ADAM(0.01) # Due to small batch size
    return CacheCandidate(CandidateModel(model, opt, Flux.mse, NanGuard(fitnessmetric)))
end

function evolution_strategy(popsize, nelites=2)
    # The nelites best candidates will not be changed, just trained some more
    elite = EliteSelection(nelites)

    # Select popsize-nelites candidates to crossover and mutate using tournament selection with tournament size 2
    evolve = TournamentSelection(popsize - nelites, 2, 1.0, crossovermutate())

    # This will just combine the elites and the evolved population back into one population with size popsize
    combine = CombinedEvolution(elite, evolve)
    return ResetAfterEvolution(combine)
end

function crossovermutate(;pcrossover=0.3, pmutate=0.9)
    # Crossover of both models and the optimizer used to fit them
    cross = candidatecrossover(pcrossover)
    crossoverevo = PairCandidates(EvolveCandidates(cross))

    # Mutation of both models and the optimizer used to fit them
    mutate = candidatemutation(pmutate)
    mutationevo = EvolveCandidates(mutate)

    return EvolutionChain(crossoverevo, mutationevo)
end

# Crossover of models (graphs) and optimizers
candidatecrossover(p) = evolvemodel(MutationProbability(graphcrossover(), p), optcrossover())
# Mutation of models (graphs) and optimizers
candidatemutation(p) = evolvemodel(MutationProbability(graphmutation(), p), optmutation())

function optcrossover(poptswap=0.3, plrswap=0.4)
    # Swap learning rates between two optimizers with probability plrswap
    lrc = MutationProbability(LearningRateCrossover(), plrswap) |> OptimizerCrossover
    # Swap two whole optimizers
    oc = MutationProbability(OptimizerCrossover(), poptswap) |> OptimizerCrossover
    return MutationChain(lrc, oc)
end

function optmutation(p=0.05)
    # Mutate learning rate of an optimizer
    lrm = LearningRateMutation()
    # Change optimizer type and keep learning rate with probability p
    om = MutationProbability(OptimizerMutation([Descent, Momentum, Nesterov, ADAM, NADAM, ADAGrad]), p)
    return MutationChain(lrm, om)
end

function graphcrossover()
    # Swap a sequence of layers from one graph to another
    vertexswap = CrossoverSwap(0.05)
    # For each layer in a graph swap it with another graph with a 5% probability
    return VertexCrossover(MutationProbability(vertexswap, 0.05), 0.05)
end

function graphmutation()
    acts = [relu, elu, selu]

    # Shorthand for applying a mutation to each vertex with probability p
    mp(p, m) = VertexMutation(MutationProbability(m, p))

    # Mutate number of neurons +/- 10% with 10% probability
    mut_size = mp(0.1, NeuronSelectMutation(NoutMutation(-0.1, 0.1)))
    # Add a layer similar to the ones in the initial search space with 5% probability
    add_layer = mp(0.05, AddVertexMutation(VertexSpace(DenseSpace(2 .^ (1:6), acts))))
    # Remove a layer with 5% probability
    rem_layer = mp(0.05, RemoveVertexMutation())
    # Mutate the activation function with 5% probability
    mut_actfun = mp(0.05, ActivationFunctionMutation(acts))

    return MutationChain(rem_layer, PostMutation(mut_size, neuronselect), mut_actfun, add_layer)
end

# Should ideally be handled by MLDataUtils or similar package
function iterators(x,y, split = [0.8, 0.2], seed = 0)
    s1,s2= cumsum(floor.(Int, split ./ sum(split) .* size(x,2)))

    xs = standardize(x)

    trainitr = iterator(slice((xs,y), 1:s1)...)
    valitr = iterator(slice((xs,y), s1:s2)...)

    return trainitr,valitr
end

standardize(x) = (x .- mean(x; dims=2)) ./ std(x; dims=2)

slice(data::Tuple, inds) = slice.(data, [inds])
slice(data::Vector, inds) = data[inds]
slice(data::Matrix, inds) = data[:, inds]

iterator(x,y) = zip(BatchIterator(x, 8), BatchIterator(y, 8))

struct MseFitness{T} <: AbstractFitness
    dataset::T
end
function NaiveGAflux.fitness(s::MseFitness, f)
    acc,cnt = 0, 0
    for (x,y) in s.dataset
        mse = (f(x) - y) .^ 2
        acc += sum(mse)
        cnt += length(mse)
    end
    # Small mse => high fitness
    return cnt / max(acc, 0.1)
end

# Fitness function. Compare MSE on the data provided by itr for each model.
# If MSE (rounded to three digits) is equal between two candidates, take the one with fewer parameters
function mse_vs_size(itr)
    msedigits = 3
    # Training is very noisy. Use EWMA filtering to smooth it out
    msefitness = EwmaFitness(0.1, MseFitness(itr))
    truncacc = MapFitness(x -> round(100*x, digits=msedigits), msefitness)

    size = SizeFitness()
    # Size fitness will just return the number of parameters as the fitness value.
    # We want a low number of parameters to be indicative of good fitness
    sizefit = MapFitness(x -> min(10.0^-(msedigits+1), 1 / 10000x), size)

    return AggFitness(+, truncacc, sizefit)
end

end
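
To run it is the same as before:

julia> include("FitBostonHousing.jl")

julia> models = FitBostonHousing.fit(seed=0);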
azev77 commented 4 years ago

@DrChainsaw Thank you! It works now. After fit how can I predict from the "best" model on new data?

models = fit(seed=0,popsize=5, nepochs=2)
ŷ = predict(models[1], X_test)
score = mape(ŷ, y_test)
DrChainsaw commented 4 years ago

Oops, forgot that part :)

julia> models = NaiveGAflux.graph.(models);

julia> using MLDatasets: BostonHousing

julia> x = FitBostonHousing.standardize(BostonHousing.features()); # This just happens to be what is done in the code above. In a real application you might want to do this differently to avoid train->test leakage (see the sketch at the end of this comment).

julia> models[1](x)
1×506 Array{Float32,2}:
 27.3557  22.6819  30.6625  27.8783  27.6214  22.448  18.9768  …  18.0694  24.9557  22.2386  29.7275  27.891  22.0147

julia> mapreduce(model -> model(x), +, models[1:10]) ./ 10 # Why not use a few of the models in an ensemble
1×506 Array{Float32,2}:
 24.0295  19.8521  26.1882  24.6596  24.2919  20.5495  16.7767  …  15.3688  20.7122  18.6924  23.9985  22.6324  18.4841

I have edited the example above to return the actual models instead of the population of candidates (where a candidate holds the model as well as the optimizer, loss and fitness function).
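
As a side note on the leakage caveat in the snippet above: a leakage-free variant would fit the standardization statistics on the training columns only and then reuse them for any new data, roughly like this (just a sketch; x_train is a placeholder for whatever features × observations matrix you trained on):

using Statistics

# Fit the statistics on the training data only
μ = mean(x_train; dims=2) # x_train: placeholder for the training feature matrix
σ = std(x_train; dims=2)

# Reuse the training statistics for validation/test data to avoid train->test leakage
standardize_as_train(x) = (x .- μ) ./ σ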

azev77 commented 4 years ago

Thanks, it works now. I'm gonna have to tune it before it works well...

DrChainsaw commented 4 years ago

Great! Please file another issue (or add to this one) if you find anything which could be improved.