MilesCranmer / SymbolicRegression.jl

Distributed High-Performance Symbolic Regression in Julia
https://astroautomata.com/SymbolicRegression.jl/dev/
Apache License 2.0

Classification on MNIST #141

Open zsz00 opened 1 year ago

zsz00 commented 1 year ago

I tried to use SymbolicRegression.jl for MNIST classification training, but the results were not good. I hope you can help me see where I need to improve. MNIST consists of 28*28 images with 10 class labels.

using SymbolicRegression
using SymbolicUtils
import MLDatasets: MNIST
import MLUtils: splitobs

function loadmnist(batchsize, train_split)  # batchsize is accepted but unused below
    ## Load MNIST
    N = 60000  # 5000 
    imgs = MNIST.traintensor(1:N)
    labels_raw = MNIST.trainlabels(1:N)

    ## Process images
    x_data = Float32.(reshape(imgs, size(imgs, 1), size(imgs, 2), 1, size(imgs, 3)))
    y_data = labels_raw   # onehot(labels_raw)
    (x_train, y_train), (x_test, y_test) = splitobs((x_data, y_data); at=train_split)
    return (x_train, y_train), (x_test, y_test)
end

function train()
    batchsize, train_split = 128, 0.9
    (x_train, y_train), (x_test, y_test) = loadmnist(batchsize, train_split)

    println(size(x_train))
    options = SymbolicRegression.Options(
                    binary_operators=(+, *, /, -),
                    unary_operators=(cos, sin, exp),
                    npopulations=50,
                    batching=true,
                    batchSize=100,
                    # loss=LogitMarginLoss()
                    )
    x_train = reshape(x_train, 784, size(x_train)[end])  # flatten each 28×28 image into 784 features
    y_train = convert(Vector{Float32}, y_train)           # SR expects a numeric target vector

    hall_of_fame = EquationSearch(x_train, y_train, niterations=50, options=options, numprocs=8)

    dominating = calculate_pareto_frontier(x_train, y_train, hall_of_fame, options)
    eqn = node_to_symbolic(dominating[end].tree, options)
    println(simplify(eqn))  # transform/simplify the formula
end

train()
With N = 5000 and batching=false, the output is:

Complexity  Loss       Score         Equation
18          4.715e+00  8.670e-03  ((((sin(x264 + x320) - -1.9941229) * (1.5529515 - (x484 - sin(sin(x437))))) + x355) - x599)      
19          4.703e+00  2.455e-03  ((((sin(x264 + x320) - -1.9941229) * (1.5095575 - (sin(x484) - sin(sin(x437))))) + x355) - x599) 
20          4.665e+00  8.302e-03  (((((sin(x264 + x320) - -1.9941229) - x509) * (1.5529515 - (x484 - sin(sin(x437))))) + x355) - x599)

With N = 60000 and batching=true, the search is very slow and gives worse results.
MilesCranmer commented 1 year ago

MNIST is a high-dimensional dataset, where pure symbolic regression is going to do quite poorly due to the combinatoric scaling. What you can try instead is something like described in https://arxiv.org/abs/2006.11287 (see interactive example of this at the end of https://colab.research.google.com/github/MilesCranmer/PySR/blob/master/examples/pysr_demo.ipynb).

Basically, write down a neural network like $$\text{classification} = \mathrm{MLP}_1\left(\sum_i \mathrm{MLP}_2(\text{patch}_i)\right),$$

where $\text{patch}_i$ is a patch of pixels (maybe give it 9 pixels, i.e., a $3\times 3$ patch?). Once you have trained this, try to fit SR to $\mathrm{MLP}_2$ and $\mathrm{MLP}_1$ independently. Finally, arrange the fitted expressions in the same functional form.
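
For concreteness, a minimal Flux sketch of this architecture might look like the following (the patch size, layer widths, and latent dimension here are arbitrary illustrative choices, not prescriptions):

using Flux

# Split a 28×28 image into non-overlapping p×p patches, each flattened to a vector.
# (4×4 tiles 28 evenly; for the 3×3 patches suggested above, the last row/column
# of pixels would be dropped since 28 is not divisible by 3.)
patches(x::AbstractMatrix, p::Int=4) =
    [vec(x[i:i+p-1, j:j+p-1]) for i in 1:p:size(x, 1)-p+1, j in 1:p:size(x, 2)-p+1]

latent = 8                                                  # size of MLP2's output (arbitrary)
mlp2 = Chain(Dense(16 => 32, relu), Dense(32 => latent))    # per-patch encoder
mlp1 = Chain(Dense(latent => 32, relu), Dense(32 => 10))    # classifier on the pooled sum

# classification = MLP1(Σᵢ MLP2(patchᵢ))
model(x) = mlp1(sum(mlp2, patches(x)))

# Train jointly with a cross-entropy loss on the logits:
loss(x, y) = Flux.logitcrossentropy(model(x), Flux.onehot(y, 0:9))

After training, record (input, output) pairs for mlp2 and mlp1 over the dataset and run EquationSearch on each of those regression problems separately (one search per output dimension of mlp2).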

MilesCranmer commented 1 year ago

For example, maybe you'll get something like: $$\mathrm{MLP}_2 \approx (\text{pixel}_1 - \text{pixel}_2\ \ ,\ \ \text{pixel}_3 \times \text{pixel}_4)$$ and $$\mathrm{MLP}_1 \approx y_1 \times y_2^2$$

Thus, your final equation would be: $$\text{classification} = \text{sigmoid}\left(\left(\sum_{i} (\text{pixel}_1 - \text{pixel}_2)\right) \times \left(\sum_i \text{pixel}_3 \times \text{pixel}_4\right)^2\right)$$

where the sum is over small patches of $3\times 3$ pixels.
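
Assembled in code, those hypothetical pieces would look something like this (illustrative stand-ins only; the real expressions come from your own SR fits):

sigmoid(z) = 1 / (1 + exp(-z))
mlp2_sym(p) = (p[1] - p[2], p[3] * p[4])    # stand-in for the fitted MLP2 on one flattened patch
mlp1_sym(y) = y[1] * y[2]^2                 # stand-in for the fitted MLP1

# ps is the list of 3×3 patches of one image, each flattened to a vector
classify(ps) = sigmoid(mlp1_sym(reduce((a, b) -> a .+ b, map(mlp2_sym, ps))))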

tecosaur commented 1 year ago

Regarding applying SymbolicRegression to high-dimensional data sets in general, I imagine the recommendation would be to start with a feature-selection approach, and once a small number of highly relevant features has been selected, apply SymbolicRegression?
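
For instance, one simple version of that pipeline would rank pixels by absolute correlation with the label and keep only the top k before running the search (PySR exposes a similar idea through its select_k_features option). A rough sketch, with this scoring rule being just one arbitrary choice:

using Statistics

# Score each feature (row of X) by |correlation| with the label;
# constant pixels give NaN correlations, which we treat as zero.
function select_top_k(X::AbstractMatrix, y::AbstractVector, k::Int)
    scores = map(1:size(X, 1)) do i
        c = cor(vec(X[i, :]), y)
        isnan(c) ? 0.0 : abs(c)
    end
    return partialsortperm(scores, 1:k; rev=true)
end

idx = select_top_k(x_train, y_train, 20)   # keep the 20 most label-correlated pixels
hall_of_fame = EquationSearch(x_train[idx, :], y_train, niterations=50, options=options)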