Distributed High-Performance Symbolic Regression in Julia
Classification on MNIST #141

zsz00 commented 1 year ago

I tried to use SR (SymbolicRegression) for MNIST classification, but the results were not good. I hope you can help me see what I need to improve. MNIST consists of 28×28 images with labels from 10 classes.

using SymbolicRegression
using SymbolicUtils
import MLDatasets: MNIST
import MLUtils: splitobs

function loadmnist(batchsize, train_split)
    ## Load MNIST
    N = 60000  # 5000 
    imgs = MNIST.traintensor(1:N)
    labels_raw = MNIST.trainlabels(1:N)

    ## Process images
    x_data = Float32.(reshape(imgs, size(imgs, 1), size(imgs, 2), 1, size(imgs, 3)))
    y_data = labels_raw   # onehot(labels_raw)
    (x_train, y_train), (x_test, y_test) = splitobs((x_data, y_data); at=train_split)
    return (x_train, y_train), (x_test, y_test)
end

function train()
    batchsize, train_split = 128, 0.9
    (x_train, y_train), (x_test, y_test) = loadmnist(batchsize, train_split)

    options = SymbolicRegression.Options(
        binary_operators=(+, *, /, -),
        unary_operators=(cos, sin, exp),
        # loss=LogitMarginLoss()
    )
    x_train = reshape(x_train, 784, size(x_train)[end])
    y_train = convert(Vector{Float32}, y_train)

    hall_of_fame = EquationSearch(x_train, y_train, niterations=50, options=options, numprocs=8)

    dominating = calculate_pareto_frontier(x_train, y_train, hall_of_fame, options)
    eqn = node_to_symbolic(dominating[end].tree, options)
    println(simplify(eqn))  # transform/simplify the expression
end

Results with N = 5000:
Complexity  Loss       Score         Equation
18          4.715e+00  8.670e-03  ((((sin(x264 + x320) - -1.9941229) * (1.5529515 - (x484 - sin(sin(x437))))) + x355) - x599)      
19          4.703e+00  2.455e-03  ((((sin(x264 + x320) - -1.9941229) * (1.5095575 - (sin(x484) - sin(sin(x437))))) + x355) - x599) 
20          4.665e+00  8.302e-03  (((((sin(x264 + x320) - -1.9941229) - x509) * (1.5529515 - (x484 - sin(sin(x437))))) + x355) - x599)
With N = 60000, the search is very slow and gives worse results.

MilesCranmer commented 1 year ago

MNIST is a high-dimensional dataset, where pure symbolic regression is going to do quite poorly due to the combinatorial scaling. What you can try instead is something like what is described in (see interactive example of this at the end of

Basically, write down a neural network like $$classification = MLP_1(\sum_{i} MLP_2(\text{patch}_i))$$,

where $\text{patch}_i$ is the $i$-th patch of pixels (maybe give it 9 pixels?). Once you have trained this, try to fit SR to $MLP_2$ and $MLP_1$ independently. Finally, arrange the fitted expressions in the same functional form.
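As a rough sketch of the "fit SR to $MLP_2$" step: the idea is to treat the trained network as a black box, record its outputs on many patches, and use those (input, output) pairs as an ordinary regression dataset. The names `mlp2` and `distillation_dataset` below are hypothetical, and `mlp2` is only a stand-in for a trained network, not something the package provides.

```julia
# Stand-in for a trained MLP_2 (hypothetical). Any callable that maps a
# 9-element patch to a vector of outputs could be used here instead.
mlp2(p) = [p[1] - p[2], p[3] * p[4]]

# Build a regression dataset for ONE output of the network.
# `patches` is a 9 x n matrix with one flattened patch per column, which
# matches the features x samples layout used in the original post.
function distillation_dataset(net, patches::AbstractMatrix, output_index::Int)
    X = Float32.(patches)
    y = Float32[net(p)[output_index] for p in eachcol(patches)]
    return X, y
end

# Each output dimension gets its own dataset, and therefore its own search,
# e.g. with the same call as in the original post:
#   hall_of_fame = EquationSearch(X, y, niterations=50, options=options)
```

With only 9 input features per patch, the search space is far smaller than with all 784 pixels at once. The same recipe would apply to $MLP_1$, using the summed $MLP_2$ outputs as inputs.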

MilesCranmer commented 1 year ago

For example, maybe you'll get something like: $$MLP_2 \approx (\text{pixel}_1 - \text{pixel}_2\ \ ,\ \ \text{pixel}_3 \times \text{pixel}_4)$$ and $$MLP_1 \approx y_1 \times y_2^2$$

Thus, your final equation would be: $$classification = \text{sigmoid}\left(\left(\sum_{i} (\text{pixel}_1 - \text{pixel}_2)\right) \times \left(\sum_i \text{pixel}_3 \times \text{pixel}_4\right)^2 \right)$$

where the sum is over small patches of $3\times 3$ pixels.
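To make the functional form concrete, here is a minimal sketch in plain Julia that evaluates the example equation above on one image. The helper names (`extract_patches`, `g`, `f`, `classify`) are hypothetical, the choice of overlapping patches with stride 1 is an assumption, and the expressions for `g` and `f` are just the illustrative ones from this comment, not fitted results.

```julia
# Collect all k x k patches of an image as columns of a (k*k) x n matrix.
# Pixels inside a patch are numbered in column-major order.
function extract_patches(img::AbstractMatrix; k::Int=3, stride::Int=1)
    h, w = size(img)
    cols = [vec(img[r:r+k-1, c:c+k-1])
            for r in 1:stride:h-k+1 for c in 1:stride:w-k+1]
    return reduce(hcat, cols)
end

# Illustrative symbolic stand-ins for the two networks.
g(p) = (p[1] - p[2], p[3] * p[4])   # MLP_2 ≈ (pixel_1 - pixel_2, pixel_3 * pixel_4)
f(y1, y2) = y1 * y2^2               # MLP_1 ≈ y_1 * y_2^2
sigmoid(z) = 1 / (1 + exp(-z))

# classification = sigmoid(f(sum over patches of g(patch)))
function classify(img::AbstractMatrix)
    ys = [g(p) for p in eachcol(extract_patches(img))]
    y1 = sum(first, ys)   # sum of the first component over all patches
    y2 = sum(last, ys)    # sum of the second component over all patches
    return sigmoid(f(y1, y2))
end
```

A single expression like this gives one score per image, so a 10-class problem would presumably need one such expression per class (or some other multi-output arrangement).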

tecosaur commented 1 year ago

Regarding applying SymbolicRegression to high-dimensional data sets in general, I imagine the recommendation would be to start with a feature-selection approach and then, once a small number of highly relevant features has been selected, apply SymbolicRegression?
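For illustration, one very simple version of such a pre-selection step could look like the sketch below. It ranks features by absolute Pearson correlation with the target and keeps the top `k`. The helper `select_features` is hypothetical, and correlation ranking is only one assumed choice among many possible feature-selection methods; it is a crude score when the target is a class label.

```julia
using Statistics

# Rank the features (rows of X, features x samples layout) by
# |Pearson correlation| with y and return the indices of the k best.
function select_features(X::AbstractMatrix, y::AbstractVector, k::Int)
    scores = zeros(Float64, size(X, 1))
    for (j, row) in enumerate(eachrow(X))
        # Constant features (e.g. always-black border pixels) have zero
        # variance, so their correlation is undefined; their score stays 0.
        if std(row) > 0
            scores[j] = abs(cor(row, y))
        end
    end
    return collect(partialsortperm(scores, 1:k; rev=true))
end
```

The reduced matrix `X[idx, :]` with `idx = select_features(X, y, k)` could then be passed to the search in place of the full 784-row input. The variable numbers in the resulting equations would refer to rows of the reduced matrix, so they need to be mapped back through `idx`.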