Why the result of Flux.jl is totally different from tf.Keras (with the same simple MLP)

Dear all,

I want to use Flux.jl to build a simple Multi-Layer Perceptron (MLP) as I did in Keras, where the input data is a matrix of nGene (number of genes) by nInd (number of individuals), output data is a vector of length nInd to represent a trait (e.g. height). I also have two hidden layers with 64, 32 neurons, respectively.

In summary, the number of neurons is changed as: nGene --> 64 --> 32 --> 1

In Keras, the MLP is:

# Instantiate
model = Sequential()

# Add first layer
model.add(Dense(64, input_dim=nGene))
model.add(Activation('relu'))
# Add second layer
model.add(Dense(32))
model.add(Activation('softplus'))
# Last, output layer
model.add(Dense(1))

# compile(optimizer, loss=None, metrics=None, loss_weights=None, sample_weight_mode=None, weighted_metrics=None, target_tensors=None)
model.compile(loss='mean_squared_error', optimizer='adam') 

# fit(x=None, y=None, batch_size=None, epochs=1, verbose=1, callbacks=None, validation_split=0.0, validation_data=None, shuffle=True, class_weight=None, sample_weight=None, initial_epoch=0, steps_per_epoch=None, validation_steps=None, validation_freq=1)
model.fit(X_train, y_train, epochs=100)

From below, the loss (mse) of each epoch is less than one. The prediction accuracy of testing data is about 0.6, which is good.

In Flux.jl, I built the same MLP by:

data = Iterators.repeated((X_train_t, Y_train), 100)

model = Chain(
  Dense(nGene, 64, relu),
  Dense(64, 32, softplus),
  Dense(32, 1))

loss(x, y) = Flux.mse(model(x), y)
ps = Flux.params(model)
opt = ADAM() 
evalcb = () -> @show(loss(X_train_t, Y_train))

Flux.train!(loss, params(model), data, opt, cb = evalcb)

Here X_train_t is a nGene by nInd matrix, Y_train is a vector of length nInd.

The loss is very very high, and the prediction accuracy of testing data is almost zero.

BTW, in Flux.jl, if I change the optimizer to gradient descent, it even didn't converge.

Some extra things but not helpful for my issue:

The default step size and other parameters in Flux are the same as in Keras for the Adam optimizer.
Even if the mean squared error is calculated in a different way, I don’t think it will result in such a bad prediction accuracy in Flux.jl.
In Flux.jl, the input data is a matrix of #genes by #samples. I followed the tutorial of MNIST example, where the input data is a matrix of #pixel by #samples. If I transposed the data in another way, I even cannot run the Flux code.
The elements of the input matrix are either 0 or 1, so I didn’t normalize it in Flux. And I didn’t find that Keras do normalization automatically.

I really don't know why the training process from Flux.jl is wrong, could you please tell me what's wrong with my Flulx code?

Thank you very much, Tianjing

FluxML / Flux.jl

Why the result of Flux.jl is totally different from tf.Keras (with the same simple MLP) #953