Here are some screenshots for reference:
- Training on the whole dataset, without making any groups
- Training on the whole dataset, dividing it into 5 groups
- Training on the first 30 points, again dividing them into 5 groups
```julia
function loss_neuralode(p)
    pred = predict_neuralode(p)
    tot_loss = 0
    x = 5
    for i in 1:x
        loss = sum(abs, ode_data[Int((i-1)*(datasize/x)+1):Int(i*(datasize/x))] .-
                        pred[Int((i-1)*(datasize/x)+1):Int(i*(datasize/x))])
        tot_loss += loss*loss
    end
    return tot_loss, pred
end
```
These aren't doing multiple shooting. `pred = predict_neuralode(p)` is the essence of single shooting: you're still only doing a single solve over the whole time span. Multiple shooting needs to make predictions on the subintervals.
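The key structural change, as a rough sketch (the helper `predict_on_interval` and the non-overlapping grouping here are illustrative, not existing API), is one short solve per group, each restarted from the data at the start of that group:

```julia
# Illustrative sketch only: one short solve per group instead of one long solve.
function multiple_shooting_loss(p)
    tot_loss = 0
    for start in 1:grp_size:datasize-grp_size+1
        idx  = start:start+grp_size-1
        # hypothetical helper: remake/solve the ODE over tsteps[idx],
        # starting from the data point at `start`
        pred = predict_on_interval(ode_data[start], tsteps[idx], p)
        tot_loss += sum(abs2, ode_data[idx] .- pred)
    end
    return tot_loss
end
```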
I tried implementing the loss_neuralode function like this:
```julia
function predict_neuralode(u_initial, starting_timestep, p)
    Array(prob_neuralode(u_initial, p))
end

function loss_neuralode(p)
    tot_loss = 0
    # dividing the dataset into groups of 10, i.e. 1-10, 11-20, ..., 41-50
    for i in [10, 20, 30, 40, 50]
        pred = predict_neuralode(Float32[ode_data[i-9]], tsteps[i-9], p)
        loss = sum(abs, ode_data[i-9:i] .- pred[1:10])
        tot_loss += loss*loss
    end
    pred = predict_neuralode(u0, tsteps[1], p)
    return tot_loss, pred
end
```
To do multiple shooting, I am sending the first element of each group (`u_initial`), the timestep of that `u_initial` (`starting_timestep`), and the parameters of the NN (`p`). But I cannot figure out how to use `starting_timestep` as the time from which the NN should start predicting inside `predict_neuralode`. The function `prob_neuralode()` seems to always start predicting from timestep 0.0f0.
Use the direct interface, i.e.
```julia
dudt2 = FastChain((x, p) -> x.^3,
                  FastDense(2, 50, tanh),
                  FastDense(50, 2))
neural_ode_f(u, p, t) = dudt2(u, p)
pinit = initial_params(dudt2)
prob = ODEProblem(neural_ode_f, u0, tspan, pinit)
```
See the bottom of https://diffeqflux.sciml.ai/dev/examples/neural_ode_sciml/. That will give you more control. We can then figure out how to change the `NeuralODE` struct. It will need to be changed in order to do this, but I think it would be good to find not only that, but also a better interface for multiple shooting.
These are the functions I have modified:

```julia
function predict_neuralode(u_initial, timesteps, p)
    tmp_prob = remake(prob, p = p, tspan = (timesteps[1], timesteps[end]), u0 = u_initial)
    return Array(solve(tmp_prob, Tsit5(), saveat = timesteps))
end
```
```julia
function loss_neuralode(p)
    tot_loss = 0
    # dividing the dataset into groups of 10, i.e. 1-10, 11-20, ...
    # grp_size = 10, datasize = 20
    for i in grp_size:grp_size:datasize
        pred = predict_neuralode(Float32[ode_data[i-grp_size+1]], tsteps[i-grp_size+1:i], p)
        loss = sum(abs2, ode_data[i-grp_size+1:i] .- pred[1:grp_size])
        tot_loss += (loss*loss) + (loss_multiplier * abs(ode_data[i-grp_size+1] - pred[1]))
    end
    pred = predict_neuralode(u0, tsteps, p)
    return tot_loss, pred
end
```
These are the predictions I got from the NN in the first iteration. For the first 10 points (1st group):

```julia
Float32[0.0, 0.013333219, 0.026126323, 0.038400874, 0.050177608, 0.061476488, 0.07231665, 0.08271681, 0.09269449, 0.1022668]
```

For points 11 to 20 (2nd group):

```julia
Float32[0.52629834, 0.5184088, 0.51083046, 0.5035515, 0.49656034, 0.48984596, 0.4833976, 0.477205, 0.4712582, 0.46554756]
```
When the initial time is given as t = tsteps[11] in the `predict_neuralode` function (to do multiple shooting), the prediction is 0.52629834. But if I take my initial time as tsteps[1], then my prediction for the 11th timestep is different (as shown in the graph).
So we need to figure out a way to pass a given timestep, and the initial condition at that timestep, to our NN and get predictions from there. But I couldn't find any functions or interfaces relevant to this.
> When the initial time is given as t = tsteps[11] in the predict_neuralode function (to do multiple shooting), the prediction is 0.52629834. But if I take my initial time as tsteps[1], then my prediction for the 11th timestep is different (as shown in the graph).

Indeed, that continuity will only hold after the loss function is minimized, and only if there is a term forcing continuity. How large is `loss_multiplier`? It might be dominated by the squared term.
I tried several values for `loss_multiplier`, like 0, 10, 100, 10000. But it doesn't work, because when I call `predict_neuralode` in the second iteration of `for i in grp_size:grp_size:datasize`, the arguments are `(ode_data[11], tsteps[11], p)`. And when the problem is remade inside `predict_neuralode` with those arguments as the initial condition, the prediction at t = 0 of the new problem (which is t = tsteps[11] of our original problem) is always 0.52629834 (so `pred[1] = ode_data[11] = 0.52629834`). So inside `loss_neuralode`, no matter how big `loss_multiplier` is, `abs(ode_data[11] - pred[1])` is always 0. Therefore `loss_multiplier` is not doing anything.
Oh, that term is incorrect. It should be `abs(ode_data[i+grp_size] - pred[end])`. The end of one group's prediction should equal the initial condition of the next group. That term only becomes important once you start making the initial conditions along the trajectory parameters themselves, though.
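As a rough sketch of that continuity term (the indices are left symbolic here, since getting them right is exactly the off-by-one issue that comes up below):

```julia
# Sketch only: penalize the mismatch between the end of group k's prediction
# and the state that group k+1 is started from.
continuity = abs(pred_k[end] - u0_next)   # u0_next = data (or free parameter) starting group k+1
tot_loss  += interval_loss_k + loss_multiplier * continuity
```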
I changed `grp_size` to 5 and this is the new function:
```julia
function loss_neuralode(p)
    tot_loss = 0
    for i in grp_size:grp_size:datasize
        pred = predict_neuralode(Float32[ode_data[i-grp_size+1]], tsteps[i-grp_size+1:i], p)
        loss = sum(abs2, ode_data[i-grp_size+1:i] .- pred[1:grp_size])
        if i == datasize
            tot_loss += loss*loss
        else
            tot_loss += (loss*loss) + (loss_multiplier * abs(ode_data[i+grp_size] - pred[grp_size]))
        end
    end
    pred = predict_neuralode(u0, tsteps, p)
    return tot_loss, pred
end
```
Changing `loss_multiplier` didn't have much of an effect, and this is the output I am getting.
Also, if I do `abs(ode_data[i+grp_size] - pred[grp_size])`, wouldn't it calculate the distance between the first point of the next group and the prediction at the last point of the current group, which could be really far apart if there is a high slope in between?
> Also, if I do `abs(ode_data[i+grp_size] - pred[grp_size])`, wouldn't it calculate the distance between the first point of the next group and the prediction at the last point of the current group, which could be really far apart if there is a high slope in between?
They should overlap at a point.
Take it slow: if there's one group does it work? 2?
These are the predictions with one group (grp_size = 20):

With 2 groups (grp_size = 10):

Also, I tested it on 5 groups, since with only 2 groups the 2nd group would fall into the `if (i == datasize)` branch (`tot_loss += loss*loss`) and the term `loss_multiplier * abs(ode_data[i+grp_size] - pred[grp_size])` wouldn't be added.
Wait, why are you indexing the ODE data linearly when it's a matrix?
The `ode_data` is of type `Array{Float32,1}`, so I have been accessing it linearly like a normal 1D array. I didn't quite get what you meant.
Oh I see. Does this example solve fine "the normal way"?
No, it still gets stuck in a local minimum. The predictions are very similar to those with grp_size = 20 (the whole dataset). There's a picture above with the same grp_size.
Why is there an `(x, p) -> x.^3` in front if you're trying to fit f(x) = x*sin(3x)? That doesn't seem like a good idea.
I would suggest not attempting to write new methods on new examples: start with existing examples to debug.
Hi! So I shifted my focus to this already-solved problem: https://diffeqflux.sciml.ai/dev/examples/local_minima/ And this is the code I am working on:
```julia
using DiffEqFlux, OrdinaryDiffEq, Flux, Optim, Plots

u0 = Float32[2.0; 0.0]
datasize = 30
tspan = (0.0f0, 5.0f0)
tspan_mini = (0.0f0, 1.5f0)
datasize_mini = 10
tsteps = range(tspan[1], tspan[2], length = datasize)

function trueODEfunc(du, u, p, t)
    true_A = [-0.1 2.0; -2.0 -0.1]
    du .= ((u.^3)'true_A)'
end
prob_trueode = ODEProblem(trueODEfunc, u0, tspan)
ode_data = Array(solve(prob_trueode, Tsit5(), saveat = tsteps))

iter = 0
grp_size = 5
loss_multiplier = 1000

dudt2 = FastChain((x, p) -> x.^3,
                  FastDense(2, 16, tanh),
                  FastDense(16, 2))
neural_ode_f(u, p, t) = dudt2(u, p)
pinit = initial_params(dudt2)
prob = ODEProblem(neural_ode_f, u0, tspan_mini, pinit)
prob_neuralode = NeuralODE(dudt2, (0.0, 1.6), Tsit5(), saveat = tsteps[tsteps .<= 1.6])

function predict_neuralode(u_initial, timesteps, p)
    tmp_prob = remake(prob, p = p, tspan = (timesteps[1], timesteps[end]), u0 = u_initial)
    return Array(solve(tmp_prob, Tsit5(), saveat = timesteps))
end

function loss_neuralode(p)
    pred = predict_neuralode(u0, tsteps[tsteps .<= 1.6], p)
    loss = sum(abs2, ode_data[:, 1:size(pred, 2)] .- pred)
    return loss, pred
end

iter = 0
callback = function (p, l, pred; doplot = true)
    global iter
    iter += 1
    display(l)
    if doplot
        # plot current prediction against data
        plt = scatter(tsteps[1:size(pred, 2)], ode_data[1, 1:size(pred, 2)], label = "data")
        scatter!(plt, tsteps[1:size(pred, 2)], pred[1, :], label = string("pred: ", iter))
        display(plot(plt))
    end
    return false
end

iter = 0
result_neuralode = DiffEqFlux.sciml_train(loss_neuralode, prob_neuralode.p,
                                          ADAM(0.05), cb = callback,
                                          maxiters = 300)

function final_loss(p)
    return sum(abs2, ode_data[:, 1:10] .- Array(prob_neuralode(u0, p)))
end
print("Final Loss: ", final_loss(result_neuralode.minimizer), "\n")
```
Here is the result with single shooting: Loss: 1.4008017
For multiple shooting, here is the `loss_neuralode` function:
```julia
function loss_neuralode(p)
    tot_loss = 0
    for i in grp_size:grp_size:datasize_mini
        pred = predict_neuralode(ode_data[:, i-grp_size+1], tsteps[i-grp_size+1:i], p)
        loss = sum(abs2, ode_data[:, i-grp_size+1:i] .- pred[:, 1:grp_size])
        if i == datasize
            tot_loss += loss*loss
        else
            tot_loss += (loss*loss) +
                        (loss_multiplier * sum(abs, ode_data[:, i+grp_size] - pred[:, grp_size]))
        end
    end
    pred = predict_neuralode(u0, tsteps[tsteps .<= 1.6], p)
    return tot_loss, pred
end
```
Here is the result with this function: Loss: 44.763756
Replacing the difference term `ode_data[:, i+grp_size] - pred[:, grp_size]` with `ode_data[:, i+grp_size+1] - pred[:, 1]`, and also removing the if condition, this is the result:
Loss: 2.431281
So I think the second difference, `ode_data[:, i+grp_size+1] - pred[:, 1]`, does better at predicting?
If you run it longer does it converge? If you make each group one data point does it converge?
When the difference term is `ode_data[:, i+grp_size] - pred[:, grp_size]` and we increase the number of iterations, it does try to converge, but something keeps it from converging at the 5th point (the last point of the first group). Here is the convergence for 1500 iterations.
For any number of iterations, the 5th point stays at the same position.
If each group has one data point, then the predictions do not change at all.
Yes, look for an off-by-one error.
Hi, I tried correcting the term and there is an improvement over single shooting, as in the screenshots below:
But it still gets stuck in a local minimum for the different function y = x*sin(3x), or when the datasize is too big.
This is how I am calculating the loss:

```julia
tot_loss = 0
for i in grp_size:grp_size:datasize_mini
    pred = predict_neuralode(ode_data[:, i-grp_size+1], tsteps[i-grp_size+1:i], p)
    loss = sum(abs2, ode_data[:, i-grp_size+1:i] .- pred[:, 1:grp_size])
    tot_loss += (loss*loss) + (loss_multiplier * sum(abs, ode_data[:, i-grp_size+1] - pred[:, 1]))
end
```
What about grouping one data point per group in the spiral neural ODE example? I know that works, so if that doesn't work with your form then there's a bug.
Hi, I tried overlapping the groups. But when grouping one data point per group, I am not sure how the losses should be calculated, since there are no overlapping points between two adjacent groups. So for now I take a point and try to predict the immediately next point.
```julia
function loss_neuralode(p)
    tot_loss = 0
    if grp_size == 1
        for i in 1:datasize_mini-1
            pred = predict_neuralode(ode_data[:, i], Array(tsteps[i:i+grp_size]), p)
            loss = sum(abs2, ode_data[:, i:i+grp_size] .- pred[:, 1:1+grp_size])
            tot_loss += (loss*loss) +
                        (loss_multiplier * sum(abs, ode_data[:, i+grp_size] - pred[:, 1+grp_size]))
        end
        pred = predict_neuralode(u0, tsteps[tsteps .<= mini_timestep_value], p)
        return tot_loss, pred
    end
    for i in 1:grp_size-1:datasize_mini-grp_size
        pred = predict_neuralode(ode_data[:, i], tsteps[i:i+grp_size-1], p)
        loss = sum(abs2, ode_data[:, i:i+grp_size-1] .- pred[:, 1:grp_size])
        tot_loss += (loss*loss) +
                    (loss_multiplier * sum(abs, pred[:, grp_size] - ode_data[:, i+grp_size-1]))
    end
    pred = predict_neuralode(u0, tsteps[tsteps .<= mini_timestep_value], p)
    return tot_loss, pred
end
```
This decently captures the shape of the data and doesn't get stuck in a local minimum (for the example I am working on).
What if you do more iterations? 100 is kind of quick.
It randomly oscillates like the following plots, very likely due to a large step size (the plots are just 2-3 iterations apart). These are for grp_size = 1, iterations = 300.
Start with 300 iterations of ADAM and then do BFGS, as shown in the tutorial. Post that.
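(A hedged sketch of that two-stage run, using the variable names from the script above; BFGS comes from Optim, which is already imported there:)

```julia
# ADAM for a quick first pass, then BFGS to polish, as in the tutorials.
res_adam = DiffEqFlux.sciml_train(loss_neuralode, pinit,
                                  ADAM(0.05), cb = callback, maxiters = 300)
res_bfgs = DiffEqFlux.sciml_train(loss_neuralode, res_adam.minimizer,
                                  BFGS(initial_stepnorm = 0.01), cb = callback, maxiters = 100)
```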
That looks pretty good!
Yes!! But I also tried it on y = x*sin(3x) with only 10 data points, and it got stuck in a bad local minimum. Single shooting doesn't work either.
The question now is, how do you start to package it so that it's easier for users to do this? A function? Or just a tutorial? What are your thoughts?
I don't have a clear answer, but I think a function would make it easier for others to use (without any bugs in indexing or dividing the data into groups) and wouldn't require much knowledge about multiple shooting. On the other hand, if someone implements it from a tutorial, that would give them quite some flexibility. But again, they can always implement it themselves in either case.
Try prototyping a design here. See if you can simplify down the user code by having something that helps calculate a multiple shooting loss.
Hi Chris, sorry for the delay; I got really busy with my mid-semester exams. But here is a function that helps calculate the loss.
```julia
function loss_neuralode_helper(p::Array, ode_data::Array, tsteps::Array, prob::ODEProblem,
                               loss_function::Function, grp_size::Integer = 5,
                               loss_multiplier::Integer = 100)
    tot_loss = 0
    if grp_size == 1
        for i in 1:datasize-1
            pred = Array(solve(remake(prob, p = p, tspan = (tsteps[i], tsteps[i+1]),
                                      u0 = ode_data[:, i]), Tsit5(), saveat = tsteps[i:i+1]))
            tot_loss += loss_function(ode_data[:, i:i+1], pred[:, 1:2]) +
                        (loss_multiplier * sum(abs, ode_data[:, i+1] - pred[:, 2]))
        end
        pred = predict_neuralode(u0, tsteps, p)
        return tot_loss, pred
    end
    for i in 1:grp_size-1:datasize-grp_size
        pred = solve(remake(prob, p = p, tspan = (tsteps[i], tsteps[i+grp_size-1]),
                            u0 = ode_data[:, i]), Tsit5(), saveat = tsteps[i:i+grp_size-1])
        tot_loss += loss_function(ode_data[:, i:i+grp_size-1], pred[:, 1:grp_size]) +
                    (loss_multiplier * sum(abs, pred[:, grp_size] - ode_data[:, i+grp_size-1]))
    end
    pred = predict_neuralode(u0, tsteps, p)
    return tot_loss, pred
end
```
This function can be called as follows:
```julia
function loss_neuralode(p)
    return loss_neuralode_helper(p, ode_data, tsteps, prob_param, loss_function_param,
                                 grp_size_param, loss_multiplier_param)
end
```
This should be helpful, since various parameters (like grp_size and loss_function) can also be changed during training itself, giving more flexibility and power to the user. What do you think about this?
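For example (just an illustration of that flexibility; the `*_param` names are the placeholders from the snippet above and the two-stage schedule is hypothetical):

```julia
# Hypothetical schedule: train with coarse groups first, then refine with smaller ones.
loss_coarse(p) = loss_neuralode_helper(p, ode_data, tsteps, prob_param, loss_function_param, 10, 100)
loss_fine(p)   = loss_neuralode_helper(p, ode_data, tsteps, prob_param, loss_function_param, 2, 100)

res1 = DiffEqFlux.sciml_train(loss_coarse, pinit, ADAM(0.05), cb = callback, maxiters = 150)
res2 = DiffEqFlux.sciml_train(loss_fine, res1.minimizer, ADAM(0.05), cb = callback, maxiters = 150)
```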
Yeah, looks useful. Start the PR
I am trying this same example using multiple shooting for my own clarity, but it's not training successfully. I used the above-mentioned `loss_neuralode_helper` from the PR. Could anyone help me find where I am making a mistake? The loss value is increasing instead.
```julia
using DiffEqFlux, Flux
using GalacticOptim
using OrdinaryDiffEq
using Plots
using Plots: scatter, scatter!
gr()  # just a backend for plotting

u0 = Float32[2.0; 0.0]
datasize = 30
batchsize = 5
loss_multiplier = 100
tspan = (0.0f0, 5.0f0)
tsteps = range(tspan[1], tspan[2], length = datasize)

function trueODEfunc(du, u, p, t)
    true_A = [-0.1 2.0; -2.0 -0.1]
    du .= ((u.^3)'true_A)'
end
prob_trueode = ODEProblem(trueODEfunc, u0, tspan)
ode_data = Array(solve(prob_trueode, Tsit5(), saveat = tsteps))

NN = Chain(x -> x.^3, Dense(2, 16, tanh), Dense(16, 2))
p, re = Flux.destructure(NN)  # use this p as the initial condition!
dudt!(u, p, t) = re(p)(u)     # need to restructure for backprop!
prob = ODEProblem(dudt!, u0, tspan, p)

function loss_function(ode_data, pred)
    return sum(abs2, ode_data .- pred)
end

function loss_neuralode_helper(p)
    tot_loss = 0.0f0
    if batchsize == 1
        for i in 1:datasize-1
            pred = Array(solve(prob, Tsit5(), u0 = ode_data[:, i],
                               tspan = (tsteps[i], tsteps[i+1]), p = p, saveat = tsteps[i:i+1]))
            loss = loss_function(ode_data[:, i:i+1], pred[:, 1:2])
            tot_loss += (loss*loss) + (loss_multiplier * sum(abs, ode_data[:, i+1] - pred[:, 2]))
        end
        pred = Array(solve(prob, Tsit5(), u0 = ode_data[:, 1],
                           tspan = (tsteps[1], tsteps[datasize]), p = p, saveat = tsteps))
        return tot_loss, pred
    end
    for i in 1:batchsize-1:datasize-batchsize
        pred = Array(solve(prob, Tsit5(), u0 = ode_data[:, i],
                           tspan = (tsteps[i], tsteps[i+batchsize-1]), p = p,
                           saveat = tsteps[i:i+batchsize-1]))
        loss = loss_function(ode_data[:, i:i+batchsize-1], pred[:, 1:batchsize])
        tot_loss += (loss*loss) +
                    (loss_multiplier * sum(abs, pred[:, batchsize] - ode_data[:, i+batchsize-1]))
    end
    pred = Array(solve(prob, Tsit5(), u0 = ode_data[:, 1],
                       tspan = (tsteps[1], tsteps[datasize]), p = p, saveat = tsteps))
    return tot_loss, pred
end

loss_neuralode_helper(p)  # initial loss value

callback = function (p, l, pred; doplot = true)
    display(l)
    # plot current prediction against data
    plt = scatter(tsteps, ode_data[1, :], label = "data")
    scatter!(plt, tsteps, pred[1, :], label = "prediction")
    if doplot
        display(plot(plt))
    end
    return false
end

adtype = GalacticOptim.AutoForwardDiff()
optf = GalacticOptim.OptimizationFunction((x, p) -> loss_neuralode_helper(x), adtype)
optfunc = GalacticOptim.instantiate_function(optf, prob.p, adtype, nothing)
optprob = GalacticOptim.OptimizationProblem(optfunc, prob.p)
result_neuralode = GalacticOptim.solve(optprob, ADAM(0.05), cb = callback, maxiters = 300)
```
Hey, it looks like in the code above you forgot to solve/define `ode_data`. Besides that, your code is working fine; the predictions are converging, but at a very slow rate. Also, each iteration of training is slower using `GalacticOptim.solve` (at least for me).
These pics are from the loss function you defined:
Also, in the above code I changed ADAM(0.01) to ADAM(0.05), which made it a lot easier and faster to converge.
In my code from the PR, I used DiffEqFlux.sciml_train to train, as in the tutorial, which I found faster. Hope this helps.
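(For reference, a hedged sketch of what that sciml_train call could look like with the helper above; `p` and `callback` are the ones defined in your script:)

```julia
# Sketch: training with DiffEqFlux.sciml_train instead of GalacticOptim.solve.
result = DiffEqFlux.sciml_train(loss_neuralode_helper, p,
                                ADAM(0.05), cb = callback, maxiters = 300)
```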
> Hey, it looks like in the code above you forgot to solve/define `ode_data`.

@Manas2030 sorry, that was a copy-paste issue; I've edited it now.

> Besides that, your code is working fine; the predictions are converging, but at a very slow rate. Also, each iteration of training is slower using `GalacticOptim.solve` (at least for me). In my code from the PR, I used DiffEqFlux.sciml_train to train, as in the tutorial, which I found faster. Hope this helps.

I hope the `batchsize` I defined is the same as the `grp_size` that you have plotted. Actually, I tried both ways, `solve` from GalacticOptim and `DiffEqFlux.sciml_train`, with a larger learning rate (0.05) and also with more relaxed learning rates, but it never could decrease the loss function :-(
> Also each iteration of training is slower using GalacticOptim.solve (at least for me).

That sounds very unlikely, but I would like to see code to reproduce it, if you have it handy, so I can investigate.
> Also each iteration of training is slower using GalacticOptim.solve (at least for me).
> That sounds very unlikely, but I would like to see code to reproduce it, if you have it handy, so I can investigate.

Could you please try the code I posted a couple of comments above? I have used the same loss function provided by @Manas2030 in the PR, and it also uses GalacticOptim.
@Vaibhavdixit02 Hi Vaibhav, you're right; it turned out the step size in ADAM was 0.001, which I mistakenly read as 0.01. It works just as fast after changing it. Sorry for that. Also, could remaking an ODEProblem (as in the snippets below) in each iteration have added to the slowness? Just wondering.

```julia
pred = Array(solve(prob, Tsit5(), u0 = ode_data[:, i],
                   tspan = (tsteps[i], tsteps[i+batchsize-1]), p = p,
                   saveat = tsteps[i:i+batchsize-1]))
```

The snippet below is with remaking a problem:

```julia
pred = Array(solve(remake(prob, p = p, tspan = (tsteps[i], tsteps[i+batchsize-1]),
                          u0 = ode_data[:, i]), Tsit5(), saveat = tsteps[i:i+batchsize-1]))
```
@yewalenikhil65 Hi Nikhil, could you please update/upload the exact code you're working on? Using the exact code you mentioned above, I was getting this error: `Tried to add a tstop that is behind the current time. This is strictly forbidden.` And also, `gr()` needs to be imported.
I am not sure if there's a cost associated with `remake`. You can experiment on your end with both alternatives and time them to check.
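For example (a rough timing sketch using BenchmarkTools, reusing the variables from the snippets above; not something that was actually run here):

```julia
using BenchmarkTools

i = 1
# solve with u0/tspan/p passed as keyword overrides
@btime Array(solve(prob, Tsit5(), u0 = ode_data[:, i],
                   tspan = (tsteps[i], tsteps[i+batchsize-1]), p = p,
                   saveat = tsteps[i:i+batchsize-1]))

# solve after remaking the problem
@btime Array(solve(remake(prob, p = p, tspan = (tsteps[i], tsteps[i+batchsize-1]),
                          u0 = ode_data[:, i]), Tsit5(),
                   saveat = tsteps[i:i+batchsize-1]))
```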
> @yewalenikhil65 Hi Nikhil, could you please update/upload the exact code you're working on? Using the exact code you mentioned above, I was getting this error: `Tried to add a tstop that is behind the current time. This is strictly forbidden.` And also, `gr()` needs to be imported.

@Manas2030, I have updated/edited it in the previous comment. `gr()` was just a backend for `using Plots`; you could use any of the backends in Plots that you like. As for the other error, `Tried to add a tstop that is behind the current time. This is strictly forbidden.`, I am not encountering it if you just copy-paste that code: https://github.com/SciML/DiffEqFlux.jl/issues/481#issuecomment-798972103
This is very, very strange; I'm getting that error. Maybe someone else could look into it once?
I would be glad if someone takes a look at https://github.com/SciML/DiffEqFlux.jl/issues/481#issuecomment-798972103, as its loss function is exactly the same as the one in the multiple shooting PR.
Besides, I was wondering if there is any cleaner way to split the dataset, as the indices inside the for loops are very confusing at first glance.
Could `DataLoader` from Flux help? It's just that I don't see the data overlapping at the endpoints with this approach. Is that a strictly necessary condition for multiple shooting?
```julia
train_loader = Flux.Data.DataLoader(ode_data, t, batchsize = k, partial = false)
for (x, y) in train_loader
    @show x  # a chunk of size k of ode_data
    @show y  # a chunk of size k of t
end
```
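If DataLoader's non-overlapping chunks turn out to be a problem, overlapping index ranges can also be generated directly; here is a small illustrative helper (not part of the PR, just a sketch of the idea):

```julia
# Hypothetical helper: overlapping ranges so adjacent groups share their boundary point.
group_ranges(datasize, grp_size) =
    [i:min(i + grp_size - 1, datasize) for i in 1:grp_size-1:datasize-1]

group_ranges(30, 5)
# => 1:5, 5:9, 9:13, 13:17, 17:21, 21:25, 25:29, 29:30
```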
Training the NN by splitting the datapoints and training on smaller intervals, to avoid getting stuck in local minima while training.

The function I tried to train the NN to predict is f(x) = x*sin(3x). Here is the code I used to get my observations:

Observations:

Training on the whole dataset (NN architecture, hyperparameters, and loss function as above):
- Iterations = 300, Loss = 43845.42f0
- Taking a bigger NN: Iterations = 300, Loss = 43836.41f0

Splitting the datapoints and training on smaller intervals, going back to the old NN and taking a new loss function that groups the points into x groups:
- With x = 2, loss = 43907.97f0; with x = 5, loss = 44291.656f0; with x = 10, loss = 43852.062f0
- Replacing `tot_loss += loss*loss` with `tot_loss += log(cosh(loss))`: with x = 2 it doesn't work (produces NaN); with x = 5, loss = 43862.867f0; with x = 10, loss = 43873.414f0

Training on the first 30 points of the dataset (datasize = 30), since that should be easier for the NN to learn:
- Training on the whole dataset: Loss = 5765.231f0
- Dividing the dataset into groups as before: with x = 2, loss = 5954.032f0; with x = 5, loss = 5966.1304f0; with x = 10, loss = 6044.2563f0
- Again replacing `tot_loss += loss*loss` with `tot_loss += log(cosh(loss))`: with x = 2, loss = 5941.9297f0; with x = 5, loss = 6614.8916f0; with x = 10, loss = 6424.8003f0