Thanks for organizing those tricks.
slope, for example, https://github.com/DENG-MIT/CRNN/blob/e7242f6907634299a8e88ffad1b4e466eb764e80/case2/case2.jl#L99
The general guideline is that, if the activation energies or the prefactor logA are far away from zero (compared to unity), it is recommended to have a slope to rescale the weights. This is because those weights are usually initialized from Gaussian distributions. We should try our best to keep the weights close to a Gaussian distribution as well, to make the optimization easier.
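A minimal sketch of the idea (not the exact code from case2.jl; nr and the value of slope here are just illustrative):

```julia
nr = 4                          # hypothetical number of reactions
slope = 10.0                    # illustrative rescaling factor
p = randn(nr)                   # raw trainable parameters stay roughly N(0, 1)
logA = p .* slope               # effective log-prefactors can now reach O(10)
```

The optimizer only ever sees p near the unit scale, while the physical parameters can still be large.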
dydt, for example, https://github.com/DENG-MIT/CRNN/blob/e7242f6907634299a8e88ffad1b4e466eb764e80/case3/case3.jl#L165
It is useful if the species concentrations for different species span several orders of magnitude. Similar to the intuition for the slope, it is useful to scale the NN output to be close to a Gaussian distribution. A natural choice of the scaling is the maximum concentration (or range) / t_end.
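A minimal sketch of this scaling, assuming a species-by-time data matrix ydata (all names and sizes are illustrative, not the exact code in case3.jl):

```julia
ns, nt = 3, 50
ydata = abs.(randn(ns, nt)) .* [1.0, 1e-3, 1e-6]   # species spanning orders of magnitude
t_end = 10.0
dydt_scale = maximum(ydata, dims = 2) ./ t_end     # per-species scale: max concentration / t_end
nn_output = randn(ns)                              # raw, O(1) network output for one state
dydt = vec(dydt_scale) .* nn_output                # rescaled derivative passed to the ODE solver
```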
clamping weights to [-2, 2]
To have some flexibility, and to keep the loss curves smooth, you can relax it a little bit, say to [-3, 3] or [-4, 4]. For catalysis reactions, the input and output weights only share signs but not values.
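For instance, a toy sketch of the clamping itself (the bounds only illustrate the tight vs. relaxed choice):

```julia
w = randn(5, 4) .* 3.0               # hypothetical weight matrix
w_tight   = clamp.(w, -2.0, 2.0)     # hard box [-2, 2]
w_relaxed = clamp.(w, -3.0, 3.0)     # relaxed box for more flexibility
```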
scaling loss by the standard deviation
Use it when the species concentrations span several orders of magnitude.
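A minimal sketch, assuming the training data is a species-by-time matrix (names are illustrative):

```julia
using Statistics

ydata = abs.(randn(3, 50)) .* [1.0, 1e-3, 1e-6]   # species spanning orders of magnitude
ypred = ydata .+ 1e-4 .* randn(3, 50)             # hypothetical model prediction
ystd  = std(ydata, dims = 2) .+ 1e-8              # per-species scale (offset avoids division by zero)
loss  = mean(abs.((ypred .- ydata) ./ ystd))      # each species now contributes comparably
```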
sensealg for different cases (stiff and non-stiff)
I would suggest trying different ones and profiling them. This depends on a lot of factors, and Chris has a nice paper on it: https://arxiv.org/pdf/1812.01892.pdf
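For concreteness, a hedged sketch of what swapping sensealg can look like with the DiffEqSensitivity / DiffEqFlux stack (treat these as starting points to profile, not recommendations):

```julia
using DiffEqSensitivity

# often tried for stiff problems
sensealg_stiff = QuadratureAdjoint(autojacvec = ReverseDiffVJP(true))
# often tried for non-stiff problems
sensealg_nonstiff = InterpolatingAdjoint(autojacvec = ZygoteVJP())

# then pass the choice into the solve call inside the prediction function, e.g.
# sol = solve(prob, solver; p = p, saveat = tsteps, sensealg = sensealg_stiff)
```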
when to bother about g_norm or gradient in the training process (like we don't bother much in simple case 1)
Use it when you see the gradient fluctuating a lot. It is hard to say in advance when that happens. Gradient clipping is a fairly common practice in training RNNs and neural ODEs.
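A minimal, generic sketch of gradient-norm clipping before the optimizer update (not a snippet from this repo):

```julia
using LinearAlgebra

function clip_gradient!(grad; clip = 1.0e2)
    g_norm = norm(grad)            # also useful to log for monitoring
    if g_norm > clip
        grad .*= clip / g_norm     # rescale in place so the norm equals clip
    end
    return g_norm
end

grad = randn(20) .* 1.0e3
g_norm = clip_gradient!(grad)      # clip, then hand grad to the optimizer
```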
Thanks @jiweiqi , this is a nice guideline. Forgot another thing to ask that I had in mind.
- Loss function for your trial codes was MSE-based, whereas now I think you have adopted an MAE measure. Any particular reason?
This is a good question. It seems that MAE is preferred for neural ODEs, although I don't know the exact reason. I noticed it when I tried the ODE demo code in the PyTorch package torchdiffeq. To me, the intuition is that the error accumulates over time, so the error can be substantially larger at a later phase. But we want the loss at the earlier phase to participate well in the training, so it is better to use MAE, since MSE will focus on the large errors. But you know, those kinds of things are heuristic and we should always try both.
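A tiny made-up illustration of the difference between the two losses:

```julia
using Statistics

ydata = [1.0, 1.1, 5.0]            # last point carries a large late-phase error
ypred = [1.0, 1.0, 3.0]
mae = mean(abs.(ypred .- ydata))   # 0.7  : every point contributes proportionally
mse = mean(abs2.(ypred .- ydata))  # ~1.34: dominated by the single large error
```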
I made up a logic of my own by creating an analogy with regularization (which is also often part of the loss function). L1 is analogous to MAE, whereas L2 is analogous to MSE. So just as we tend to use L1 methods for inducing sparsity or taking care of outliers, I think MAE does a similar job. MSE, on the other hand, tends to induce a bias towards outliers.
This logic is of course based on the assumption that we have few outliers, and hence are using an L1 (MAE) type of metric.
Is (w_out .* dydt_scale)' == w_out' .* dydt_scale' ?
Case 3 and Robertson's case seem to suggest so when printing w_out_scale.
I think so since it is an elementwise product.
For a normal matrix product, it's (A x B)' = B' x A'. I do not know any rule for the element-wise product though. Besides, w_out is a matrix and dydt_scale is a vector, so I got confused about how it even evaluates. But it does evaluate, so I am guessing it's not a matrix-vector product; it must be a matrix-matrix element-wise product, and the rule seems to hold.
Yeah, it is certainly not a matrix-vector product; instead, it broadcasts in a certain way. I am always confused about it. It is a good practice to check in the terminal as you did :)
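For example, a quick REPL check with illustrative sizes:

```julia
w_out = randn(3, 4)            # 3 species x 4 reactions (hypothetical sizes)
dydt_scale = randn(3)          # one scale per species
lhs = (w_out .* dydt_scale)'   # scale each row, then transpose -> 4 x 3
rhs = w_out' .* dydt_scale'    # transpose, then scale each column -> 4 x 3
lhs == rhs                     # true: both give w_out[j, i] * dydt_scale[j]
```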
w_out_ = (w_out' .* dydt_scale') .* exp.(w_b) # scaling w_out
display(w_out_)
display(maximum(abs.(w_out_), dims=2)') # extracting maximum of absolute values from each row
display(w_out_ ./ maximum(abs.(w_out_), dims=2)) # writing off 1's in place of maximum
Am I correct with these comments ?
Shouldn't maximum(abs.(w_out_), dims=2)' also be used in scaling w_b ?
I think you're right, unless you see results that look strange; then we might come back to check the formula.
For catalysis reactions, the input and output weights only share signs but not values.
@jiweiqi Our w_in demands a value, but w_out demands it to be zero for a catalytic reaction. Any guidelines for clamping weights for catalysis reaction systems?
I don't think the training is sensitive to the boundary of the clamping for w_out. For w_in, we shall take care that it is not too large, which would induce strong stiffness and also not be realistic. Most reactions are no more than second order, i.e., bimolecular reactions. Third-order (termolecular) reactions are also possible, although less common. I think 2.5 would be a good choice of upper bound for w_in.
Oh no, I didn't mean to ask about training sensitivity. I understood it a bit better after re-reading your paper and code now. I was referring to this:
#case1 (also case 2)
w_in = clamp.(-w_out, 0, 2.5);
but in case 3
w_in = clamp.(w_in, 0.f0, 4.f0);
This refers to the quote "except that the sharing parameter between input weights and output weights are relaxed since the stoichiometric coefficients (output weights) for the catalysis could be zero while the reaction orders (input weights) are non-zero" from the paper. Am I correct ?
Yes, you are right. For case 3, we don't bind w_in and w_out as we did for case 1 and case 2. Similarly, for Robertson's problem, we don't bind the weights.
Opening an issue for a bit more detailed documentation on:
- slope
- dydt
- clamping weights, scaling weights
- g_norm or gradient in the training process (like we don't bother much in simple case 1)
- tanh vs gelu? Comments on different activations in different cases