Thanks for organizing those tricks.
slope, for example, https://github.com/DENG-MIT/CRNN/blob/e7242f6907634299a8e88ffad1b4e466eb764e80/case2/case2.jl#L99
The general guideline is that, if the activation energies or the prefactor logA are far away from zero (compared to unity), it is recommended to have a slope to rescale the weights. This is because those weights are usually initialized from Gaussian distributions. We should try our best to keep the weights close to a Gaussian distribution as well, to make the optimization easier.
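A minimal sketch of the idea (not the exact code from case2.jl; nr and the value of slope here are just illustrative):

```julia
nr = 4                          # hypothetical number of reactions
slope = 10.0                    # illustrative rescaling factor
p = randn(nr)                   # raw trainable parameters stay roughly N(0, 1)
logA = p .* slope               # effective log-prefactors can now reach O(10)
```

The optimizer only ever sees p near the unit scale, while the physical parameters can still be large.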
dydt, for example, https://github.com/DENG-MIT/CRNN/blob/e7242f6907634299a8e88ffad1b4e466eb764e80/case3/case3.jl#L165
It is useful if the species concentrations for different species span several orders of magnitude. Similar to the intuition for the slope, it is useful to scale the NN output to be close to a Gaussian distribution. A natural choice of the scaling is the maximum concentration (or range) / t_end.
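A minimal sketch of this scaling, assuming a species-by-time data matrix ydata (all names and sizes are illustrative, not the exact code in case3.jl):

```julia
ns, nt = 3, 50
ydata = abs.(randn(ns, nt)) .* [1.0, 1e-3, 1e-6]   # species spanning orders of magnitude
t_end = 10.0
dydt_scale = maximum(ydata, dims = 2) ./ t_end     # per-species scale: max concentration / t_end
nn_output = randn(ns)                              # raw, O(1) network output for one state
dydt = vec(dydt_scale) .* nn_output                # rescaled derivative passed to the ODE solver
```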
clamping weights to [-2, 2]
To have some flexibility, and to keep the loss curves smooth, you can relax it a little bit, say to [-3, 3] or [-4, 4]. For catalysis reactions, the input and output weights only share signs but not values.
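For instance, a toy sketch of the clamping itself (the bounds only illustrate the tight vs. relaxed choice):

```julia
w = randn(5, 4) .* 3.0               # hypothetical weight matrix
w_tight   = clamp.(w, -2.0, 2.0)     # hard box [-2, 2]
w_relaxed = clamp.(w, -3.0, 3.0)     # relaxed box for more flexibility
```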
scaling loss by the standard deviation
Use it when the species concentrations span several orders of magnitude.
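A minimal sketch, assuming the training data is a species-by-time matrix (names are illustrative):

```julia
using Statistics

ydata = abs.(randn(3, 50)) .* [1.0, 1e-3, 1e-6]   # species spanning orders of magnitude
ypred = ydata .+ 1e-4 .* randn(3, 50)             # hypothetical model prediction
ystd  = std(ydata, dims = 2) .+ 1e-8              # per-species scale (offset avoids division by zero)
loss  = mean(abs.((ypred .- ydata) ./ ystd))      # each species now contributes comparably
```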
sensealg for different cases (stiff and non-stiff)
I would suggest trying different ones and profiling them. This depends on a lot of factors, and Chris has a nice paper on it: https://arxiv.org/pdf/1812.01892.pdf
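For concreteness, a hedged sketch of what swapping sensealg can look like with the DiffEqSensitivity / DiffEqFlux stack (treat these as starting points to profile, not recommendations):

```julia
using DiffEqSensitivity

# often tried for stiff problems
sensealg_stiff = QuadratureAdjoint(autojacvec = ReverseDiffVJP(true))
# often tried for non-stiff problems
sensealg_nonstiff = InterpolatingAdjoint(autojacvec = ZygoteVJP())

# then pass the choice into the solve call inside the prediction function, e.g.
# sol = solve(prob, solver; p = p, saveat = tsteps, sensealg = sensealg_stiff)
```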
when to bother about g_norm or gradient in the training process (like we don't bother much in simple case 1)
Use it when you see the gradient fluctuating a lot. It is hard to say in advance when that happens. Gradient clipping is a fairly common practice in training RNNs and neural ODEs.
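A minimal, generic sketch of gradient-norm clipping before the optimizer update (not a snippet from this repo):

```julia
using LinearAlgebra

function clip_gradient!(grad; clip = 1.0e2)
    g_norm = norm(grad)            # also useful to log for monitoring
    if g_norm > clip
        grad .*= clip / g_norm     # rescale in place so the norm equals clip
    end
    return g_norm
end

grad = randn(20) .* 1.0e3
g_norm = clip_gradient!(grad)      # clip, then hand grad to the optimizer
```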
Thanks @jiweiqi , this is a nice guideline. Forgot another thing to ask that I had in mind.
- Loss function for your trial codes was MSE-based, whereas now I think you have adopted an MAE measure. Any particular reason?
This is a good question. It seems that MAE is preferred for neural ODEs, although I don't know the exact reason. I noticed it when I tried the ODE demo code in the PyTorch package torchdiffeq. To me, the intuition is that the error accumulates over time, so the error can be substantially larger at a later phase. But we want the loss at the earlier phase to participate well in the training, so it is better to use MAE, since MSE will focus on the large errors. But you know, those kinds of things are heuristic and we should always try both.
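A tiny made-up illustration of the difference between the two losses:

```julia
using Statistics

ydata = [1.0, 1.1, 5.0]            # last point carries a large late-phase error
ypred = [1.0, 1.0, 3.0]
mae = mean(abs.(ypred .- ydata))   # 0.7  : every point contributes proportionally
mse = mean(abs2.(ypred .- ydata))  # ~1.34: dominated by the single large error
```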
I made up a logic of my own by creating an analogy with regularization (which is also often part of the loss function). L1 is analogous to MAE, whereas L2 is analogous to MSE. So just as we tend to use L1 methods for inducing sparsity or taking care of outliers, I think MAE does a similar job. MSE, on the other hand, tends to induce a bias towards outliers.
This logic is of course based on the assumption that we have few outliers, and hence are using an L1 (MAE) type of metric.
Is (w_out .* dydt_scale)' == w_out' .* dydt_scale' ?
Case 3 and Robertson's case seem to suggest so when printing w_out_scale.
I think so since it is an elementwise product.
For a normal matrix product, it's (A x B)' = B' x A'. I do not know any rule for the element-wise product though. Besides, w_out is a matrix and dydt_scale is a vector, so I got confused about how it even evaluates. But it does evaluate, so I am guessing it's not a matrix-vector product; it must be a matrix-matrix element-wise product, and the rule seems to hold.
Yeah, it is certainly not a matrix-vector product; instead, it broadcasts in a certain way. I am always confused about it. It is a good practice to check in the terminal as you did :)
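For example, a quick REPL check with illustrative sizes:

```julia
w_out = randn(3, 4)            # 3 species x 4 reactions (hypothetical sizes)
dydt_scale = randn(3)          # one scale per species
lhs = (w_out .* dydt_scale)'   # scale each row, then transpose -> 4 x 3
rhs = w_out' .* dydt_scale'    # transpose, then scale each column -> 4 x 3
lhs == rhs                     # true: both give w_out[j, i] * dydt_scale[j]
```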
w_out_ = (w_out' .* dydt_scale') .* exp.(w_b) # scaling w_out
display(w_out_)
display(maximum(abs.(w_out_), dims=2)') # extracting maximum of absolute values from each row
display(w_out_ ./ maximum(abs.(w_out_), dims=2)) # writing off 1's in place of maximum
Am I correct with these comments ?
Shouldn't maximum(abs.(w_out_), dims=2)' also be used in scaling w_b ?
I think you're right, unless you see results that look strange; then we might come back to check the formula.
For catalysis reactions, the input and output weights only share signs but not values.
@jiweiqi Our w_in demands a value, but w_out demands it to be zero for a catalytic reaction. Any guidelines for clamping weights for catalysis reaction systems?
I don't think the training is sensitive to the boundary of the clamping for w_out. For w_in, we shall take care that it is not too large, which would induce strong stiffness and also not be realistic. Most reactions are no more than second order, i.e., bimolecular reactions. Third-order (termolecular) reactions are also possible, although less common. I think 2.5 would be a good choice of upper bound for w_in.
Oh no, I didn't mean to ask about training sensitivity. I understood it a bit better after re-reading your paper and code now. I was referring to this:
#case1 (also case 2)
w_in = clamp.(-w_out, 0, 2.5);
but in case 3
w_in = clamp.(w_in, 0.f0, 4.f0);
This refers to the quote "except that the sharing parameter between input weights and output weights are relaxed since the stoichiometric coefficients (output weights) for the catalysis could be zero while the reaction orders (input weights) are non-zero" from the paper. Am I correct ?
Yes, you are right. For case 3, we don't bind w_in and w_out as we did for case 1 and case 2. Similarly, for Robertson's problem, we don't bind the weights.
Opening an issue for a bit more detailed documentation on:
- slope
- dydt
- clamping weights, scaling weights
- g_norm or gradient in the training process (like we don't bother much in simple case 1)
- tanh vs gelu? Comments on different activations in different cases