DENG-MIT / CRNN

Chemical Reaction Neural Network
https://arxiv.org/abs/2002.09062
MIT License

Guidelines for bounds on clamping weights , scaling derivatives and slope #1

Closed · yewalenikhil65 closed this issue 3 years ago

yewalenikhil65 commented 3 years ago

Opening this issue for a bit more detailed documentation on:

  • bounds for clamping the weights and scaling the derivatives/slope
  • scaling the loss by the standard deviation
  • sensealg for different cases (stiff and non-stiff)
  • when to bother about g_norm or the gradient in the training process (like we don't bother much in simple case 1)

jiweiqi commented 3 years ago

Thanks for organizing those tricks.

jiweiqi commented 3 years ago

For catalysis reactions, the input and output weights only share signs but not values.

jiweiqi commented 3 years ago

> scaling the loss by the standard deviation

Use it when the species concentrations span several orders of magnitude.
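
Something along these lines, for concreteness (y_pred, y_obs, and their shapes are placeholders, not the repo's actual variable names):

    using Statistics

    # y_obs, y_pred: placeholder arrays of shape (n_species, n_timesteps)
    y_std = std(y_obs, dims=2) .+ 1.0f-8            # per-species standard deviation (offset avoids division by zero)
    loss  = mean(abs, (y_pred .- y_obs) ./ y_std)   # every species now contributes on a comparable scale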

jiweiqi commented 3 years ago

> sensealg for different cases (stiff and non-stiff)

I would suggest trying different ones and profiling them. This depends on a lot of factors, and Chris has a nice paper on it: https://arxiv.org/pdf/1812.01892.pdf
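
For example, a rough sketch of how one might benchmark a few choices (assuming DiffEqSensitivity.jl with OrdinaryDiffEq.jl and Zygote.jl; crnn!, u0, tspan, p, t_obs, y_obs are placeholders defined elsewhere in the training script):

    using OrdinaryDiffEq, DiffEqSensitivity, Zygote

    # Placeholder loss: solve the CRNN ODE and compare against observations.
    function loss(p, sensealg)
        prob = ODEProblem(crnn!, u0, tspan, p)
        sol  = solve(prob, Rosenbrock23(); saveat=t_obs, sensealg=sensealg)
        return sum(abs, Array(sol) .- y_obs)
    end

    # Swapping the sensitivity algorithm is just a keyword change, so profile a few;
    # for stiff systems the adjoint variants with ReverseDiffVJP are often a good starting point.
    for alg in (QuadratureAdjoint(autojacvec=ReverseDiffVJP(true)),
                InterpolatingAdjoint(autojacvec=ReverseDiffVJP(true)),
                ForwardDiffSensitivity())
        @time Zygote.gradient(p -> loss(p, alg), p)
    end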

jiweiqi commented 3 years ago

> when to bother about g_norm or the gradient in the training process (like we don't bother much in simple case 1)

Use it when you see the gradient fluctuating a lot. It is hard to say in advance when that happens. Gradient clipping is a fairly common practice when training RNNs and neural ODEs.
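
A minimal sketch of global-norm clipping, where grad stands for whatever gradient vector the training loop produces and the threshold is arbitrary:

    using LinearAlgebra

    # Rescale the gradient in place whenever its global norm exceeds max_norm;
    # return the norm so it can be logged and watched during training.
    function clip_by_global_norm!(grad, max_norm)
        g_norm = norm(grad)
        if g_norm > max_norm
            grad .*= max_norm / g_norm
        end
        return g_norm
    end

    g_norm = clip_by_global_norm!(grad, 1.0f0)   # a wildly fluctuating g_norm is the cue to clip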

yewalenikhil65 commented 3 years ago

Thanks @jiweiqi , this is a nice guideline. Forgot another thing to ask that I had in mind:

  • The loss function for your trial codes was MSE based, whereas now I think you have adopted the MAE measure. Any particular reason?

jiweiqi commented 3 years ago

> The loss function for your trial codes was MSE based, whereas now I think you have adopted the MAE measure. Any particular reason?

This is a good question. It seems that MAE is preferred for neural ODEs, although I don't fully know the intuition; I noticed it when trying the ODE demo code in the PyTorch package torchdiffeq. My intuition is that the error accumulates over time, so it can be substantially larger in the later phase. But we want the loss from the earlier phase to participate well in the training, so it is better to use MAE, since MSE will focus on the large errors. Those kinds of things are heuristic, though, and we should always try both.
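
To make the contrast concrete, a toy sketch (y_pred and y_obs are placeholder arrays over species and time):

    mae(y_pred, y_obs) = sum(abs,  y_pred .- y_obs) / length(y_obs)   # every residual counts linearly
    mse(y_pred, y_obs) = sum(abs2, y_pred .- y_obs) / length(y_obs)   # the largest (late-time) residuals dominate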

yewalenikhil65 commented 3 years ago

I made up my own logic by drawing an analogy with regularization (which is also often part of the loss function). L1 is analogous to MAE, whereas L2 is analogous to MSE. So just as we tend to use L1 methods for inducing sparsity or handling outliers, I think MAE does a similar job. MSE, on the other hand, tends to be biased towards outliers.

This logic of course assumes that we have few outliers, and hence we use an L1 (MAE) type of metric.

yewalenikhil65 commented 3 years ago

Is (w_out .* dydt_scale)' == w_out' .* dydt_scale' ? Case 3 and Robertson's case seem to suggest so when printing w_out_scale.

jiweiqi commented 3 years ago

> Is (w_out .* dydt_scale)' == w_out' .* dydt_scale' ? Case 3 and Robertson's case seem to suggest so when printing w_out_scale.

I think so since it is an elementwise product.

yewalenikhil65 commented 3 years ago

> I think so since it is an elementwise product.

For a normal matrix product, (A x B)' = B' x A'. I do not know of any such rule for the element-wise product, though. Besides, w_out is a matrix and dydt_scale is a vector, so I was confused about how it even evaluates. But it does evaluate, so I am guessing it is not a matrix-vector product; it must be an element-wise (broadcast) matrix product, and the rule seems to hold.

jiweiqi commented 3 years ago

Yeah, it is certainly not a matrix-vector product; instead, it broadcasts in a certain way. I am always confused about it. It is good practice to check it in the terminal as you did :)
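
A quick check of the identity with dummy shapes (sizes are made up, purely illustrative):

    w_out      = rand(Float32, 5, 4)   # e.g. species × reactions
    dydt_scale = rand(Float32, 5)      # one scale per species

    A = (w_out .* dydt_scale)'         # broadcast the vector down the columns, then transpose
    B = w_out' .* dydt_scale'          # transpose first, then broadcast the row vector across the rows

    @assert A == B                     # entry (j, i) of both is w_out[i, j] * dydt_scale[i]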

yewalenikhil65 commented 3 years ago
    w_out_ = (w_out' .* dydt_scale') .* exp.(w_b)       # scale w_out by dydt_scale and exp.(w_b)
    display(w_out_)
    display(maximum(abs.(w_out_), dims=2)')             # maximum absolute value in each row
    display(w_out_ ./ maximum(abs.(w_out_), dims=2))    # normalize each row so its largest entry becomes ±1

Am I correct with these comments? Shouldn't maximum(abs.(w_out_), dims=2)' also be used when scaling w_b?

jiweiqi commented 3 years ago

I think you're right, unless you see that the results look strange; in that case we might come back and check the formula.

yewalenikhil65 commented 3 years ago

> For catalysis reactions, the input and output weights only share signs but not values.

@jiweiqi Our w_in demands a value, but w_out demands it to be zero for a catalytic reaction. Any guidelines for clamping weights for catalysis reaction systems?

jiweiqi commented 3 years ago

I don't think the training is sensitive to the bounds of the clamping for w_out. For w_in, we should take care that it is not too large, which would induce strong stiffness and also not be realistic. Most reactions are no more than second order, i.e., bimolecular; reactions involving three molecules are possible too, though. I think 2.5 would be a good choice for the upper bound on w_in.

yewalenikhil65 commented 3 years ago

Oh no, I didn't mean to ask about training sensitivity. I understood it slightly better after re-reading your paper and code now. I was referring to this:

# case 1 (also case 2)

    w_in = clamp.(-w_out, 0, 2.5);

but in case 3

    w_in = clamp.(w_in, 0.f0, 4.f0);

This refers to the quote "except that the sharing parameter between input weights and output weights are relaxed since the stoichiometric coefficients (output weights) for the catalysis could be zero while the reaction orders (input weights) are non-zero" from the paper. Am I correct?

jiweiqi commented 3 years ago

Yes, you are right. For case 3, we don't bind w_in and w_out, as we did for case 1 and case 2. Similarly, for Robertson's problem, we don't bind the weights.