Harry-Westwood / Y4-Project-InterNeuralStellar

Bayesian Hierarchical Modelling and Machine Learning of Stellar Populations

How to determine whether to use a bigger or smaller NN architecture #20

Closed HinLeung622 closed 4 years ago

HinLeung622 commented 4 years ago

@grd349 Basically, the main question of this issue is: what should I look for in the training/test results when deciding whether to use a bigger or smaller architecture?

Context: I have been running NN fitting tests on the small grid. The following three fit results were all run under the same NN settings, except for the number of neurons in each layer:

- inputs = mass, age, feh, MLT (feh and MLT do not vary in the small grid)
- outputs = luminosity, Teff, delta nu
- 4 hidden layers
- 0.0001 learning rate
- 0.9999 beta_1
- 0.999 beta_2
- 0.0001 l2 regularizer
- 50k epochs
- all 9211 datapoints, with batch size = 9211
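For concreteness, the setup is roughly the following (a sketch, not the exact script; the activation function is a placeholder since it is not listed above, and `n_neurons` is the only thing that changes between A, B and C):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_emulator(n_neurons):
    """Rough sketch of the NN above: 4 hidden dense layers mapping
    (mass, age, feh, MLT) -> (luminosity, Teff, delta nu)."""
    model = tf.keras.Sequential(
        [layers.Dense(n_neurons, activation="relu",  # activation is a placeholder guess
                      kernel_regularizer=regularizers.l2(0.0001))
         for _ in range(4)]
        + [layers.Dense(3)]  # outputs: luminosity, Teff, delta nu
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(
            learning_rate=0.0001, beta_1=0.9999, beta_2=0.999),
        loss="mae",
    )
    return model

# A, B and C differ only in layer width:
# model = build_emulator(32)    # NN A
# model = build_emulator(128)   # NN B
model = build_emulator(256)     # NN C
# model.fit(X_train, y_train, epochs=50000, batch_size=9211,
#           validation_data=(X_val, y_val))
```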

NN A has 32 neurons in each dense layer, and this is the resulting HR diagram (left = NN, right = grid): [image: HR20]

NN B has 128 neurons per layer, and its HR diagram: [image: HR22]

NN C has 256 neurons per layer, and its HR diagram: [image: HR23]

The HR diagram of NN A is quite jagged in regions where it should be straight, while the larger architectures B and C are able to smooth that part out. Overall, I would say C matches the shape of the true grid HR diagram on the right most closely.
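(The side-by-side panels are produced roughly like this; `X_grid`, `y_grid` and `model` are illustrative names for the grid inputs/outputs and the trained NN, and whether the axes are in log units depends on how the grid outputs are stored.)

```python
import matplotlib.pyplot as plt

y_pred = model.predict(X_grid)  # NN predictions at the same input points as the grid

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
for ax, y, title in zip(axes, [y_pred, y_grid], ["NN", "grid"]):
    ax.scatter(y[:, 1], y[:, 0], s=1)  # Teff on x, luminosity on y
    ax.set_xlabel("Teff")
    ax.set_title(title)
axes[0].set_ylabel("luminosity")
axes[0].invert_xaxis()  # HR diagram convention: hotter stars on the left (x is shared)
plt.show()
```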

Here is a table documenting some numerical results from the three NNs:

| NN | Run time (hr) | Final learning loss | Evaluation loss | L dex | Teff dex | delnu dex |
|----|---------------|---------------------|-----------------|-------|----------|-----------|
| A  | 1.5 | 0.00999  | 0.0064 | -       | -        | -       |
| B  | 5   | 0.00444  | 0.0029 | 0.0155  | 0.0043   | 0.01129 |
| C  | 9   | 0.003845 | 0.0022 | 0.01348 | 0.003336 | 0.01015 |

(Both losses are MAE, and the run times are only approximate. The function for calculating dex had not been written when model A was run, so there is no data there.) Judging from just the resulting losses and dex values, it seems that the more complex the architecture, the more accurate the result. Judging from the history plots (learning loss and validation loss vs epoch), the validation loss closely follows the drop in learning loss over the epochs in all three cases, which argues against overfitting. So, does this mean I should keep increasing the complexity of the architecture, since I have not yet hit the point where it is just big enough but not so big that it overfits? I also don't know how much I should be concerned about run time.
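(For reference, the dex values in the table are, roughly, the mean absolute error in log10 units for each output; a sketch of the calculation, assuming the outputs are stored in linear rather than already-logged units:)

```python
import numpy as np

def mae_dex(y_true, y_pred):
    """Mean absolute error in dex (log10 units) for each output column.
    Assumes both arrays hold linear (not already logged) values."""
    return np.mean(np.abs(np.log10(y_pred) - np.log10(y_true)), axis=0)

# e.g. for NN C (illustrative names for the grid inputs/outputs):
# L_dex, Teff_dex, delnu_dex = mae_dex(y_grid, model.predict(X_grid))
```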

It should also be noted that the dex values for luminosity and delta nu in both fits B and C (and surely A too, had they been measured) are still not low enough to hit the 0.005 dex target set two weeks ago. But since, judging from the history plots, the MAE was still steadily declining (not flattened off), I believe more training can bring the dex down to perhaps under 0.005.
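(The history plots I refer to are just the History object returned by Keras `fit`, plotted something like this; the log scale is my choice to make it easier to see whether the decline has flattened off:)

```python
import matplotlib.pyplot as plt

# history = model.fit(...)  -- the History object returned by Keras
plt.plot(history.history["loss"], label="learning loss (MAE)")
plt.plot(history.history["val_loss"], label="validation loss (MAE)")
plt.xlabel("epoch")
plt.ylabel("MAE")
plt.yscale("log")  # easier to judge whether the loss is still declining
plt.legend()
plt.show()
```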

grd349 commented 4 years ago

Hi @HinLeung622

I agree with pretty much everything you say! It looks like C is doing best, and it looks like (from what you have said) all three could do with training for longer. I have some suggestions, but we can discuss them here now if you need to.

Having said all of that - the aim here is for you to get a feel for what is going on, what helps, and what doesn't. And in that sense I feel like you are making excellent progress. How do you feel the 'understanding' is coming along?

@Harry-Westwood - how are you getting on? Does the above make any sense?

HinLeung622 commented 4 years ago

@grd349 Yeah, I do think I have gained some understanding of what each parameter that goes into the NN does to the result, and I should be able to adjust them according to whatever problems I run into when we are doing the real deal. OK, I will implement those suggested changes independently, one by one, but since I will be running bigger architectures, would you be able to run them on the GPU for me even though you are at home?

grd349 commented 4 years ago

Yes - I can run these for you!