NNPDF / nnpdf

An open-source machine learning framework for global analyses of parton distributions.
https://docs.nnpdf.science/
GNU General Public License v3.0

NaNs in hyperopt #2015

Open APJansen opened 5 months ago

APJansen commented 5 months ago

I created a small script that gets the parameters from a trial and creates a runcard from them here (where "I" means GPT-4).
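The core of such a script could look roughly like this. This is only a sketch, not the actual script: the function name is made up, and it assumes `tries.json` is a list of hyperopt trial entries shaped like the dump further down, with the resolved hyperparameters under `misc.space_vals`.

```python
import json

def trial_params(tries_path, tid):
    """Return the sampled hyperparameters for one trial.

    Hypothetical helper: assumes tries.json is a JSON list of hyperopt
    trial entries, each carrying its resolved values in misc.space_vals.
    """
    with open(tries_path) as f:
        trials = json.load(f)
    for trial in trials:
        if trial["tid"] == tid:
            return trial["misc"]["space_vals"]
    raise KeyError(f"trial {tid} not found in {tries_path}")
```

The returned dict (e.g. `nodes_per_layer`, `optimizer`, `epochs`) can then be merged into a runcard template.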

Side issue, bug?

Regardless of whether I use the script or do it manually, there is a small issue: in the parameters reported in tries.json I always see only a single value for activation_per_layer, regardless of the number of layers. I'm not sure if that's a bug, or if it's intended to mean that all layers share this same activation (apart from the last, which I think is always linear). Even if it's the latter, the runcard doesn't pass the checks if I leave it like this.

Example: trial 28

In this example there were only 2 layers, so an activation of tanh could only mean [tanh, linear].
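Under that interpretation, expanding the single reported value into the per-layer list a runcard expects would be a one-liner. A sketch of the assumed convention (all hidden layers share the activation, output layer linear; the function name is hypothetical):

```python
def expand_activation(activation, n_layers):
    """Expand one activation name into a per-layer list, assuming all
    hidden layers share it and the final layer is always linear."""
    if n_layers < 1:
        raise ValueError("need at least one layer")
    return [activation] * (n_layers - 1) + ["linear"]
```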

This is an example with a NaN loss; it seems that in this case something goes wrong in some of the folds.

I'm running this now, but without hyperopt and therefore without folds, so I realise I won't reproduce the issue this way.

The entry in the tries.json is:

trial_28.json ``` { "_id": "65ec6c12cdc17306b5a546b5", "state": 2, "tid": 28, "spec": null, "result": { "status": "ok", "loss": null, "validation_loss": null, "experimental_loss": null, "kfold_meta": { "validation_losses": [ [ null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null ], [ null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null ], [ 12.14753532409668, 622.563720703125, 15.752861022949219, 22.73064613342285, 13.940808296203613, 12.795502662658691, 485.9554138183594, 14.621479034423828, 25.957843780517578, 11.59742546081543, 21.798128128051758, 13.857935905456543, 16.326919555664062, 20.212799072265625, 8.943603515625, 12.832194328308105, 86.92933654785156, 12.584427833557129, 28.086872100830078, 11.063165664672852, 11.780959129333496, 23.072505950927734, 13.89393424987793, 18.552104949951172, 13.357744216918945, 12.194093704223633, 13.462635040283203, 9.592662811279297, 
18.079463958740234, 9.498067855834961, 170.617431640625, 1038.8602294921875, 18.29841423034668, 15.70957088470459, 12.098176002502441, 19.777820587158203, 14.42442798614502, 10.0394868850708, 47.44027328491211, 8.413172721862793, 11.434549331665039, 1968.7259521484375, 15.88143253326416, 2881.301025390625, 13.021681785583496, 8.430719375610352, 200.9595947265625, 12.42056941986084, 41.56287384033203, 19.01072883605957, 12.09191608428955, 17.90202522277832, 14.396261215209961, 11.48803424835205, 24.40220832824707, 11.868151664733887, 10.247714042663574, 17.367982864379883, 13.672079086303711, 10.491067886352539, 13.204983711242676, 270.6560974121094, 1698.0059814453125, 11.53370475769043, 12.530192375183105, 9.645076751708984, 22.128299713134766, 17.130889892578125, 8.982980728149414, 10.918508529663086, 11.656984329223633, 13.176027297973633, 72.55720520019531, 14.917561531066895, 17.282772064208984, 14.455068588256836, 10.749722480773926, 13.771191596984863, 12.240583419799805, 11.156545639038086, 23.284114837646484, 12.345758438110352, 11.330618858337402, 13.840761184692383, 10.85977554321289, 10.69460391998291, 16.99068260192871, 10.387388229370117, 19.446277618408203, 10.275043487548828, 11.654125213623047, 18.474218368530273, 17.532350540161133, 15.129594802856445, 15.477754592895508, 19.633609771728516, 22.094167709350586, 16.409204483032227, 9.5444917678833, 13.123198509216309 ] ], "trvl_losses_phi": [ null, null, 38.75936293123688 ], "experimental_losses": [ [ null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 
null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null ], [ null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null ], [ 9.631409645080566, 14.65926742553711, 14.765120506286621, 22.81585121154785, 18.704925537109375, 11.019681930541992, 24401.271484375, 16.35789680480957, 30.54192352294922, 9.4488525390625, 16.245166778564453, 12.085138320922852, 15.265380859375, 14.113433837890625, 9.736549377441406, 9.431405067443848, 99.01387023925781, 11.187447547912598, 23.340747833251953, 7.108949184417725, 10.156389236450195, 30.652923583984375, 16.785707473754883, 20.611572265625, 13.525415420532227, 9.401630401611328, 10.916632652282715, 6.493870258331299, 14.7423095703125, 9.452704429626465, 8916.083984375, 3220.8310546875, 12.99640941619873, 12.243772506713867, 8.79442310333252, 23.920921325683594, 13.531204223632812, 7.656867027282715, 48.91118621826172, 9.483509063720703, 10.30264663696289, 15796.875, 11.896005630493164, 38.02886962890625, 10.020211219787598, 6.9951629638671875, 21.850326538085938, 8.467232704162598, 51.03904342651367, 11.28868293762207, 10.1692533493042, 18.222007751464844, 13.1695556640625, 8.790900230407715, 41.1088752746582, 11.083891868591309, 11.740309715270996, 12.803691864013672, 9.846954345703125, 6.113747596740723, 8.874459266662598, 706.6709594726562, 28218322.0, 8.474566459655762, 10.354934692382812, 9.65860652923584, 23.21355628967285, 14.157861709594727, 
7.734203338623047, 9.123676300048828, 7.600849628448486, 11.874972343444824, 718.5275268554688, 13.111132621765137, 12.585481643676758, 11.613744735717773, 6.636402606964111, 8.746490478515625, 7.64871072769165, 9.85733413696289, 17.32428741455078, 10.904479026794434, 8.159088134765625, 14.700692176818848, 9.440145492553711, 7.42826509475708, 17.706758499145508, 9.940926551818848, 24.914688110351562, 6.253153324127197, 12.495806694030762, 18.51062774658203, 16.331514358520508, 11.867968559265137, 10.207791328430176, 19.361270904541016, 18.218006134033203, 14.038684844970703, 8.058456420898438, 8.859779357910156 ] ], "hyper_losses": [ [ null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null ], [ null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null ], [ 9.631409645080566, 14.65926742553711, 14.765120506286621, 22.81585121154785, 18.704925537109375, 11.019681930541992, 24401.271484375, 
16.35789680480957, 30.54192352294922, 9.4488525390625, 16.245166778564453, 12.085138320922852, 15.265380859375, 14.113433837890625, 9.736549377441406, 9.431405067443848, 99.01387023925781, 11.187447547912598, 23.340747833251953, 7.108949184417725, 10.156389236450195, 30.652923583984375, 16.785707473754883, 20.611572265625, 13.525415420532227, 9.401630401611328, 10.916632652282715, 6.493870258331299, 14.7423095703125, 9.452704429626465, 8916.083984375, 3220.8310546875, 12.99640941619873, 12.243772506713867, 8.79442310333252, 23.920921325683594, 13.531204223632812, 7.656867027282715, 48.91118621826172, 9.483509063720703, 10.30264663696289, 15796.875, 11.896005630493164, 38.02886962890625, 10.020211219787598, 6.9951629638671875, 21.850326538085938, 8.467232704162598, 51.03904342651367, 11.28868293762207, 10.1692533493042, 18.222007751464844, 13.1695556640625, 8.790900230407715, 41.1088752746582, 11.083891868591309, 11.740309715270996, 12.803691864013672, 9.846954345703125, 6.113747596740723, 8.874459266662598, 706.6709594726562, 28218322.0, 8.474566459655762, 10.354934692382812, 9.65860652923584, 23.21355628967285, 14.157861709594727, 7.734203338623047, 9.123676300048828, 7.600849628448486, 11.874972343444824, 718.5275268554688, 13.111132621765137, 12.585481643676758, 11.613744735717773, 6.636402606964111, 8.746490478515625, 7.64871072769165, 9.85733413696289, 17.32428741455078, 10.904479026794434, 8.159088134765625, 14.700692176818848, 9.440145492553711, 7.42826509475708, 17.706758499145508, 9.940926551818848, 24.914688110351562, 6.253153324127197, 12.495806694030762, 18.51062774658203, 16.331514358520508, 11.867968559265137, 10.207791328430176, 19.361270904541016, 18.218006134033203, 14.038684844970703, 8.058456420898438, 8.859779357910156 ] ], "hyper_losses_phi": [ null, null, 396.61186935111425 ], "penalties": { "saturation": [ [ null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 
null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null ], [ null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null ], [ 2.656452720743435, 2.249868713454709, 3.460148973652974, 2.5260735502132126, 8.470907919082244, 1.6593138110891172, 3.7240513480015442, 8.038696890402889, 4.5444754748274505, 3.563270434807783, 3.7467510736960374, 3.6940533026299294, 7.929168726902323, 3.892730539625497, 1.7669792133609266, 7.179103624182386, 4.131493868852873, 2.441749608692859, 1.076389435456914, 1.42457110471527, 2.3585938773041617, 2.0813801407587738, 3.2614185887920843, 4.60568097490542, 4.343469451914197, 4.20434783097259, 3.546487762593795, 2.973756548448712, 8.242689710085843, 4.247226755602139, 20.216802089208148, 3.2183613123048094, 8.518471686376078, 1.702957497812683, 3.3400091159831646, 6.146248498185259, 5.866374523786929, 1.6764654631239282, 5.112705463604855, 1.999221971310929, 3.7193412462183697, 6.185707114643248, 8.803578482470403, 12.474585604276028, 5.1580703384345155, 3.495984115718236, 5.074407949986764, 
4.313428859615183, 1.2995852554394078, 3.592075168097778, 11.149528321343743, 4.07579561114684, 3.4075703338036583, 3.56782547983067, 2.909606493835124, 2.0209598148033967, 1.624241943059869, 3.9396620448150257, 3.26786771366886, 3.864827440838045, 7.300091545825602, 4.785786568126305, 2.5495437344121736, 3.240279007403567, 5.164385378324037, 1.6742697373552722, 2.0716361099129386, 3.0229719030964954, 3.278816181585102, 3.3674583655389574, 3.737867267618177, 12.283305807394688, 3.01337533331514, 4.448669309358695, 3.7443274451475235, 2.7275068789200505, 6.856379346358085, 2.2148164849944143, 2.8342377484971792, 3.2935560123086725, 1.429871937174859, 3.00909963057026, 7.050622830772754, 2.2437751555556886, 10.290600688618103, 2.9732550756312732, 2.5412259712469645, 6.3067344764985656, 1.376964008697831, 2.2924383988336774, 4.480360846259685, 2.560406944097767, 4.292233686515102, 3.554847270642838, 3.2680043348751213, 6.125173712584803, 1.7976846787349947, 1.6517489570475106, 2.4071141949347563, 1.3218436138544791 ] ], "patience": [ [ null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null ], [ null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 
null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null ], [ 19.55881921053845, 1002.3935667112833, 25.363775661176202, 36.59874916193362, 22.446178738278096, 20.602115294158654, 782.4397155200273, 23.54213076110071, 41.79487937700625, 18.673084032595916, 35.097296349407515, 22.31274540674855, 26.288070734874335, 32.544748563864005, 14.400149461253237, 20.661192764264612, 139.96544420486507, 20.26226245058606, 45.222840597131786, 17.81286914246232, 18.968592688946227, 37.14917969681439, 22.37070655620246, 29.87085503437409, 21.507383780237017, 19.633783136581307, 21.676269121690314, 15.445203711512503, 29.10985295025199, 15.263866892684733, 274.7121461755074, 1672.6750630105405, 29.462386090859855, 25.29407395091681, 19.47934547190264, 31.844387105864076, 23.224857690827182, 16.164637822633658, 76.38386748606817, 13.546099670731726, 18.410836203816565, 3169.8574192892793, 25.570789418188088, 4639.199997629558, 20.966287651468786, 13.574351642777252, 323.5662442638088, 19.998433039589912, 66.9206315211043, 30.60928809381414, 19.469266340183154, 28.824157781229413, 23.17950620463964, 18.496952587762934, 39.29014143991154, 19.108981911196256, 16.499905613294246, 27.964292988906912, 22.013496231679675, 16.891731091130925, 21.26142317725727, 435.7849996955645, 2733.969576826188, 18.570486947743813, 20.17493761498472, 11.940638759679912, 35.6289075914251, 27.58254817830644, 27.683366815915562, 17.57995582501281, 18.768980122502278, 21.214800283229426, 116.82478964415694, 24.018854958128493, 27.827094570391356, 23.274192307374314, 17.308192399148503, 22.17307787735473, 19.708636505420664, 17.96322079716147, 37.48989242974758, 19.87797943085513, 18.243497127695615, 22.28509228622201, 17.48538949258193, 17.21944567506876, 27.3567995818507, 16.72479585590611, 
31.310567790809838, 16.543908915805723, 18.764376643389692, 29.74544938423234, 28.228942365829052, 24.360250996, 24.920825154302307, 31.612192410237345, 35.57395144818311, 26.420558188107233, 19.79173071603307, 21.12974033478321 ] ], "integrability": [ [ null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null ], [ null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null ], [ 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9194395160552102, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 89.36382512376943, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9475459004730196, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
0.0, 0.0 ] ] } } }, "misc": { "tid": 28, "cmd": [ "domain_attachment", "FMinIter_Domain" ], "workdir": null, "idxs": { "Adadelta_clipnorm": [], "Adadelta_learning_rate": [], "Adam_clipnorm": [], "Adam_learning_rate": [], "Amsgrad_clipnorm": [ 28 ], "Amsgrad_learning_rate": [ 28 ], "Nadam_clipnorm": [], "Nadam_learning_rate": [], "activation_per_layer": [ 28 ], "dropout": [ 28 ], "epochs": [ 28 ], "initial": [ 28 ], "initializer": [ 28 ], "nl1:-0/1": [ 28 ], "nl2:-0/2": [], "nl2:-1/2": [], "nl3:-0/3": [], "nl3:-1/3": [], "nl3:-2/3": [], "nl4:-0/4": [], "nl4:-1/4": [], "nl4:-2/4": [], "nl4:-3/4": [], "nodes_per_layer": [ 28 ], "optimizer": [ 28 ], "stopping_patience": [ 28 ] }, "vals": { "Adadelta_clipnorm": [], "Adadelta_learning_rate": [], "Adam_clipnorm": [], "Adam_learning_rate": [], "Amsgrad_clipnorm": [ 0.000009949736889655482 ], "Amsgrad_learning_rate": [ 0.001125607109055478 ], "Nadam_clipnorm": [], "Nadam_learning_rate": [], "activation_per_layer": [ 0 ], "dropout": [ 0.4 ], "epochs": [ 39703.0 ], "initial": [ 52.261799207196255 ], "initializer": [ 1 ], "nl1:-0/1": [ 32.0 ], "nl2:-0/2": [], "nl2:-1/2": [], "nl3:-0/3": [], "nl3:-1/3": [], "nl3:-2/3": [], "nl4:-0/4": [], "nl4:-1/4": [], "nl4:-2/4": [], "nl4:-3/4": [], "nodes_per_layer": [ 0 ], "optimizer": [ 1 ], "stopping_patience": [ 0.11999999999999998 ] }, "space_vals": { "activation_per_layer": "tanh", "dropout": 0.4, "epochs": 39703, "initializer": "glorot_uniform", "integrability": { "initial": 10, "multiplier": null }, "layer_type": "dense", "nodes_per_layer": [ 32, 8 ], "optimizer": { "clipnorm": 0.000009949736889655482, "learning_rate": 0.001125607109055478, "optimizer_name": "Amsgrad" }, "positivity": { "initial": 52.261799207196255 }, "stopping_patience": 0.11999999999999998 } }, "exp_key": null, "owner": [ "gcn15.local.snellius.surf.nl:3834410" ], "version": 3, "book_time": "2024-03-09 14:56:57.782000", "refresh_time": "2024-03-09 17:02:20.961000" } ```

Question

Is there an easy way to reproduce this trial exactly, i.e. using k-folding, either without hyperopt or with hyperopt but fixing the parameters to those of the first trial? I know that just setting, for instance, the epoch range to min 100 and max 100 will fail the checks.

RoyStegeman commented 5 months ago

It indeed assumes all layers have the same activation.

I'm not sure I understand your last question. Setting the epoch range to min 100 and max 100 will indeed fail the checks, but the number of epochs of this trial is 39703, so setting that as both the upper and lower bound should be fine. I guess that where it says "epochs": [28] this is just the tid being printed for whatever reason, and it's not actually being run with that setting.

To reproduce it exactly you could indeed fix the ranges of parameters in the hyperopt runcard such that they are limited to a specific value, setting them equal to the values of this trial.

Alternatively, I suppose you could do a regular fit but with the datasets you would fit in the hyperopt case. This will of course skip the hyperopt-specific computations, but if you don't get the null's in that case you'd at least know it's caused by something specific to running hyperopt and not the hyperparameters+dataset.

APJansen commented 5 months ago

I guess that where it says "epochs": [28] this is just the tid being printed for whatever reason, and it's not actually being run with that setting.

Oh I hadn't even noticed that, no idea what that's about.

To reproduce it exactly you could indeed fix the ranges of parameters in the hyperopt runcard such that they are limited to a specific value, setting them equal to the values of this trial.

My point was that I think setting min and max equal for any parameter (epochs was just an example) won't work. I've tried this before with epochs, and it fails one of the n3fit checks. Also, for the layers for example, I want them to be 32, 8, but I can only set a number of layers and an overall min/max.

Can I just remove parameters from the hyperopt_config completely? I mean, will it take them from parameters instead? In that case I could leave only, say, the epochs, and set the min to whatever was used in the trial and the max to 1 above that.

APJansen commented 5 months ago

Another requirement to reproduce a trial is of course the hyperopt seed. I remember we discussed this at some point, but I guess we haven't gotten around to implementing it so that it is fully reproducible; I see the hyperopt seed is just set to 42. Or am I forgetting something @Cmurilochem?

RoyStegeman commented 5 months ago

I want it to be 32, 8 but can only set a number of layers and an overall min/max.

That's a good point. At some point I do remember running the same configuration many times in a hyperopt setup because I wanted to get a feel for the statistical fluctuations in the hyperopt loss of "good" configurations. I thought that was done by fixing those things in the runcard, but you're right, that is not possible.

Can I just remove parameters from the hyperopt_config completely? I mean will it take them from parameters instead then?

I don't remember if this is an option. If not, you could still overwrite the sampling done to get the layer width by hardcoding instead the settings you want. Since the purpose is to understand where the NaNs are coming from, I don't think we should care too much about being able to reproduce this later on.

Cmurilochem commented 5 months ago

Another requirement to reproduce a trial is of course the hyperopt seed. I remember we discussed this at some point, but I guess we haven't gotten to implementing it so that it is fully reproducible, I see the hyperopt seed is just set to 42. Or am I forgetting something @Cmurilochem?

Hi @APJansen. You are right. We have fixed this to 42.

APJansen commented 5 months ago

The approach I mentioned, reducing the search space to just the epochs and having it pick up the rest from the non-hyperopt settings in the runcard, works. The scripts runtrial.slurm along with create_trial_runcard.py here automate rerunning a single trial in hyperopt mode (with the only exception that the seeds are different, I think, so it's not a full reproduction).

What happens in this example of trial 28 is that in folds 1 and 2 the warning "Nan found, stopping activated" is triggered, after 12k and 16k epochs respectively. The loss reported just before is about 1e10, so it is plausible that the training has diverged, although it's not as if the loss was smaller further back.

These warnings correspond exactly (at least in their counts in the log of the 5-day hyperopt run) to the instances of "Fold <..> finished, loss=nan (...) pass=True".

So I would say that the information that the training blew up is not properly passed on. Ideally we'd stop the trial as soon as one fold hits this, rather than continuing with the remaining folds.
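The kind of guard I have in mind could look something like the following. This is only a sketch with hypothetical names (it is not the actual n3fit fold loop); the point is that a NaN in any fold should mark the trial as failed instead of reporting loss=nan with pass=True.

```python
import math

def run_kfold(folds, train_fold):
    """Run the folds in order, aborting the whole trial as soon as
    one fold produces a NaN (or missing) loss."""
    losses = []
    for k, fold in enumerate(folds):
        loss = train_fold(fold)
        if loss is None or math.isnan(loss):
            # Propagate the failure up instead of letting hyperopt
            # average over NaNs and report the fold as passed.
            return {"status": "fail", "failed_fold": k}
        losses.append(loss)
    return {"status": "ok", "loss": sum(losses) / len(losses)}
```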

Can someone look into this? (Our time is really limited now unfortunately)

Radonirinaunimi commented 5 months ago

Can someone look into this? (Our time is really limited now unfortunately)

Hi @APJansen, thanks a lot for the investigation! I can have a look at this but unfortunately this won't be before the end of this week. Are you relying on this to produce more samples?

APJansen commented 5 months ago

Yes @Radonirinaunimi, we haven't been running anything for the past 2 weeks. I spent some hours on this and PR #2014 today. The issue here, I think, is around this line. I've added some monitoring and am rerunning this trial now from PR #2014. If we can decide what criterion to use here and get #2012 and #2016 merged tomorrow, we can hopefully start running again.