Closed richardt94 closed 2 years ago
This should be ready for merge now.
Or not - still some issues with the nested nonstationary GP code.
Works beautifully for the most part! One niggling issue is with restarts for the McMC. Especially now that we are writing to a single file, it makes sense for us to checkpoint our jobs and restart smaller runs, this will also allow for quicker job execution in the PBS queue.
I think the error is that restarts require a new history function you've written, around here: https://github.com/GeoscienceAustralia/HiQGA.jl/blob/132fdcb8ad561204d2cef091dc2f8b149a455b58/src/TransD_GP_MCMC.jl#L445 and perhaps here https://github.com/GeoscienceAustralia/HiQGA.jl/blob/132fdcb8ad561204d2cef091dc2f8b149a455b58/src/TransD_GP_MCMC.jl#L390 I guess for the second hiccup above, I'd like a history function instead of a direct readdlm for nuisances, which was my original TODO.
Works beautifully for the most part! One niggling issue is with restarts for the McMC. Especially now that we are writing to a single file, it makes sense for us to checkpoint our jobs and restart smaller runs, this will also allow for quicker job execution in the PBS queue.
I think the error is that restarts require a new history function you've written, around here:
and perhaps here
I guess for the second hiccup above, I'd like a history function instead of a direct readdlm for nuisances, which was my original TODO.
Is there a specific script from examples (or something you've tried on NCI) that I can use to debug? I will definitely need to make the restarts compatible with one file, but I could use the example to formulate a test for our CI as well.
So the way I check is, run any example (Line is my favorite because it is fast and checks stat and nonstat) for say 1_001 iterations. Then, run main()
again but with opt.history_mode = "a"
and optlog10lambda = "a"
after the opt are initialized. That will get restart going. For a nuisance inversion, I'd do the same, and also change optn.history_mode = "a"
before running main()
. If it runs all right, there will be no sudden jumps in getchi2forall(opt)
at the restart point.
Aiming to fix this today, should be fairly simple to implement the changes in history
to support reading the new file format
Still a niggling glitch somewhere I think ... on running the stationary line example here for 5001 iterations I get this as expected:
but on setting opt.history_mode = "a"
and restarting ... I think the increment counter for iterations is wrong as I get this on completion:
Finally, on trying to restart SkyTEM at /scratch/z67/ar/test1.6
where the restart
flag is set in the 02
script I get output in the outfile like so, where all soundings do not have the same number of iterations - it should be xxx out of 6002
as the initial run was for 3001 errors :
[ Info: **51.392781019210815**sec** 1001 out of 3003
[ Info: **50.7584969997406**sec** 1001 out of 3006
[ Info: **50.94528889656067**sec** 1001 out of 3002
[ Info: **50.6270649433136**sec** 1001 out of 3002
[ Info: **50.54465293884277**sec** 1001 out of 3002
[ Info: **51.35809898376465**sec** 1001 out of 3005
[ Info: **51.06014108657837**sec** 1001 out of 3004
[ Info: **51.39773917198181**sec** 1001 out of 3002
[ Info: **52.296891927719116**sec** 1001 out of 3005
[ Info: **52.96943187713623**sec** 1001 out of 3002
[ Info: **52.49860501289368**sec** 1001 out of 3004
[ Info: **53.28924918174744**sec** 1001 out of 3004
[ Info: **53.73833394050598**sec** 1001 out of 3005
[ Info: **53.75066614151001**sec** 1001 out of 3005
[ Info: **53.945526123046875**sec** 1001 out of 3006
OK I'll have a look - is the issue just for the "out of \<total>" counter or for the actual "\<i> of" counter as well?
both - it should say something like isample = iterlast+1:iterlast+nsamples
. Variable nsamples
is always, the number of samples the sampler returns. So if nsamples = 5001
and iterlast is 5001
then it should run for 5001 samples more till a total of 10002 samples
This should now work, just needed to change the lines that read the last iteration from the output file when restarting
Almost there ... so without nuisances it works fine ...
continuation for 1D Line underneath is great ...
but for nuisances, there is a problem ... I'm trying out the tempest nuisance_unrot example for only 101 iterations, followed by a restart by setting opt.history_mode=a
.. it still fails with the error
ArgumentError: invalid base 10 digit '.' in "-1.200000e+02"
.
.
@ ~/.julia/dev/HiQGA/src/TransD_GP_MCMC.jl:392
and Line 392 has to do with parsing the nuisance idx
row = findlast(parse.(Int, nuisance_data[:,1]) .== chain_idx)
This should work now, there was a bug where I unintentionally converted the contents of the nuisance file into a 1D array and then attempted to index it like a matrix.
WhooHoo!
To reduce file clutter on NCI. Not ready for merge yet but will keep track of progress here. The new functionality requires completion of several tasks: