GeoscienceAustralia / HiQGA.jl

High Quality Geophysical Analysis provides a general purpose Bayesian and deterministic inversion framework for various geophysical methods and spatially distributed / timeseries data
MIT License
38 stars 5 forks source link

Share output files between chains #26

Closed richardt94 closed 2 years ago

richardt94 commented 2 years ago

To reduce file clutter on NCI. Not ready for merge yet but will keep track of progress here. The new functionality requires completion of several tasks:

  1. The "master process" that coordinates the chain sampler workers should handle all I/O to avoid unsafe writes of the same file by multiple processes. This work is mostly complete, implemented by having the worker processes send messages back to a "write log" function that only ever executes on the master.
  2. The history reading functionality needs to be updated to support reading multiple chains of output from a single file. This also will require changing the output format to include a "chain identifier" column to avoid ambiguous output.
  3. Backwards compatibility with old inversion results spread across multiple files should be maintained. It's probably easiest to do this by automatically detecting whether single or multiple files exist in the output directory and dispatching to an appropriate function to handle the reads.
richardt94 commented 2 years ago

This should be ready for merge now.

richardt94 commented 2 years ago

Or not - still some issues with the nested nonstationary GP code.

a2ray commented 2 years ago

Works beautifully for the most part! One niggling issue is with restarts for the McMC. Especially now that we are writing to a single file, it makes sense for us to checkpoint our jobs and restart smaller runs, this will also allow for quicker job execution in the PBS queue.

I think the error is that restarts require a new history function you've written, around here: https://github.com/GeoscienceAustralia/HiQGA.jl/blob/132fdcb8ad561204d2cef091dc2f8b149a455b58/src/TransD_GP_MCMC.jl#L445 and perhaps here https://github.com/GeoscienceAustralia/HiQGA.jl/blob/132fdcb8ad561204d2cef091dc2f8b149a455b58/src/TransD_GP_MCMC.jl#L390 I guess for the second hiccup above, I'd like a history function instead of a direct readdlm for nuisances, which was my original TODO.

richardt94 commented 2 years ago

Works beautifully for the most part! One niggling issue is with restarts for the McMC. Especially now that we are writing to a single file, it makes sense for us to checkpoint our jobs and restart smaller runs, this will also allow for quicker job execution in the PBS queue.

I think the error is that restarts require a new history function you've written, around here:

https://github.com/GeoscienceAustralia/HiQGA.jl/blob/132fdcb8ad561204d2cef091dc2f8b149a455b58/src/TransD_GP_MCMC.jl#L445

and perhaps here

https://github.com/GeoscienceAustralia/HiQGA.jl/blob/132fdcb8ad561204d2cef091dc2f8b149a455b58/src/TransD_GP_MCMC.jl#L390

I guess for the second hiccup above, I'd like a history function instead of a direct readdlm for nuisances, which was my original TODO.

Is there a specific script from examples (or something you've tried on NCI) that I can use to debug? I will definitely need to make the restarts compatible with one file, but I could use the example to formulate a test for our CI as well.

a2ray commented 2 years ago

So the way I check is, run any example (Line is my favorite because it is fast and checks stat and nonstat) for say 1_001 iterations. Then, run main() again but with opt.history_mode = "a" and optlog10lambda = "a" after the opt are initialized. That will get restart going. For a nuisance inversion, I'd do the same, and also change optn.history_mode = "a" before running main(). If it runs all right, there will be no sudden jumps in getchi2forall(opt) at the restart point.

richardt94 commented 2 years ago

Aiming to fix this today, should be fairly simple to implement the changes in history to support reading the new file format

a2ray commented 2 years ago

Still a niggling glitch somewhere I think ... on running the stationary line example here for 5001 iterations I get this as expected: image but on setting opt.history_mode = "a" and restarting ... I think the increment counter for iterations is wrong as I get this on completion: image Finally, on trying to restart SkyTEM at /scratch/z67/ar/test1.6 where the restart flag is set in the 02 script I get output in the outfile like so, where all soundings do not have the same number of iterations - it should be xxx out of 6002 as the initial run was for 3001 errors :

[ Info: **51.392781019210815**sec** 1001 out of 3003
[ Info: **50.7584969997406**sec** 1001 out of 3006
[ Info: **50.94528889656067**sec** 1001 out of 3002
[ Info: **50.6270649433136**sec** 1001 out of 3002
[ Info: **50.54465293884277**sec** 1001 out of 3002
[ Info: **51.35809898376465**sec** 1001 out of 3005
[ Info: **51.06014108657837**sec** 1001 out of 3004
[ Info: **51.39773917198181**sec** 1001 out of 3002
[ Info: **52.296891927719116**sec** 1001 out of 3005
[ Info: **52.96943187713623**sec** 1001 out of 3002
[ Info: **52.49860501289368**sec** 1001 out of 3004
[ Info: **53.28924918174744**sec** 1001 out of 3004
[ Info: **53.73833394050598**sec** 1001 out of 3005
[ Info: **53.75066614151001**sec** 1001 out of 3005
[ Info: **53.945526123046875**sec** 1001 out of 3006
richardt94 commented 2 years ago

OK I'll have a look - is the issue just for the "out of \<total>" counter or for the actual "\<i> of" counter as well?

a2ray commented 2 years ago

both - it should say something like isample = iterlast+1:iterlast+nsamples . Variable nsamples is always, the number of samples the sampler returns. So if nsamples = 5001 and iterlast is 5001 then it should run for 5001 samples more till a total of 10002 samples

richardt94 commented 2 years ago

This should now work, just needed to change the lines that read the last iteration from the output file when restarting

a2ray commented 2 years ago

Almost there ... so without nuisances it works fine ... continuation for 1D Line underneath is great ... image but for nuisances, there is a problem ... I'm trying out the tempest nuisance_unrot example for only 101 iterations, followed by a restart by setting opt.history_mode=a .. it still fails with the error

ArgumentError: invalid base 10 digit '.' in "-1.200000e+02"
.
.
    @ ~/.julia/dev/HiQGA/src/TransD_GP_MCMC.jl:392

and Line 392 has to do with parsing the nuisance idx

            row = findlast(parse.(Int, nuisance_data[:,1]) .== chain_idx)
richardt94 commented 2 years ago

This should work now, there was a bug where I unintentionally converted the contents of the nuisance file into a 1D array and then attempted to index it like a matrix.

a2ray commented 2 years ago

WhooHoo!