I tried testing some things using this branch, but unfortunately I didn't get to see whether it is MPI safe or not, because... well, TL;DR: with my current configuration, ensemble refinement (TDER) does not work at all without manually patching some changes into the code. I think this PR would be a good place to introduce them, to make this job functional on Perlmutter.
For context, I have been running ensemble refinements on my Perlmutter installation for some time now, but in order to get them to work, I needed to hardcode some stuff in the `xfel/ui/db/job.py` file. Recently, I reinstalled cctbx and it looks like I lost the stash with my changes in the process. Unfortunately, I don't remember them exactly. Without my custom commands, I get a `Sorry` every time I try to sbatch an ensemble refinement job. Both `Sorry`s are associated with how the `app.params` are passed to the TDER `Script(arguments)`. I would like to fix it myself, but I don't really see an easy way how, and I think this might be clearer to you @phyy-nx or @irisdyoung.
The first `Sorry`: in the GUI settings I used to supply the MPI command `srun --cpu-bind=cores`, because without the latter part the refinement starts in a single process. You can do that for all other jobs, and I do not remember having issues with doing that for TDER, but now I end up with `Sorry: Unrecognized argument: --cpu-bind=cores`. This is because, in order to get the arguments, a `split` method is called on the string, which causes a rogue `--cpu-bind=cores` to appear in the phil arguments.
The second `Sorry` comes from basically the same issue, but for the extra submission arguments.
I am already a bit tired today, but maybe a solution could be easily included in this PR? While writing this, I recall I did that by hard-coding the extra arguments and MPI commands after `split()` is called. I guess a phil string might contain whitespace, so would fixing this issue be just a matter of writing the file instead of creating it on the fly in the code?
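
A minimal sketch of the failure mode and one possible fix; the names and values here are hypothetical, not the actual `job.py` code:

```python
import shlex

mpi_command = 'srun --cpu-bind=cores'
phil_args = ['output.output_dir=/path/with spaces/out']

# Naive handling: joining everything into one string and calling
# str.split() makes every whitespace-separated token its own argument,
# so '--cpu-bind=cores' leaks into the phil argument list (the
# "Unrecognized argument" Sorry) and any phil value containing a space
# is torn apart.
naive = ' '.join([mpi_command] + phil_args).split()
print(naive)
# ['srun', '--cpu-bind=cores', 'output.output_dir=/path/with', 'spaces/out']

# Splitting only the MPI command with shlex.split() and keeping the phil
# arguments as a separate list (or writing them to a phil file) preserves
# both the binding flag and whitespace-containing values.
command = shlex.split(mpi_command) + phil_args
print(command)
# ['srun', '--cpu-bind=cores', 'output.output_dir=/path/with spaces/out']
```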
OK so, as often, the main problem was that I was tired and did not see an obvious solution. The ensemble refinement script indeed did not work, but I pushed some changes to fix the "extra options" as well as the handling of arguments with whitespace. Now it starts as expected, so I am going to run some jobs.
@Baharis Excellent. Looking forward to hearing how testing goes.
@phyy-nx There is one more little change I would need to be able to run ensemble refinements from the GUI without editing the job file. Currently, `mp.nnodes` is hard-coded as 1 for ensemble refinement, which in my opinion is unrealistic: using 1 node, an ensemble refinement for 250 experiments took 45 minutes. Could I use this PR and your attention to patch in `params.mp.nnodes_ensemble` and a new "ensemble" `SpinCtrl` in the settings, which would allow changing nnodes from the GUI?
@Baharis Yeah, this was next on my list, so I'd be happy if it went into this PR. It needs to also be added as a tracked parameter in `util/mp.py`, probably `nnodes_ensemble_refinement`.
Added the new `params.mp.nnodes_tder`, the necessary hooks, and the associated elements in the GUI. I opted for this name for brevity and descriptiveness (the abbreviation is explained with a tooltip in the GUI).
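
For reference, a hedged sketch of what such a phil definition could look like, assuming it mirrors the existing `mp.nnodes` entry; the actual scope in `util/mp.py` may differ:

```python
# Hypothetical sketch of the new parameter's phil definition, modeled
# on the existing mp.nnodes entry; the real scope in util/mp.py may
# look different.
from libtbx.phil import parse

mp_phil = parse("""
mp {
  nnodes = 1
    .type = int
    .help = Number of nodes to request for a job
  nnodes_tder = 1
    .type = int
    .help = Number of nodes to request for time-dependent ensemble refinement
}
""")

params = mp_phil.extract()
print(params.mp.nnodes_tder)  # -> 1
```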
> using 1 node, an ensemble refinement for 250 experiments took 45 minutes.
Small correction: the mentioned processing took so long because I had reverted to using the MPI command `srun` instead of `srun --cpu-bind=cores`, which effectively disables multiprocessing on Perlmutter with my settings. With correct binding, the process takes a few minutes as expected, but this requires the fix to the splitting suggested in this PR. So all good.
@phyy-nx I spent the whole day on this branch running ensemble refinement with a lot of success. In particular, I tried to break it with my old method of using 128 procs per node, which raised `DivideByZero` memory errors before, but miraculously does not do it now. I am not sure if this is due to the better structure introduced here or some changes elsewhere, but I could not get the ensemble refinement to crash! Which I guess is a shame, because I didn't get to check the "safeness", but it's good to know the current structure is robust. I also haven't found any other bugs, and the nnodes field works as intended!
@phyy-nx Are there any reasons against merging this PR? I have personally been using this branch as my default for quite some time now. The conflict seems trivial.
Hi @Baharis let's hold until I get the serialtbx stuff I'm working on in, then we'll rebase this on master and proceed.
@phyy-nx As per today's discussion, since serialtbx has been merged, I will try to rebase this branch onto the current master and resolve the conflicts.
Again, I advocate for merging this branch into `main`. Since it was modified slightly in the last 18 months, here is an update on all the features:

- New command `cctbx.xfel.ensemble_refinement_pipeline`
- `recompute_mosaicity` and `merge`
- `frame_extractor`, as it is too legacy to keep it going
- In the `cctbx.xfel` GUI, add a setting and widget to modify the number of nodes used for ensemble refinement

Looks good, merging.
New command `cctbx.xfel.ensemble_refinement_pipeline`
This is an MPI-safe script that runs four programs in sequence:
For the first three, only MPI rank 0 is used. Then the full MPI job can run for integration. This ensures the first three programs are only run single-process.
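
The gating presumably follows the usual mpi4py pattern, sketched here with placeholder function names (not the script's actual code):

```python
# Generic mpi4py sketch of the rank-0 gating pattern described above;
# the function names are placeholders, not the pipeline's actual code.
from mpi4py import MPI

def run_serial_steps():
  """Stand-in for the first three single-process programs (hypothetical)."""
  print("serial steps done on rank 0")

def run_parallel_integration(comm):
  """Stand-in for the MPI-parallel integration step (hypothetical)."""
  print("rank %d integrating" % comm.Get_rank())

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
  # Only rank 0 executes the single-process steps...
  run_serial_steps()
# ...while every other rank waits here until rank 0 is done.
comm.Barrier()

# Now the full MPI job participates in integration.
run_parallel_integration(comm)
```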
Also: