cctbx / cctbx_project

Computational Crystallography Toolbox
https://cci.lbl.gov/docs/cctbx

XFEL GUI: MPI safe ensemble refinement #840

Closed. phyy-nx closed this pull request 1 month ago.

phyy-nx commented 1 year ago

New command: cctbx.xfel.ensemble_refine_pipeline

This is an MPI-safe script that runs four programs in sequence:

For the first three, only MPI rank 0 is used. Then the full MPI job runs for integration. This ensures the first three programs are only ever run in a single process.
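To make the rank-gating idea concrete, here is a minimal sketch using mpi4py directly; the step names and the helper function are placeholders for illustration, not the actual programs or code in the pipeline script.

```python
# Hypothetical sketch of rank-0 gating with mpi4py; the real
# cctbx.xfel pipeline script may structure this differently.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def run_serial_step(name):
    # Placeholder for one of the single-process preparation steps.
    print(f"rank 0 running {name}")

if rank == 0:
    # Only rank 0 executes the first three (serial) programs.
    for step in ("step_1", "step_2", "step_3"):
        run_serial_step(step)

# Everyone waits until rank 0 has finished the serial steps...
comm.barrier()

# ...then the full MPI job proceeds with the parallel integration step.
print(f"rank {rank} participating in the parallel integration step")
```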

Also:

Baharis commented 1 year ago

I tried testing some things using this branch, but unfortunately I didn't get to see whether it is MPI safe or not, because... well, TL;DR: with my current configuration, ensemble refinement (TDER) does not work at all without manually patching some changes into the code. I think this PR would be a good place to introduce them, to make this job functional on Perlmutter.

For context, I have been running ensemble refinements on my Perlmutter installation for some time now, but in order to get them to work I needed to hardcode some stuff in the xfel/ui/db/job.py file. Recently I reinstalled cctbx, and it looks like I lost the stash with my changes in the process; unfortunately, I don't remember them exactly. Without my custom changes, I get a Sorry every time I try to sbatch an ensemble refinement job. Both Sorrys are associated with how the app.params are passed to the TDER Script(arguments). I would like to fix it myself, but I don't readily see how, and I think this might be clearer to you @phyy-nx or @irisdyoung.

The first Sorry: I used to supply the MPI command "srun --cpu-bind=cores" in the GUI settings, because without the latter part the refinement starts in a single process. You can do that for all other jobs, and I do not remember having issues doing so for TDER, but now I end up with Sorry: Unrecognized argument: --cpu-bind=cores. This is because, in order to get the arguments, a "split" method is called on the string, which causes a rogue --cpu-bind=cores to appear among the phil arguments.

The second Sorry comes from basically the same issue, but for the extra submission arguments.

I am already a bit tired today, but maybe a solution could easily be included in this PR? Thinking back, I believe I previously fixed it by hard-coding the extra arguments and MPI commands after split() is called. I guess a phil string might contain whitespace, so would fixing this issue just be a matter of writing the file instead of creating it on the fly in the code?
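To illustrate the splitting issue (a hedged sketch only, not the actual xfel/ui/db/job.py code): a plain str.split() on the assembled command line turns the second half of the MPI command into something that looks like a phil argument, and it also breaks quoted values containing whitespace; keeping the MPI command tokens and the phil arguments separate, and splitting with shlex, avoids both problems. The path below is made up.

```python
# Hypothetical illustration of the splitting problem, not the
# actual code in xfel/ui/db/job.py.
import shlex

mpi_command = 'srun --cpu-bind=cores'
phil_args = ['output.output_dir="/path/with spaces/out"']

# Naive approach: everything ends up in one flat token list, so
# "--cpu-bind=cores" is later mistaken for a phil argument and the
# quoted path containing whitespace is broken apart.
naive = (mpi_command + ' ' + ' '.join(phil_args)).split()

# Safer approach: keep the MPI command and the phil arguments separate,
# and use shlex so quoted values with whitespace survive splitting.
command_tokens = shlex.split(mpi_command)   # ['srun', '--cpu-bind=cores']
phil_tokens = [tok for arg in phil_args for tok in shlex.split(arg)]

print(naive)
print(command_tokens, phil_tokens)
```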

Baharis commented 1 year ago

OK, so, as often happens, the main problem was that I was tired and did not see an obvious solution. The ensemble refinement script indeed did not work, but I pushed some changes that fix the "extra options" handling as well as the handling of arguments with whitespace. Now it starts as expected, so I am going to run some jobs.

phyy-nx commented 1 year ago

@Baharis Excellent. Looking forward to hearing how testing goes.

Baharis commented 1 year ago

@phyy-nx There is one more little change I would need to be able to run ensemble refinements from the GUI without editing the job file. Currently, mp.nnodes is hard-coded to 1 for ensemble refinement, which in my opinion is unrealistic: using 1 node, an ensemble refinement for 250 experiments took 45 minutes. Could I use this PR and your attention to patch in params.mp.nnodes_ensemble and a new "ensemble" SpinCtrl in the settings, which would allow changing nnodes from the GUI?

phyy-nx commented 1 year ago

@Baharis Yeah, this was next on my list, so I'd be happy if it went into this PR. It also needs to be added as a tracked parameter in util/mp.py, probably as nnodes_ensemble_refinement.

Baharis commented 1 year ago

Added a new params.mp.nnodes_tder, the necessary hooks, and the associated elements in the GUI. I opted for this name for brevity and descriptiveness (the abbreviation is explained with a tooltip in the GUI).
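For reference, a hedged sketch of how a parameter like mp.nnodes_tder can be declared and read with libtbx.phil; the scope layout, defaults, and help strings below are assumptions for illustration, not the actual definitions in xfel/ui or util/mp.py.

```python
# Hedged sketch of declaring and reading an mp.nnodes_tder-style
# parameter with libtbx.phil; the real scope layout differs.
from libtbx import phil

master_phil = phil.parse("""
mp {
  nnodes = 1
    .type = int
    .help = Number of nodes requested for most job types
  nnodes_tder = 1
    .type = int
    .help = Number of nodes for time-dependent ensemble refinement (TDER)
}
""")

params = master_phil.extract()
print(params.mp.nnodes_tder)  # default: 1
```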

Baharis commented 1 year ago

using 1 node, an ensemble refinement for 250 experiments took 45 minutes.

Small correction: the processing mentioned above took so long because I had reverted to using the MPI command "srun" instead of "srun --cpu-bind=cores", which effectively disables multiprocessing on Perlmutter with my settings. With correct binding the process takes a few minutes, as expected, but this requires the argument-splitting fix suggested in this PR. So all good.

Baharis commented 1 year ago

@phyy-nx I spent the whole day on this branch running ensemble refinements, with a lot of success. In particular, I tried to break it with my old method of using 128 procs per node, which used to raise DivideByZero memory errors, but miraculously does not do so now. I am not sure whether this is due to the better structure introduced here or to some changes elsewhere, but I could not get the ensemble refinement to crash! Which I guess is a shame, because I didn't get to check the "safeness", but it is good to know the current structure is robust. I also haven't found any other bugs, and the nnodes field works as intended!

Baharis commented 1 year ago

@phyy-nx Are there any reasons against merging this PR? I have personally been using this branch as my default for quite some time now. The conflict seems trivial.

phyy-nx commented 1 year ago

Hi @Baharis, let's hold until I get the serialtbx work I'm doing merged; then we'll rebase this on master and proceed.

Baharis commented 3 months ago

@phyy-nx As per today's discussion, since serialtbx has been merged, I will try to rebase this branch onto the current master and resolve the conflicts.

Baharis commented 1 month ago

Again, I advocate for merging this branch into main. Since it has been modified slightly over the last 18 months, here is an update on all the features:

phyy-nx commented 1 month ago

Looks good, merging.