MFlowCode / MFC

Exascale simulation of multiphase/physics fluid dynamics
https://mflowcode.github.io
MIT License

Added memory to the Delta Mako file for batch runs #364

Closed · mrodrig6 closed this 3 months ago

mrodrig6 commented 4 months ago

Description

Please include a summary of the changes and the related issue(s) if they exist. Please also include relevant motivation and context.

Fixes #(issue) [optional]

Type of change

Please delete options that are not relevant.

Scope

If you cannot check the above box, please split your PR into multiple PRs that each have a common goal.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce them. Please also list any relevant details for your test configuration.

Test Configuration:

Checklist

If your changes touch any source files (anything in src/)

To make sure the code is performing as expected on GPU devices, I have:

sbryngelson commented 4 months ago

@mrodrig6, what exactly does this do? (Please fill out the PR form.) How do you know it works? What doesn't work in the current version? @wilfonba, you ran some big jobs on Delta; did you find a problem?

wilfonba commented 4 months ago

I ran 2.25 billion cells in batch mode on 32 nodes (128 A100s) with the current Mako file. The only thing I had to change was setting 4 CPU cores per task. I suppose I should have shared with Spencer before now that I needed to do this.

sbryngelson commented 4 months ago

> I ran 2.25 billion cells in batch mode on 32 nodes (128 A100s) with the current Mako file. The only thing I had to change was setting 4 CPU cores per task. I suppose I should have shared with Spencer before now that I needed to do this.

So how did you go about changing this?

wilfonba commented 4 months ago

I changed it manually in the autogenerated batch script. Not exactly the optimal way to do it.

sbryngelson commented 4 months ago

> I changed it manually in the autogenerated batch script. Not exactly the optimal way to do it.

I just meant: what is the literal SLURM script line?

wilfonba commented 4 months ago

`#SBATCH --cpus-per-task=4`. I think this could also be fixed the way Mauro suggested, by requesting more memory per core.
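
For context, here is a minimal sketch of how that directive sits in a generated batch header for a single 4-GPU Delta node; the partition, account, and time values are placeholders, not the actual template output:

```bash
#!/bin/bash
#SBATCH --partition=gpuA100x4     # placeholder partition name
#SBATCH --account=<allocation>    # placeholder allocation
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-node=4       # one MPI rank per GPU
#SBATCH --cpus-per-task=4         # the manual change discussed here
#SBATCH --time=01:00:00
```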

sbryngelson commented 4 months ago

I'm still unclear on what the right thing to do here is. Someone will need to explain to me what's going on in this PR and pros/cons (to the extent they exist).

wilfonba commented 3 months ago

Replacing `--mem=208G` with `--mem=0` will always request all of the memory available on a node. This will cause longer wait times for interactive jobs using fewer than 4 GPUs, because the scheduler will wait for a node whose memory is entirely free. It might also charge for all 4 GPUs; I'm not sure how Delta handles this case.
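
For reference, the two directives being compared (the 208G value is the one in this PR; the behavior described above is standard SLURM semantics):

```bash
# Explicit per-node memory request, as proposed in this PR:
#SBATCH --mem=208G

# Alternative: request all memory available on the allocated node.
# The scheduler then waits for a node whose memory is fully free.
#SBATCH --mem=0
```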

sbryngelson commented 3 months ago

I don't see why we are specifying memory at all. SLURM should handle this on its own if you specify all of the other requested resources (CPU cores and GPU devices). I would be quite shocked if that were not the case.

wilfonba commented 3 months ago

Based on @haochey's Slack message, Delta's SLURM setup defaults to 1 GB per core. The fact that I had to specify 4 CPU cores per GPU for the airfoil simulation to avoid out-of-memory errors suggests that Delta is doing something odd with memory per core. However, 4 CPU cores per GPU didn't map to 1 GB per core for that case, so I'm not entirely sure what is going on.

sbryngelson commented 3 months ago

Can we specify memory per core, then?

wilfonba commented 3 months ago

`--mem-per-cpu` specifies memory per CPU core in SLURM. It's still odd that this is a problem at all; it isn't on any other machine. I'm guessing it has to do with UIUC setting a default that encourages full utilization of shared nodes, i.e., someone with an AI workload doing a lot of CPU-to-GPU memory transfers (more CPU memory needed than GPU memory) would need more memory per core than someone whose workload rarely touches CPU memory. This seems like an edge case, unless Delta goes all the way to putting multiple jobs on the same GPU when they don't request the full memory. I'm not entirely sure.
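
A minimal sketch of that per-core alternative, assuming an illustrative 2 GB per core (the right value for Delta would need to be checked):

```bash
# Hypothetical per-core memory request instead of a fixed per-node amount.
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G   # illustrative: 4 cores x 2 GB = 8 GB of host memory per rank
```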

sbryngelson commented 3 months ago

I'm going to merge this since it seems necessary to run any meaningful jobs and I don't know what the alternative is.