CovertLab / wcEcoli

Whole Cell Model of E. coli

Feature for setting # of allocated CPUs #217

Open jmason42 opened 6 years ago

jmason42 commented 6 years ago

I think I need a feature for requesting more CPUs when running certain intensive jobs, e.g. unusually long simulations. Ideally we'd address this by optimizing the associated code (particularly the analysis scripts) and fixing memory leaks, but in the short term I'm already manually bumping up my standard allocation. If I'm ever to get my #193 branch into a state where it can reasonably be used by others, I will probably need a way to programmatically request more CPUs.

Obviously the way to do this would be to add more options to fw_queue.py. At the moment I'm thinking of one option per task type, each with a default CPU count.

The non-single analysis scripts are harder to select defaults for, since the scale of the memory they need depends partly on the number of variants/generations/etc. run.

Anyway, I'm looking for feedback, particularly on whether this is something we want at all and, if so, which options you all would or would not want. Five new options is a lot.
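To make the proposal concrete, here's a minimal sketch of what those options could look like, assuming fw_queue.py keeps reading its settings from environment variables the way it does for its existing options. The variable names, the split into five per-task-type options, and the default CPU counts are all illustrative, not a committed interface.

```python
# Hypothetical sketch of five new fw_queue.py options, read from environment
# variables like the existing settings. Names and defaults are placeholders.
import os

def _int_option(name, default):
    return int(os.environ.get(name, default))

CPU_REQUESTS = {
    "simulation": _int_option("WC_SIM_CPUS", 1),
    "analysis_single": _int_option("WC_ANALYSIS_SINGLE_CPUS", 1),
    "analysis_multigen": _int_option("WC_ANALYSIS_MULTIGEN_CPUS", 2),
    "analysis_cohort": _int_option("WC_ANALYSIS_COHORT_CPUS", 2),
    "analysis_variant": _int_option("WC_ANALYSIS_VARIANT_CPUS", 2),
}
```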

tahorst commented 6 years ago

I think it's better to be able to adjust the resource requirement on the fly. You won't necessarily know how many CPUs you need for the analysis until you run into the limit after running all the sims. If we specify it upfront, I think it becomes more difficult to change, and you can't just adjust your qadapter file. I think it's best to keep it as is and allow for manual adjustment when needed.

jmason42 commented 6 years ago

The issue is that I don't need all those cores when running some simulations, only when running the simulations where I disable parts of the fitter. That means manually editing my_qadapter.yaml between runs, or using way more resources than I need. I agree that there's no good way to know upfront how many you need, but at least once I do know, this would let me set it up permanently in my runscripts.

If you're uncomfortable with the defaults, I can change those to None and let them fall through to whatever my_qadapter.yaml provides. However, I think it's kind of silly that our analysis is just outright broken unless you know what line to add to my_qadapter.yaml, even in very simple cases (a couple of generations, a few different seeds, etc.).
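For the fall-through behavior, one way to wire this up (a sketch, assuming we stay on FireWorks and that the SLURM queue-adapter template exposes a cpus_per_task field, as FireWorks' stock SLURM template does) is to attach a per-Firework override via the reserved `_queueadapter` spec key only when a CPU count was actually requested; FireWorks applies such overrides when jobs are launched in reservation mode, and anything left unset keeps whatever my_qadapter.yaml says. The task names and commands below are stand-ins.

```python
# Sketch: override the SLURM allocation for just one job when a CPU count is
# given; unset options fall through to my_qadapter.yaml.
from fireworks import Firework, ScriptTask

def make_firework(name, command, cpus=None):
    spec = {}
    if cpus is not None:
        # FireWorks applies this per-Firework override in reservation-mode qlaunch.
        spec["_queueadapter"] = {"cpus_per_task": cpus}
    return Firework(ScriptTask.from_str(command), name=name, spec=spec)

fw_sim = make_firework("simulation_unfit_ribosomes", "echo run-sim", cpus=8)
fw_plot = make_firework("analysis_single", "echo run-analysis")  # keeps the qadapter default
```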

tahorst commented 6 years ago

If you want to add the option then go ahead, but I think the default behavior needs to stay the same. If we're adding it, then being able to specify each analysis type individually is the way to go.

1fish2 commented 6 years ago

A feature to allocate CPUs makes sense; presumably it mixes manual allocation/advice with automatic allocation that responds to the hardware. Someday we'll run this on a server cluster!

wholecell/utils/parallelization.py is a tiny first step. The fact that it's way more comments than code says something about the complexity of SLURM.

If the workflow runs all analyses of one type at a time on one Sherlock node, then it "just" has to decide between running multiple analyses in parallel vs. allocating multiple CPUs to one or two analyses.
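As a rough illustration of that manual-plus-automatic mix, something like the helper below could live in parallelization.py: honor an explicit request but cap it by what the SLURM allocation (or the local machine) actually provides. The function names and the fallback policy are just a sketch, not the current code.

```python
# Sketch of mixing a manual CPU request with what the hardware actually grants.
# Names and policy are illustrative, not wcEcoli's current API.
import multiprocessing
import os

def available_cpus():
    """CPUs granted by the SLURM allocation if present, else all local CPUs."""
    slurm_cpus = os.environ.get("SLURM_CPUS_PER_TASK")
    return int(slurm_cpus) if slurm_cpus else multiprocessing.cpu_count()

def cpus(requested=None):
    """Honor an explicit request, but never exceed the actual allocation."""
    limit = available_cpus()
    return limit if requested is None else max(1, min(requested, limit))
```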

@jmason42 why does the simulation need more CPUs when disabling parts of the fitter?

jmason42 commented 6 years ago

> @jmason42 why does the simulation need more CPUs when disabling parts of the fitter?

Good question. The parts of the fitter I am disabling fit the expression of ribosomes and RNA polymerases, both of which are key elements in the main 'loop' of our simulation (the central dogma). The un-fit expression of these elements causes insufficient production of cell components, including future ribosomes and RNA polymerases; consequently the cells grow more slowly (much more slowly if I disable ribosome fitting). Through simulation I find that the growth rate slows to roughly a third of the target growth rate (i.e. the growth rate we get under a normal, fit simulation). Since the output of a simulation is proportional to the number of time steps simulated, these simulations are about three times larger than usual, and consequently any analysis script may need roughly three times as much memory.

The real issue here is that our analysis scripts use way more memory than necessary, which is sometimes a fundamental matplotlib problem, sometimes an issue of plotting an unreasonable number of points (e.g. more than one can visually distinguish), but primarily an issue of loading big chunks of simulation output and then discarding >99% of it. This is probably 'easy' to fix by adding more features to the TableReader class. @tahorst and I have talked about this problem but I don't think anyone has opened an issue, so I'll get on that.
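To illustrate the kind of TableReader feature that would help, the idea is to let analysis scripts read just the columns or entries a plot needs instead of loading a whole table and throwing most of it away. The selective-read call signature below is hypothetical (that's the feature being proposed), and the paths, table name, and indices are placeholders.

```python
import os

from wholecell.io.tablereader import TableReader  # wcEcoli reader; import path assumed

sim_out_dir = "out/example/simOut"  # placeholder path to one simulation's output
molecule_indices = [10, 42, 137]    # placeholder: the few molecules a plot cares about

reader = TableReader(os.path.join(sim_out_dir, "BulkMolecules"))

# Hypothetical selective read: only a handful of columns of the counts array,
# rather than the full (time steps x all molecules) matrix that mostly gets discarded.
counts = reader.readColumn("counts", indices=molecule_indices)
```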