NIEHS / beethoven

BEETHOVEN is: Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality
https://niehs.github.io/beethoven/

performance review of `crew_controller_local` #374

Open mitchellmanware opened 1 month ago

mitchellmanware commented 1 month ago

Now that the pipeline is successfully running through the container, we should do a more detailed review of how `crew_controller_local` controls the CPU and memory distribution to each worker and target.

Before containerization, `crew_controller_slurm` improved performance drastically because CPU and memory were declared per controller, which kept the workers well balanced.

Previous controller settings:

```r
# Default controller: 4 workers on the geo partition, 2 CPUs and 4 GB per CPU each
default_controller <- crew.cluster::crew_controller_slurm(
  name = "default_controller",
  workers = 4,
  seconds_idle = 30,
  slurm_partition = "geo",
  slurm_memory_gigabytes_per_cpu = 4,
  slurm_cpus_per_task = 2,
  script_lines = script_lines
)
# Calculation controller: 32 workers, 2 CPUs and 8 GB per CPU each
calc_controller <- crew.cluster::crew_controller_slurm(
  name = "calc_controller",
  workers = 32,
  seconds_idle = 30,
  slurm_partition = "geo",
  slurm_memory_gigabytes_per_cpu = 8,
  slurm_cpus_per_task = 2,
  script_lines = script_lines
)
# NASA controller: 16 workers, 8 CPUs and 4 GB per CPU each
nasa_controller <- crew.cluster::crew_controller_slurm(
  name = "nasa_controller",
  workers = 16,
  seconds_idle = 30,
  slurm_partition = "geo",
  slurm_memory_gigabytes_per_cpu = 4,
  slurm_cpus_per_task = 8,
  script_lines = script_lines
)
# High-memory controller: single worker on the highmem partition, 64 GB per CPU
highmem_controller <- crew.cluster::crew_controller_slurm(
  name = "highmem_controller",
  workers = 1,
  seconds_idle = 30,
  slurm_partition = "highmem",
  slurm_memory_gigabytes_per_cpu = 64,
  slurm_cpus_per_task = 2,
  script_lines = script_lines,
  launch_max = 10
)
```

The goal is to replicate these settings and performance gains via `crew_controller_local`.
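As a rough starting point, something like the sketch below could mirror the controller layout above. The worker counts are placeholders and would need tuning against the container's total allocation; unlike `crew_controller_slurm`, `crew_controller_local` does not appear to expose per-worker CPU or memory arguments, so hard resource limits would have to come from the container or host rather than the controller itself.

```r
library(crew)
library(targets)

# Local analogues of the SLURM controllers above (sketch only).
# There is no slurm_memory_gigabytes_per_cpu / slurm_cpus_per_task
# equivalent, so worker counts must be sized to fit the container.
default_controller <- crew::crew_controller_local(
  name = "default_controller",
  workers = 4,
  seconds_idle = 30
)
calc_controller <- crew::crew_controller_local(
  name = "calc_controller",
  workers = 8,
  seconds_idle = 30
)

# Register the controllers as a group so individual targets can be
# routed to a named controller.
targets::tar_option_set(
  controller = crew::crew_controller_group(default_controller, calc_controller)
)
```

A target could then opt into a specific controller with `resources = tar_resources(crew = tar_resources_crew(controller = "calc_controller"))`.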

mitchellmanware commented 1 month ago

We should also further explore the interaction between `future` and `crew` within the containerized environment.

sigmafelix commented 3 days ago

@mitchellmanware `future` and `crew` (the mirai backend, actually) do not seem to work well together, as mirai daemons do not allow nested parallelism. I think we could divide each target into pieces as small as possible so that each target builds fairly quickly. A potential performance cost of this approach is a very large list of results and the subsequent overhead of merging it into a large data.frame (or data.table) object.
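On the merge cost: if each branch returns a data.table, a single `rbindlist()` call at the end should keep that overhead manageable. A rough illustration (the list contents below are made up):

```r
library(data.table)

# Stand-in for the per-branch results: many small data.tables.
chunk_results <- lapply(seq_len(500), function(i) {
  data.table(chunk = i, value = rnorm(50))
})

# rbindlist() binds a long list of data.tables in one pass, which is much
# cheaper than growing a data.frame iteratively with do.call(rbind, ...).
merged <- data.table::rbindlist(chunk_results, use.names = TRUE, fill = TRUE)
```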

mitchellmanware commented 2 days ago

@sigmafelix We have started to implement something similar, breaking the temporal period down into 10/25/50-day chunks (the optimal size is still TBD) so that each worker runs quickly. This dynamic branching over smaller temporal chunks has already shown benefits; for example, NARR covariates for the full temporal range and all AQS locations now calculate in roughly 5 minutes.
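For reference, a minimal sketch of what that chunked dynamic branching could look like in `_targets.R`; the date range, 25-day chunk size, `calculate_narr()`, and `aqs_sites` are illustrative placeholders, not the pipeline's actual names:

```r
library(targets)

# Placeholder date range; the real pipeline covers the full temporal period.
all_days <- seq(as.Date("2018-01-01"), as.Date("2022-12-31"), by = "day")

list(
  # Split the period into ~25-day chunks, one list element per chunk.
  tar_target(
    date_chunks,
    split(all_days, ceiling(seq_along(all_days) / 25)),
    iteration = "list"
  ),
  # Dynamically branch over the chunks so each worker builds one small piece,
  # routed to the calculation controller.
  tar_target(
    narr_covariates,
    calculate_narr(dates = date_chunks, locations = aqs_sites),
    pattern = map(date_chunks),
    resources = tar_resources(
      crew = tar_resources_crew(controller = "calc_controller")
    )
  )
)
```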

With the crew local controller, it is still unclear whether we can assign a specific amount of memory to a single target. If we can, this should resolve the issues associated with merging a large list into a data.frame.