Open tomcordruw opened 2 months ago
Edit: Took into account the time for the plotting step in the GCS workflows, which is missing in the NFS version. The times shown are now the durations without the plotting step; the totals with the plotting step included are shown in brackets.
Configuration:
Results:

NFS:
- e2-standard-4: 4 hours 37 minutes, Cost: 7.81 CHF
- e2-highcpu-16: 3 hours 16 minutes, Cost: 16.38 CHF
GCS Bucket:

argo_bucket_run.yaml:
- e2-standard-4: 4 hours 33 minutes (4 hours 57 minutes), Cost: 9.01 CHF
- e2-highcpu-16: 3 hours 24 minutes (3 hours 46 minutes), Cost: 18.72 CHF

argo_bucket_upload.yaml:
- e2-standard-4: 4 hours 55 minutes (5 hours 17 minutes), Cost: 10.67 CHF
- e2-highcpu-16: 3 hours 36 minutes (3 hours 59 minutes), Cost: 18.81 CHF
@tomcordruw Is the time of the bucket workflows without the final plotting step? If not, can you see from the outputs, how long did it take?
Oh, that would explain it; I didn't realise that step was missing in the NFS workflow. The plotting step is included in the total runtime here, and in the tests it took between 20 and 25 minutes, which pretty much accounts for the difference.
@tomcordruw Did these jobs run with the image on the node already or does the time include the image pull? We need to have the time without the image pull for a scalable comparison. Currently, the image pull is more than 30 mins and may vary so it can distort the comparison.
The time unfortunately includes the image pull, but I am currently testing the script after some modifications to initially run the start job and pull the images. From what I can tell, image pulling/pod initialisation takes 31-32 minutes in these configurations, which is in line with the difference between the workflows I have run with and without previously pulled images.
But of course there can be errors and other things prolonging the image pulling step, so it will be accounted for from now on.
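For reference, one common way to warm the node image cache before timing a workflow is a short-lived DaemonSet that pulls the processing image on every node. A minimal sketch, assuming a placeholder image name and the argo namespace (both hypothetical, not taken from this repo):

```yaml
# Sketch: pre-pull the processing image on every node so the workflow
# pods start from a warm cache. Image name and namespace are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepull-processing-image
  namespace: argo
spec:
  selector:
    matchLabels:
      app: prepull-processing-image
  template:
    metadata:
      labels:
        app: prepull-processing-image
    spec:
      initContainers:
        - name: pull
          image: ghcr.io/example/pfnano-processing:latest  # placeholder image
          command: ["sh", "-c", "true"]  # exit immediately; the pull itself is the point
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9  # tiny container that keeps the pod alive
```

Once all DaemonSet pods are running, the image is cached on every node and the DaemonSet can be deleted before starting the timed workflow.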
@tomcordruw Is this a fair comparison?
e2-standard-4: 4 vCPUs, 16 GB mem
e2-highcpu-16: 16 vCPUs, 16 GB mem
If N jobs is 48, a 12-node e2-highcpu-16 cluster is mostly idle.
CPU-wise it could have had 12 jobs on each node (0.8 * 16, because we requested 800m CPU); memory-wise, 6. Now it most likely had only 4 per node (if the 48 jobs were evenly distributed across the nodes), or many nodes were idle. And the cost goes with the time, not with the occupancy.
A fair comparison would be how many events / hour we can get with the maximum occupancy.
For memory requests, as seen in https://github.com/cms-dpoa/cloud-processing/issues/49#issuecomment-2363148448, we could most likely set it lower than 2.3 GB; e.g. 1.5 GB would allow ~10 jobs/node.
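To make that concrete, the per-job requests in the workflow template could look roughly like the sketch below; the template name, image, and command are placeholders, and the request values are the ones discussed above:

```yaml
# Sketch: per-job resource requests in an Argo Workflow template.
# With 800m CPU and 1.5Gi memory per job, one e2-highcpu-16 node
# (16 vCPU, 16 GB) fits roughly 10 jobs, limited by memory (16 / 1.5 ≈ 10).
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pfnano-resources-sketch-
spec:
  entrypoint: pfnano-step
  templates:
    - name: pfnano-step                                    # hypothetical template name
      container:
        image: ghcr.io/example/pfnano-processing:latest    # placeholder image
        command: ["sh", "-c", "echo run pfnano here"]      # placeholder command
        resources:
          requests:
            cpu: 800m
            memory: 1.5Gi
```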
@katilp Indeed, what I'm seeing supports what you're writing. And yes, the cost is based on time, not resource usage, so I will try lowering the resource requests and see how well the highcpu clusters can be utilised that way.
The resource usage I'm getting so far indicates a 1:2 ratio of vCPU (800m) to memory (~1.6 GB) for each job. While there is no e2 machine type with that ratio (standard is 1:4 and highcpu is 1:1), it can be achieved with custom machine types, so that way we could reduce the amount of unused resources.
Right, but the first thing is to have a big enough number of jobs so that it really fills the cluster. The number of jobs might need to be different to compare the two types of clusters, or there could be fewer nodes in the high-CPU cluster. What matters is the total number of CPUs.
Right, so e.g. for 12 e2-highcpu-16 nodes, after adjusting resource requests, it should allow 10 jobs per node, meaning 120 jobs total, or alternatively 5 nodes for 48 jobs to have a fair comparison?
Yes, something like this. It probably requires some manual inspection. Best to start a workflow and see how they go. If there are "left-overs", i.e. jobs that do not fit running in parallel, then decrease the number of jobs so that all pfnano steps run at the same time.
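A rough sketch of how the number of jobs and their concurrency could be capped in the workflow spec, assuming the jobs are generated with a loop; the parameter and template names are hypothetical and not taken from the repository:

```yaml
# Sketch: cap the number of parallel pfnano jobs so they all fit on the
# cluster at once (e.g. 120 jobs on 12 e2-highcpu-16 nodes at ~10 jobs/node).
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pfnano-scaling-sketch-
spec:
  entrypoint: main
  parallelism: 120                      # upper bound on concurrently running pods
  arguments:
    parameters:
      - name: n-jobs                    # hypothetical parameter name
        value: "120"
  templates:
    - name: main
      steps:
        - - name: pfnano
            template: pfnano-step       # hypothetical processing template
            arguments:
              parameters:
                - name: job-id
                  value: "{{item}}"
            withSequence:
              count: "{{workflow.parameters.n-jobs}}"
    - name: pfnano-step
      inputs:
        parameters:
          - name: job-id
      container:
        image: ghcr.io/example/pfnano-processing:latest   # placeholder image
        command: ["sh", "-c", "echo processing chunk {{inputs.parameters.job-id}}"]
```

The idea is simply to keep the job count and `parallelism` at or below what the node pool can schedule at once, so no pfnano step is left waiting for a free slot.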
Okay, seems clear to me now! I will do some runs and inspect how things behave and update the comparisons accordingly.
Run an Argo workflow processing a significant chunk of a dataset (~3 million events) and compare the results and runtime for a standard cluster and one with a higher vCPU count.