coiled / feedback

A place to provide Coiled feedback

Exploring variability in identical workflows #295

Open rsignell opened 1 month ago

rsignell commented 1 month ago

I ran the same workflow twice, which just extracts a time series from a collection of files in object storage.

The first time I ran the workflow, all the tasks except for one finished in 30s or so, but the last task didn't complete for another 2 minutes: https://cloud.coiled.io/clusters/549132/account/esip-lab/information?organization=esip-lab

I then ran the workflow again and the whole thing completed in 30s: https://cloud.coiled.io/clusters/549139/account/esip-lab/information?organization=esip-lab

Is there a way to use the Coiled diagnostics to help figure out what was going on in the first case?

Obviously I'd like to avoid having all the workers sitting idle while one task takes a very long time!

The reproducible notebook is here: https://nbviewer.org/gist/rsignell/8321c6e3f8f30ec70cdb6d768734e458

ntabris commented 1 month ago

Hi @rsignell.

Not sure if this is the root cause, but one thing that might be relevant is that in both cases, the cluster started doing work while some of the workers were still coming up.

There's some expected variance in how long workers take to come up and be ready. Here's what I see on your two clusters:

[screenshots: worker startup timelines for the two clusters]

This can then affect how tasks get distributed.

For large/long-running workloads I'd expect this to make less of a relative difference, but since you're just running a few minutes of work I wouldn't be surprised if this makes a bigger difference.

If you want to make this more consistent for comparative benchmarks on small workloads, you might try using coiled.Cluster(..., wait_for_workers=True) to wait for all the workers before the cluster is considered "ready".
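For reference, a minimal sketch of that option (the cluster name and worker count below are placeholders, not values from the linked notebook):

import coiled
from distributed import Client

# Block until every requested worker has connected, so repeated benchmark
# runs start from the same number of workers.
cluster = coiled.Cluster(
    name="variability-test",   # hypothetical cluster name
    n_workers=20,              # hypothetical worker count
    wait_for_workers=True,
)
client = Client(cluster)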

Does that help / address what you're looking for?

mrocklin commented 1 month ago

When I see long stragglers like this my default assumption is variance in S3 response times. Some S3 requests just take a minute or two for some reason.

phofl commented 1 month ago

Yeah, cluster boot-up seems unlikely here. The task was just running for a very long time; not sure why, though. We don't see any network traffic either, so S3 seems unlikely as well.

rsignell commented 1 month ago

@martindurant, in your dealings with fsspec and GCS, have you occasionally seen requests to object storage taking much longer than the rest? I vaguely remember probing this issue with AWS and S3, where we considered setting fsspec parameters like:

import fsspec

# botocore Config options: shorter timeouts plus more retry attempts
fs_aws = fsspec.filesystem('s3', anon=True,
                           config_kwargs={'connect_timeout': 5,
                                          'read_timeout': 5,
                                          'retries': {'max_attempts': 10}})

not sure if there are similar settings for GCS... I will check...

martindurant commented 1 month ago

Some people have commented that requests can take a long time to fail ( https://github.com/fsspec/gcsfs/issues/633 ), but no, I don't hear much about occasional stragglers. I think it's assumed to be a fair price to pay.

It should be possible to time out requests by passing similar arguments to aiohttp, and you could combine this with gcsfs retries or dask retries. I can't say whether that really gets you the bytes faster or not.
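A hedged sketch of what that combination might look like: the timeout option on the GCS filesystem is an assumption about how gcsfs forwards settings to aiohttp (worth checking against your gcsfs version), while retries= on Client.compute is standard dask.distributed behaviour; the bucket and the read_one helper are placeholders.

import dask
import fsspec
from distributed import Client

# Assumption: gcsfs accepts a 'timeout' option (seconds) forwarded to aiohttp.
fs_gcs = fsspec.filesystem('gcs', token='anon', timeout=5)

def read_one(path):
    # hypothetical per-object read; a hung request should hit the timeout above
    return fs_gcs.cat_file(path)

client = Client()                                  # or the Coiled cluster's client
paths = fs_gcs.ls('some-public-bucket/prefix')     # placeholder bucket/prefix
tasks = [dask.delayed(read_one)(p) for p in paths]
futures = client.compute(tasks, retries=3)         # re-run any tasks that fail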

What we are really missing is a good async-> parallel model here. It should be possible to launch a large number of IO requests at once, and farm out the CPU-bound processing of bytes as they arrive in, but this is complex to do across a cluster. An interesting idea among some grib-oriented peoeple was to have an IO dedicated machine on the cluster with a fat NIC, which does light processing on the incoming data and provides data to other machines on the internal network for the real work.