coiled / feedback

A place to provide Coiled feedback

parallel particle tracking simulations with Coiled #269

Closed rsignell closed 5 months ago

rsignell commented 5 months ago

I was excited to read the blog post https://medium.com/coiled-hq/processing-terabyte-scale-nasa-cloud-datasets-with-coiled-70ab552f35ec by @jrbourbeau because I have basically the same use case: running a bunch of completely independent particle tracking simulations, i.e. a tracking function with a single argument (start_time) called with a list of different start times.

I used the code in the blog post as a template and created this reproducible notebook (it reads from object storage, no credentials needed!).

Running locally, I successfully ran the simulations in parallel using a local client with Dask futures.

Running with Coiled, the code didn't work. I used a Coiled software environment that matched my local conda environment (built from the same conda environment file), but the code seems to have died in a SciPy LinearNDInterpolator routine?

I'm a bit lost on what is wrong -- hoping someone on the Coiled team can help me out!
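
For reference, the local-client pattern described above is roughly the following minimal sketch (the body of process and the exact list of start times are placeholders, not the notebook's actual code):

    # Minimal sketch of running independent simulations with Dask futures on a
    # local client. `process` and the start times are illustrative placeholders.
    from datetime import datetime, timedelta
    from distributed import Client

    def process(start_time):
        # placeholder for one particle tracking simulation
        ...

    if __name__ == "__main__":
        client = Client()  # local cluster
        start_times = [datetime(2012, 1, 1) + timedelta(days=i) for i in range(30)]
        futures = client.map(process, start_times)  # one task per start time
        results = client.gather(futures)            # block until all runs finish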

phofl commented 5 months ago

Hi,

The worker is paused because memory usage is high. Could you try this again with bigger machines to see if that fixes the problem?

The relevant section of the logs:

2024-01-28 17:30:23.4080  scheduler  distributed.worker.memory - WARNING - Worker is at 82% memory usage. Pausing worker.  Process memory: 4.68 GiB -- Worker memory limit: 5.70 GiB
2024-01-28 17:30:23.3160  scheduler  distributed.worker.memory - WARNING - Worker is at 78% memory usage. Resuming worker. Process memory: 4.47 GiB -- Worker memory limit: 5.70 GiB
2024-01-28 17:30:23.0770  scheduler  distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 4.56 GiB -- Worker memory limit: 5.70 GiB

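For reference, one way to request larger workers is roughly the following sketch; the coiled.Cluster keyword names (n_workers, worker_memory) are assumptions to check against the current Coiled documentation:

    # Rough sketch of asking Coiled for workers with more memory headroom than
    # the 5.70 GiB limit seen in the logs. Keyword names are assumptions.
    import coiled
    from distributed import Client

    cluster = coiled.Cluster(
        n_workers=10,            # illustrative worker count
        worker_memory="16 GiB",  # request bigger machines per worker
    )
    client = Client(cluster)
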
rsignell commented 5 months ago

Thanks for looking into this @phofl! I thought the above was okay because they were WARNINGs. A single simulation run on Coiled used about 5 GB of RAM, so I thought that would be sufficient.

But I ran again specifying at least 8 GB of RAM (https://cloud.coiled.io/clusters/367372/information?account=open-science-computing). Those warnings went away, but now the simulations got stuck again with:

2024-01-29 07:17:35.8670  10.0.57.202  Dask stopped: Dask process stopped
2024-01-29 07:17:35.8320  10.0.57.202  VM Instance stopped: Dask graceful shutdown
2024-01-29 07:17:04.4520  10.0.57.202  VM Instance stopping: Software environment exited
2024-01-29 07:17:02.8800  10.0.57.202  Dask stopping: Dask process exiting
2024-01-29 07:15:24.1450  10.0.57.202  distributed.worker - WARNING - Compute Failed
    Key:       process-24acb7c2-1d4d-4060-9989-174258d03558-1
    Function:  process
    args:      (datetime.datetime(2012, 1, 29, 0, 0))
    kwargs:    {}
2024-01-29 07:15:05.1020  scheduler  distributed.core - INFO - Event loop was unresponsive in Scheduler for 4.02s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2024-01-29 07:15:01.0000  scheduler  distributed.core - INFO - Starting established connection to tls://10.0.62.156:56726
2024-01-29 07:15:00.9990

fjetter commented 5 months ago

I'm not sure why your cluster appeared to be stuck, but I can see workers dying due to an unhandled exception during deserialization. This traces back to fsspec, and I opened an upstream issue: https://github.com/fsspec/filesystem_spec/issues/1520

It looks like you did receive a "Compute Failed" error log. Did you not receive an exception on the client side?
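
For reference, a minimal sketch of inspecting failed futures on the client side, assuming the futures came from client.map(process, start_times) as above (the actual notebook may differ):

    # Sketch: surface worker-side errors on the client. A future in "error"
    # state exposes the remote exception and traceback for inspection.
    import traceback

    for fut in futures:                # `futures` from client.map(process, start_times)
        if fut.status == "error":
            print(fut.key, fut.exception())       # remote exception object
            traceback.print_tb(fut.traceback())   # remote traceback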

rsignell commented 5 months ago

This is all I see on the client side: https://nbviewer.org/gist/rsignell-usgs/b925c9968fe2db09c0413a080de3e99a

jrbourbeau commented 5 months ago

Hey @rsignell -- glad to hear you found the post useful. Sorry things aren't working out of the box for your application.

Sort of a side comment, but right now the process function is saving a new output file each time it's run, and the available disk on the VMs will eventually fill up:

[Screenshot: cluster disk usage metrics]

If you write the output file to the temporary directory in the snippet with

ncfile = f'{tmpdir}/opendrift_{start_time.strftime("%Y%m%d")}.nc'

the output files will be cleaned up after each process function call.
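
For reference, a minimal sketch of that idea; run_simulation is a hypothetical stand-in for the actual OpenDrift call in the notebook:

    # Sketch: write each run's NetCDF output inside a TemporaryDirectory so it
    # is removed automatically when the `process` call finishes.
    import tempfile
    import xarray as xr

    def process(start_time):
        with tempfile.TemporaryDirectory() as tmpdir:
            ncfile = f'{tmpdir}/opendrift_{start_time.strftime("%Y%m%d")}.nc'
            run_simulation(start_time, outfile=ncfile)  # hypothetical simulation call
            ds = xr.open_dataset(ncfile).load()         # pull results into memory
        return ds                                       # tmpdir (and ncfile) are deleted here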

Is the Xarray dataset returned by process backed by NumPy arrays, or Dask arrays?
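
For reference, a rough way to check, assuming ds is the xarray.Dataset returned by process:

    # Sketch: if any variable is backed by a dask array, call .load() so only
    # in-memory NumPy data is returned to the client.
    import dask.array as da

    lazy_vars = [name for name, v in ds.variables.items() if isinstance(v.data, da.Array)]
    if lazy_vars:
        ds = ds.load()  # force computation; result is NumPy-backed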

rsignell commented 5 months ago

Good to see you here @jrbourbeau!

rsignell commented 5 months ago

I changed both, and disk usage looks better, but it's still failing: https://cloud.coiled.io/clusters/367714/information?account=fathom-mgray&tab=Metrics

phofl commented 5 months ago

I am seeing some "file not found" errors:

"FileNotFoundError(2, 'No such file or directory')"

Do you want to hop on a call tomorrow to look at this together?

rsignell commented 5 months ago

@phofl Okay, I looked into it and apparently the Open Storage Network pod this data is on is undergoing maintenance/upgrade. :(

I'll copy the data to a more reliable location.

And yes, I would love to hop on a call tomorrow and look at this together! I'm available 9am-4pm ET.

phofl commented 5 months ago

I sent you an email.

rsignell commented 5 months ago

@phofl helped us get this going -- it involved:

  1. Use this conda environment file to create matching conda environments locally and on Coiled (a sketch of pointing the cluster at this environment follows after this list):

    mamba env create -f odrift-d7.yml
    coiled env create -n odrift-d7 --conda odrift-d7.yml

  2. Run this notebook locally.
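
For reference, a rough sketch of pointing a Coiled cluster at that software environment from the notebook; the software= and n_workers= keyword names are assumptions to check against the current Coiled documentation:

    # Sketch: reuse the "odrift-d7" software environment created above and run
    # the same futures pattern as the local run. Keyword names are assumptions.
    import coiled
    from distributed import Client

    cluster = coiled.Cluster(software="odrift-d7", n_workers=20)
    client = Client(cluster)
    futures = client.map(process, start_times)  # same pattern as the local sketch
    results = client.gather(futures)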