coiled / feedback

A place to provide Coiled feedback

At an impasse, need help. #258

Closed by adonoho 10 months ago

adonoho commented 10 months ago


I have now been struggling for over 7 days to promote a simple embarrassingly parallel app from my local server to use Coiled. This trial is documented here.

My dask.config.config in JSON:

    "coiled": {
        "server": "",
        "token": "bd5486e86c184d9babc9ddf44a955a3e-e1cf282def14056d5a9dd72e277fe0917bc56286",
        "user": "adonoho",
        "account": null,
        "backend-options": null,
        "no-minimum-version-check": false,
        "protocol": "tls",
        "scheduler-options": {},
        "worker-options": {},
        "wait-for-workers": 0.3,
        "software": null,
        "worker": {
            "cpu": null,
            "gpu": null,
            "memory": null,
            "class": null,
            "vm-types": null,
            "gpu-types": null
        "scheduler": {
            "cpu": null,
            "memory": null,
            "gpu": null,
            "class": null,
            "vm-types": null
        "name": null,
        "shutdown-on-close": true,
        "private-to-creator": false,
        "analytics": {
            "disabled": false,
            "computation": {
                "interval": "15s",
                "code": {
                    "transmit": true
            "profile": {
                "transmit": false,
                "interval": "60s"
            "events": {
                "interval": "10s",
                "allow": [
            "idle": {
                "timeout": null
    "temporary-directory": null,
    "visualization": {
        "engine": null
    "tokenize": {
        "ensure-deterministic": false
    "dataframe": {
        "backend": "pandas",
        "shuffle": {
            "method": null,
            "compression": null
        "parquet": {
            "metadata-task-size-local": 512,
            "metadata-task-size-remote": 1
        "convert-string": null
    "array": {
        "backend": "numpy",
        "chunk-size": "128MiB",
        "rechunk": {
            "method": "tasks",
            "threshold": 4
        "svg": {
            "size": 120
        "slicing": {
            "split-large-chunks": null
    "optimization": {
        "annotations": {
            "fuse": true
        "fuse": {
            "active": null,
            "ave-width": 1,
            "max-width": null,
            "max-height": Infinity,
            "max-depth-new-edges": null,
            "subgraphs": null,
            "rename-keys": true
    "admin": {
        "traceback": {
            "shorten": {
                "when": [
                "what": [
    "distributed": {
        "version": 2,
        "scheduler": {
            "allowed-failures": 3,
            "bandwidth": 100000000,
            "blocked-handlers": [],
            "contact-address": null,
            "default-data-size": "1kiB",
            "events-cleanup-delay": "1h",
            "idle-timeout": null,
            "transition-log-length": 100000,
            "events-log-length": 100000,
            "work-stealing": true,
            "work-stealing-interval": "100ms",
            "worker-saturation": 1.1,
            "worker-ttl": "5 minutes",
            "pickle": true,
            "preload": [],
            "preload-argv": [],
            "unknown-task-duration": "500ms",
            "default-task-durations": {
                "rechunk-split": "1us",
                "split-shuffle": "1us"
            "validate": false,
            "dashboard": {
                "status": {
                    "task-stream-length": 1000
                "tasks": {
                    "task-stream-length": 100000
                "tls": {
                    "ca-file": null,
                    "key": null,
                    "cert": null
                "bokeh-application": {
                    "allow_websocket_origin": [
                    "keep_alive_milliseconds": 500,
                    "check_unused_sessions_milliseconds": 500
            "locks": {
                "lease-validation-interval": "10s",
                "lease-timeout": "30s"
            "http": {
                "routes": [
            "allowed-imports": [
            "active-memory-manager": {
                "start": true,
                "interval": "2s",
                "measure": "optimistic",
                "policies": [
                        "class": "distributed.active_memory_manager.ReduceReplicas"
        "worker": {
            "blocked-handlers": [],
            "multiprocessing-method": "spawn",
            "use-file-locking": true,
            "transfer": {
                "message-bytes-limit": "50MB"
            "connections": {
                "outgoing": 50,
                "incoming": 10
            "preload": [],
            "preload-argv": [],
            "daemon": true,
            "validate": false,
            "resources": {},
            "lifetime": {
                "duration": null,
                "stagger": "0 seconds",
                "restart": false
            "profile": {
                "enabled": true,
                "interval": "10ms",
                "cycle": "1000ms",
                "low-level": false
            "memory": {
                "recent-to-old-time": "30s",
                "rebalance": {
                    "measure": "optimistic",
                    "sender-min": 0.3,
                    "recipient-max": 0.6,
                    "sender-recipient-gap": 0.1
                "transfer": 0.1,
                "target": 0.6,
                "spill": 0.7,
                "pause": 0.8,
                "terminate": 0.95,
                "max-spill": false,
                "spill-compression": "auto",
                "monitor-interval": "100ms"
            "http": {
                "routes": [
        "nanny": {
            "preload": [],
            "preload-argv": [],
            "environ": {},
            "pre-spawn-environ": {
                "MALLOC_TRIM_THRESHOLD_": 65536,
                "OMP_NUM_THREADS": 1,
                "MKL_NUM_THREADS": 1,
                "OPENBLAS_NUM_THREADS": 1
        "client": {
            "heartbeat": "5s",
            "scheduler-info-interval": "2s",
            "security-loader": null,
            "preload": [],
            "preload-argv": []
        "deploy": {
            "lost-worker-timeout": "15s",
            "cluster-repr-interval": "500ms"
        "adaptive": {
            "interval": "1s",
            "target-duration": "5s",
            "minimum": 0,
            "maximum": Infinity,
            "wait-count": 3
        "comm": {
            "retry": {
                "count": 0,
                "delay": {
                    "min": "1s",
                    "max": "20s"
            "compression": false,
            "shard": "64MiB",
            "offload": "10MiB",
            "default-scheme": "tcp",
            "socket-backlog": 2048,
            "recent-messages-log-length": 0,
            "ucx": {
                "cuda-copy": null,
                "tcp": null,
                "nvlink": null,
                "infiniband": null,
                "rdmacm": null,
                "create-cuda-context": null,
                "environment": {}
            "zstd": {
                "level": 3,
                "threads": 0
            "timeouts": {
                "connect": "30s",
                "tcp": "30s"
            "require-encryption": null,
            "tls": {
                "ciphers": null,
                "min-version": 1.2,
                "max-version": null,
                "ca-file": null,
                "scheduler": {
                    "cert": null,
                    "key": null
                "worker": {
                    "key": null,
                    "cert": null
                "client": {
                    "key": null,
                    "cert": null
            "tcp": {
                "backend": "tornado"
            "websockets": {
                "shard": "8MiB"
        "diagnostics": {
            "nvml": true,
            "computations": {
                "max-history": 100,
                "nframes": 2,
                "ignore-modules": [
            "erred-tasks": {
                "max-history": 100
        "dashboard": {
            "link": "{scheme}://{host}:{port}/status",
            "export-tool": false,
            "graph-max-items": 5000,
            "prometheus": {
                "namespace": "dask"
        "admin": {
            "tick": {
                "interval": "20ms",
                "limit": "3s",
                "cycle": "1s"
            "max-error-length": 10000,
            "log-length": 10000,
            "log-format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s",
            "pdb-on-err": false,
            "system-monitor": {
                "interval": "500ms",
                "disk": true,
                "host-cpu": false,
                "gil": {
                    "enabled": true,
                    "interval": "1ms"
            "event-loop": "tornado"
        "rmm": {
            "pool-size": null

My environment-coiled.yml is:

name: MatrixRecovery
channels:
  - conda-forge
  - defaults
dependencies:
  - blas[build=mkl]
  - numpy
  - python=3.10
  - icu>=70.1
  - cairo>=1.16.0
  - pandas
  - google-auth
  - pandas-gbq>=0.19.2
  - cvxpy
  - dask
  - coiled
  - sqlalchemy
  - pg8000
  - cloud-sql-python-connector
prefix: /Users/awd/opt/anaconda3/envs/MatrixRecovery

Allow me to be adamant in insisting that Coiled should use this file instead of Automatic Package Scanning.

My Coiled specific invocation function:

def do_coiled_experiment():
    exp = test_experiment()
    logging.info(f'{json.dumps(dask.config.config, indent=4)}')
    software_environment = 'adonoho/matrix_recovery'
    logging.info('Deleting environment.')
    coiled.delete_software_environment(software_environment)
    logging.info('Creating environment.')
    with coiled.Cluster(n_workers=4) as cluster:
        with Client(cluster) as client:
            do_on_cluster(exp, block_bp_instance_df, client, credentials=get_gbq_credentials())

What do_on_cluster() actually does is moot at this time. The system seems unable to select the proper packages. Coiled claims that the environment build was successful:

[2023-08-21 20:04:54,327][INFO    ][cloud-env.subproc] micromamba run -p /opt/coiled/env python -c "import json, sys; print(json.dumps(sys.path))"
["", "/opt/coiled/env/lib/", "/opt/coiled/env/lib/python3.10", "/opt/coiled/env/lib/python3.10/lib-dynload", "/opt/coiled/env/lib/python3.10/site-packages"]
[2023-08-21 20:04:54,593][INFO    ][] Calculating chunks
[2023-08-21 20:04:55,228][INFO    ][] Environment size is 4015 MB 
[2023-08-21 20:04:55,228][INFO    ][] Will split into 71 chunks
[2023-08-21 20:04:55,228][INFO    ][] Getting upload URLs
[2023-08-21 20:04:55,228][INFO    ][] Uploading 71 chunks
[2023-08-21 20:04:55,228][INFO    ][] Requesting multipart upload URLs for 71 chunks
[2023-08-21 20:04:55,786][INFO    ][] Received upload URLS for 71 chunks
--- Logs end, may be truncated, see for full output ---
INFO:coiled:Build successful

INFO:coiled:Software environment created

Then the following occurs:

INFO:coiled:Resolving your local Python environment...
INFO:coiled:Creating Cluster (name: adonoho-9d725136-f, ). This usually takes 1-2 minutes...
ERROR:coiled:   | Worker Process         | adonoho-9d725136-f-worker-34158ad2af           | error      at 15:06:29 (CDT) | Software build failed -> Conda package install failed with the following errors:

package cairo-1.12.18-7 requires icu 56.*, but none of the providers can be installed

Consider creating a new environment.
By specifying your packages at once, you're more likely to get a consistent set of versions.

Clearly, as you seem to have littered my account with a bunch of software environments, most of which have errors and are a sign of my increasing frustration with your service, something isn't right. Specifically, it does not appear that you are actually using the software environment that I have specified. Furthermore, the correct versions of cairo and icu are installed.

  cairo-1.16.0-ha61ee94_1014                                  7MB
  icu-70.1-h27087fc_0                                        44MB

Please help.

Anon, Andrew

mrocklin commented 10 months ago

Hi Andrew,

Sorry you've had a frustrating time.

To use an explicit software environment with your cluster you need to specify it with the software= keyword argument.

coiled.create_software_environment(name="myname", ...)
cluster = coiled.Cluster(software="myname", ...)
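A slightly fuller sketch of the same idea, for anyone landing here later. It assumes the conda file from the original post is in the working directory and that `coiled.create_software_environment` is given its path via `conda=`. The environment name, the `cluster_kwargs` helper, and the `run_on_explicit_environment` wrapper are hypothetical names used only for illustration, not part of Coiled.

```python
# Sketch only: ENV_NAME and both helper functions are illustrative names.
ENV_NAME = "matrix-recovery"          # hypothetical environment name
ENV_FILE = "environment-coiled.yml"   # conda file from the original post


def cluster_kwargs(software, n_workers=4):
    """Keyword arguments for coiled.Cluster that pin an explicit environment."""
    return {"software": software, "n_workers": n_workers}


def run_on_explicit_environment(work):
    """Create the named environment, start a cluster on it, and run work(client)."""
    # Imported lazily so cluster_kwargs() can be used without Coiled installed.
    import coiled
    from dask.distributed import Client

    # Build the named environment from the conda file.
    coiled.create_software_environment(name=ENV_NAME, conda=ENV_FILE)

    # Passing software= tells the cluster to use that environment by name.
    with coiled.Cluster(**cluster_kwargs(ENV_NAME)) as cluster:
        with Client(cluster) as client:
            return work(client)


# Example call (needs a Coiled account, so it is left commented out):
# run_on_explicit_environment(lambda client: client.submit(sum, [1, 2, 3]).result())
```

The point is that the name passed to `create_software_environment` and the name passed as `software=` must match. If `software=` is omitted, the log above ("Resolving your local Python environment...") suggests the cluster is built from the local environment instead.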

I'll add something similar to this at the top of to make it easier to find.

adonoho commented 10 months ago

Trying it now. Thank you.

adonoho commented 10 months ago

Success, the job ran to completion about 4 times faster than my standalone server. Thank you for your help.