coiled / feedback

A place to provide Coiled feedback

At an impasse, need help. #258

Closed adonoho closed 10 months ago

adonoho commented 10 months ago

Gentlefolk,

I have now been struggling for over 7 days to promote a simple embarrassingly parallel app from my local server to use Coiled. This trial is documented here.

My dask.config.config in JSON:

{
    "coiled": {
        "server": "https://cloud.coiled.io",
        "token": "bd5486e86c184d9babc9ddf44a955a3e-e1cf282def14056d5a9dd72e277fe0917bc56286",
        "user": "adonoho",
        "account": null,
        "backend-options": null,
        "no-minimum-version-check": false,
        "protocol": "tls",
        "scheduler-options": {},
        "worker-options": {},
        "wait-for-workers": 0.3,
        "software": null,
        "worker": {
            "cpu": null,
            "gpu": null,
            "memory": null,
            "class": null,
            "vm-types": null,
            "gpu-types": null
        },
        "scheduler": {
            "cpu": null,
            "memory": null,
            "gpu": null,
            "class": null,
            "vm-types": null
        },
        "name": null,
        "shutdown-on-close": true,
        "private-to-creator": false,
        "analytics": {
            "disabled": false,
            "computation": {
                "interval": "15s",
                "code": {
                    "transmit": true
                }
            },
            "profile": {
                "transmit": false,
                "interval": "60s"
            },
            "events": {
                "interval": "10s",
                "allow": [
                    "invalid-worker-transition",
                    "invalid-task-states",
                    "worker-fail-hard"
                ]
            },
            "idle": {
                "timeout": null
            }
        }
    },
    "temporary-directory": null,
    "visualization": {
        "engine": null
    },
    "tokenize": {
        "ensure-deterministic": false
    },
    "dataframe": {
        "backend": "pandas",
        "shuffle": {
            "method": null,
            "compression": null
        },
        "parquet": {
            "metadata-task-size-local": 512,
            "metadata-task-size-remote": 1
        },
        "convert-string": null
    },
    "array": {
        "backend": "numpy",
        "chunk-size": "128MiB",
        "rechunk": {
            "method": "tasks",
            "threshold": 4
        },
        "svg": {
            "size": 120
        },
        "slicing": {
            "split-large-chunks": null
        }
    },
    "optimization": {
        "annotations": {
            "fuse": true
        },
        "fuse": {
            "active": null,
            "ave-width": 1,
            "max-width": null,
            "max-height": Infinity,
            "max-depth-new-edges": null,
            "subgraphs": null,
            "rename-keys": true
        }
    },
    "admin": {
        "traceback": {
            "shorten": {
                "when": [
                    "dask[\\\\\\/]base.py",
                    "distributed[\\\\\\/]client.py"
                ],
                "what": [
                    "dask[\\\\\\/]base.py",
                    "dask[\\\\\\/]core.py",
                    "dask[\\\\\\/]array[\\\\\\/]core.py",
                    "dask[\\\\\\/]optimization.py",
                    "dask[\\\\\\/]dataframe[\\\\\\/]core.py",
                    "dask[\\\\\\/]dataframe[\\\\\\/]methods.py",
                    "dask[\\\\\\/]utils.py",
                    "distributed[\\\\\\/]worker.py",
                    "distributed[\\\\\\/]scheduler.py",
                    "distributed[\\\\\\/]client.py",
                    "distributed[\\\\\\/]utils.py",
                    "tornado[\\\\\\/]gen.py",
                    "pandas[\\\\\\/]core[\\\\\\/]"
                ]
            }
        }
    },
    "distributed": {
        "version": 2,
        "scheduler": {
            "allowed-failures": 3,
            "bandwidth": 100000000,
            "blocked-handlers": [],
            "contact-address": null,
            "default-data-size": "1kiB",
            "events-cleanup-delay": "1h",
            "idle-timeout": null,
            "transition-log-length": 100000,
            "events-log-length": 100000,
            "work-stealing": true,
            "work-stealing-interval": "100ms",
            "worker-saturation": 1.1,
            "worker-ttl": "5 minutes",
            "pickle": true,
            "preload": [],
            "preload-argv": [],
            "unknown-task-duration": "500ms",
            "default-task-durations": {
                "rechunk-split": "1us",
                "split-shuffle": "1us"
            },
            "validate": false,
            "dashboard": {
                "status": {
                    "task-stream-length": 1000
                },
                "tasks": {
                    "task-stream-length": 100000
                },
                "tls": {
                    "ca-file": null,
                    "key": null,
                    "cert": null
                },
                "bokeh-application": {
                    "allow_websocket_origin": [
                        "*"
                    ],
                    "keep_alive_milliseconds": 500,
                    "check_unused_sessions_milliseconds": 500
                }
            },
            "locks": {
                "lease-validation-interval": "10s",
                "lease-timeout": "30s"
            },
            "http": {
                "routes": [
                    "distributed.http.scheduler.prometheus",
                    "distributed.http.scheduler.info",
                    "distributed.http.scheduler.json",
                    "distributed.http.health",
                    "distributed.http.proxy",
                    "distributed.http.statics"
                ]
            },
            "allowed-imports": [
                "dask",
                "distributed"
            ],
            "active-memory-manager": {
                "start": true,
                "interval": "2s",
                "measure": "optimistic",
                "policies": [
                    {
                        "class": "distributed.active_memory_manager.ReduceReplicas"
                    }
                ]
            }
        },
        "worker": {
            "blocked-handlers": [],
            "multiprocessing-method": "spawn",
            "use-file-locking": true,
            "transfer": {
                "message-bytes-limit": "50MB"
            },
            "connections": {
                "outgoing": 50,
                "incoming": 10
            },
            "preload": [],
            "preload-argv": [],
            "daemon": true,
            "validate": false,
            "resources": {},
            "lifetime": {
                "duration": null,
                "stagger": "0 seconds",
                "restart": false
            },
            "profile": {
                "enabled": true,
                "interval": "10ms",
                "cycle": "1000ms",
                "low-level": false
            },
            "memory": {
                "recent-to-old-time": "30s",
                "rebalance": {
                    "measure": "optimistic",
                    "sender-min": 0.3,
                    "recipient-max": 0.6,
                    "sender-recipient-gap": 0.1
                },
                "transfer": 0.1,
                "target": 0.6,
                "spill": 0.7,
                "pause": 0.8,
                "terminate": 0.95,
                "max-spill": false,
                "spill-compression": "auto",
                "monitor-interval": "100ms"
            },
            "http": {
                "routes": [
                    "distributed.http.worker.prometheus",
                    "distributed.http.health",
                    "distributed.http.statics"
                ]
            }
        },
        "nanny": {
            "preload": [],
            "preload-argv": [],
            "environ": {},
            "pre-spawn-environ": {
                "MALLOC_TRIM_THRESHOLD_": 65536,
                "OMP_NUM_THREADS": 1,
                "MKL_NUM_THREADS": 1,
                "OPENBLAS_NUM_THREADS": 1
            }
        },
        "client": {
            "heartbeat": "5s",
            "scheduler-info-interval": "2s",
            "security-loader": null,
            "preload": [],
            "preload-argv": []
        },
        "deploy": {
            "lost-worker-timeout": "15s",
            "cluster-repr-interval": "500ms"
        },
        "adaptive": {
            "interval": "1s",
            "target-duration": "5s",
            "minimum": 0,
            "maximum": Infinity,
            "wait-count": 3
        },
        "comm": {
            "retry": {
                "count": 0,
                "delay": {
                    "min": "1s",
                    "max": "20s"
                }
            },
            "compression": false,
            "shard": "64MiB",
            "offload": "10MiB",
            "default-scheme": "tcp",
            "socket-backlog": 2048,
            "recent-messages-log-length": 0,
            "ucx": {
                "cuda-copy": null,
                "tcp": null,
                "nvlink": null,
                "infiniband": null,
                "rdmacm": null,
                "create-cuda-context": null,
                "environment": {}
            },
            "zstd": {
                "level": 3,
                "threads": 0
            },
            "timeouts": {
                "connect": "30s",
                "tcp": "30s"
            },
            "require-encryption": null,
            "tls": {
                "ciphers": null,
                "min-version": 1.2,
                "max-version": null,
                "ca-file": null,
                "scheduler": {
                    "cert": null,
                    "key": null
                },
                "worker": {
                    "key": null,
                    "cert": null
                },
                "client": {
                    "key": null,
                    "cert": null
                }
            },
            "tcp": {
                "backend": "tornado"
            },
            "websockets": {
                "shard": "8MiB"
            }
        },
        "diagnostics": {
            "nvml": true,
            "computations": {
                "max-history": 100,
                "nframes": 2,
                "ignore-modules": [
                    "distributed",
                    "dask",
                    "xarray",
                    "cudf",
                    "cuml",
                    "prefect",
                    "xgboost",
                    "coiled"
                ]
            },
            "erred-tasks": {
                "max-history": 100
            }
        },
        "dashboard": {
            "link": "{scheme}://{host}:{port}/status",
            "export-tool": false,
            "graph-max-items": 5000,
            "prometheus": {
                "namespace": "dask"
            }
        },
        "admin": {
            "tick": {
                "interval": "20ms",
                "limit": "3s",
                "cycle": "1s"
            },
            "max-error-length": 10000,
            "log-length": 10000,
            "log-format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s",
            "pdb-on-err": false,
            "system-monitor": {
                "interval": "500ms",
                "disk": true,
                "host-cpu": false,
                "gil": {
                    "enabled": true,
                    "interval": "1ms"
                }
            },
            "event-loop": "tornado"
        },
        "rmm": {
            "pool-size": null
        }
    }
}

My environment-coiled.yml is:

name: MatrixRecovery
channels:
  - conda-forge
  - defaults
dependencies:
  - blas[build=mkl]
  - numpy
  - python=3.10
  - icu>=70.1
  - cairo>=1.16.0
  - pandas
  - google-auth
  - pandas-gbq>=0.19.2
  - cvxpy
  - dask
  - coiled
  - sqlalchemy
  - pg8000
  - cloud-sql-python-connector
variables:
  MKL_NUM_THREADS: '1'
  OPENBLAS_NUM_THREADS: '1'
prefix: /Users/awd/opt/anaconda3/envs/MatrixRecovery

Allow me to be adamant in insisting that Coiled should use this file instead of Automatic Package Scanning.

My Coiled-specific invocation function:

def do_coiled_experiment():
    exp = test_experiment()
    logging.info(f'{json.dumps(dask.config.config, indent=4)}')
    software_environment = 'adonoho/matrix_recovery'
    logging.info('Deleting environment.')
    coiled.delete_software_environment(software_environment)
    logging.info('Creating environment.')
    coiled.create_software_environment(
        name=software_environment,
        conda="environment-coiled.yml",
        pip=[
            "git+https://GIT_TOKEN@github.com/adonoho/EMS.git"
        ]
    )
    with coiled.Cluster(n_workers=4) as cluster:
        with Client(cluster) as client:
            do_on_cluster(exp, block_bp_instance_df, client, credentials=get_gbq_credentials())

What do_on_cluster() actually does is moot at this time. The system seems to be unable to select the proper packages. Coiled claims that the environment build was successful:

[2023-08-21 20:04:54,327][INFO    ][cloud-env.subproc] micromamba run -p /opt/coiled/env python -c "import json, sys; print(json.dumps(sys.path))"
["", "/opt/coiled/env/lib/python310.zip", "/opt/coiled/env/lib/python3.10", "/opt/coiled/env/lib/python3.10/lib-dynload", "/opt/coiled/env/lib/python3.10/site-packages"]
[2023-08-21 20:04:54,593][INFO    ][cloud-env.build] Calculating chunks
[2023-08-21 20:04:55,228][INFO    ][cloud-env.build] Environment size is 4015 MB 
[2023-08-21 20:04:55,228][INFO    ][cloud-env.build] Will split into 71 chunks
[2023-08-21 20:04:55,228][INFO    ][cloud-env.build] Getting upload URLs
[2023-08-21 20:04:55,228][INFO    ][cloud-env.build] Uploading 71 chunks
[2023-08-21 20:04:55,228][INFO    ][cloud-env.build] Requesting multipart upload URLs for 71 chunks
[2023-08-21 20:04:55,786][INFO    ][cloud-env.build] Received upload URLS for 71 chunks
--- Logs end, may be truncated, see https://cloud.coiled.io/software/alias/34821/build/25394?account=adonoho&tab=logs for full output ---
INFO:coiled:Build successful

INFO:coiled:Software environment created

Then the following occurs:

INFO:coiled:Resolving your local Python environment...
INFO:coiled:Creating Cluster (name: adonoho-9d725136-f, https://cloud.coiled.io/clusters/260383?account=adonoho ). This usually takes 1-2 minutes...
ERROR:coiled:   | Worker Process         | adonoho-9d725136-f-worker-34158ad2af           | error      at 15:06:29 (CDT) | Software build failed -> Conda package install failed with the following errors:

package cairo-1.12.18-7 requires icu 56.*, but none of the providers can be installed

Consider creating a new environment.
By specifying your packages at once, you're more likely to get a consistent set of versions.

Clearly, something isn't right: you seem to have littered my account with a bunch of software environments, most of which have errors and are a sign of my increasing frustration with your service. Specifically, it does not appear that you are actually using the software environment that I have specified. Furthermore, the correct versions of cairo and icu are installed:

  cairo-1.16.0-ha61ee94_1014                                  7MB
  icu-70.1-h27087fc_0                                        44MB
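
The mismatch can be made explicit by splitting the package strings quoted above into name, version, and build. This is only an illustrative sketch: `parse_conda_spec` is a hypothetical helper, and it assumes the usual `name-version-build` layout of conda package strings.

```python
def parse_conda_spec(spec: str) -> tuple[str, str, str]:
    """Split a 'name-version-build' string into its three parts.

    Splitting from the right keeps hyphenated names (e.g. pandas-gbq) intact.
    """
    name, version, build = spec.strip().rsplit("-", 2)
    return name, version, build


# Packages listed for the environment that was built.
built = [
    parse_conda_spec("cairo-1.16.0-ha61ee94_1014"),
    parse_conda_spec("icu-70.1-h27087fc_0"),
]

# Package named in the worker's "Software build failed" error.
failed = parse_conda_spec("cairo-1.12.18-7")

built_versions = {name: version for name, version, _ in built}

# True when the worker tried to install a different version than the one built.
mismatch = built_versions[failed[0]] != failed[1]

print(f"built cairo {built_versions['cairo']}, worker tried cairo {failed[1]}, mismatch={mismatch}")
```

A different cairo version on the worker is consistent with the worker resolving some other environment than the one that was built.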

Please help.

Anon, Andrew

mrocklin commented 10 months ago

Hi Andrew,

Sorry you've had a frustrating time.

To use an explicit software environment with your cluster, you need to specify it with the software= keyword argument.

coiled.create_software_environment(name="myname", ...)
cluster = coiled.Cluster(software="myname", ...)

I'll add something similar to this at the top of https://docs.coiled.io/user_guide/software/manual.html to make it easier to find.
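
Applied to the invocation function from the original post, the suggestion might look roughly like the sketch below. It is not the code that was eventually run: `cluster_kwargs` and `run_on_coiled` are hypothetical helper names, and it assumes the name given to `create_software_environment` ("adonoho/matrix_recovery") can be passed unchanged to `software=`. Without `software=`, the log line "Resolving your local Python environment..." suggests the cluster resolved packages from the local environment rather than from the environment that had just been built.

```python
def cluster_kwargs(software_environment: str, n_workers: int = 4) -> dict:
    """Keyword arguments for coiled.Cluster that pin an explicit software environment."""
    return {"n_workers": n_workers, "software": software_environment}


def run_on_coiled(work, software_environment: str = "adonoho/matrix_recovery"):
    """Start a cluster on the named environment and call work(client).

    `work` is any callable that takes a dask.distributed.Client.
    """
    # Imported here so the helper above can be used without coiled installed.
    import coiled
    from dask.distributed import Client

    # Assumes the software environment has already been created under this name.
    with coiled.Cluster(**cluster_kwargs(software_environment)) as cluster:
        with Client(cluster) as client:
            return work(client)
```

With the names from the original post, usage would be along the lines of `run_on_coiled(lambda client: do_on_cluster(exp, block_bp_instance_df, client, credentials=get_gbq_credentials()))`.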

mrocklin commented 10 months ago

https://github.com/coiled/platform/pull/2799 (for internal users only)

adonoho commented 10 months ago

Trying it now. Thank you.

adonoho commented 10 months ago

Success. The job ran to completion about 4 times faster than on my standalone server. Thank you for your help.