coiled / feedback

A place to provide Coiled feedback
14 stars 3 forks source link

Backwards incompatible change on server side #216

Closed fjetter closed 2 years ago

fjetter commented 2 years ago

It appears we rolled out something on server side that made the client incompatible.

I was running version 0.2.38 and got the error below when creating a cluster

coiled.Cluster(package_sync=True)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 cluster = coiled.Cluster(package_sync=True)

File ~/mambaforge/envs/dask-distributed-310/lib/python3.10/site-packages/coiled/_beta/cluster.py:425, in ClusterBeta.__init__(self, name, software, n_workers, worker_class, worker_options, worker_vm_types, worker_cpu, worker_memory, worker_disk_size, worker_gpu, worker_gpu_type, scheduler_class, scheduler_options, scheduler_vm_types, scheduler_cpu, scheduler_memory, asynchronous, cloud, account, shutdown_on_close, use_scheduler_public_ip, credentials, timeout, environ, tags, send_dask_config, backend_options, show_widget, configure_logging, wait_for_workers, package_sync, package_sync_strict, package_sync_fail_on, allow_ingress_from, allow_ssh)
    423     error = e
    424     self.close()
--> 425     raise e
    426 finally:
    427     if error:

File ~/mambaforge/envs/dask-distributed-310/lib/python3.10/site-packages/coiled/_beta/cluster.py:407, in ClusterBeta.__init__(self, name, software, n_workers, worker_class, worker_options, worker_vm_types, worker_cpu, worker_memory, worker_disk_size, worker_gpu, worker_gpu_type, scheduler_class, scheduler_options, scheduler_vm_types, scheduler_cpu, scheduler_memory, asynchronous, cloud, account, shutdown_on_close, use_scheduler_public_ip, credentials, timeout, environ, tags, send_dask_config, backend_options, show_widget, configure_logging, wait_for_workers, package_sync, package_sync_strict, package_sync_fail_on, allow_ingress_from, allow_ssh)
    405 error = None
    406 try:
--> 407     self.sync(self._start)
    408 except (ClusterCreationError, InstanceTypeError) as e:
    409     error = e

File ~/mambaforge/envs/dask-distributed-310/lib/python3.10/site-packages/coiled/cluster.py:514, in Cluster.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    506 def sync(
    507     self,
    508     func: Callable[..., Awaitable[_T]],
   (...)
    512     **kwargs,
    513 ) -> Union[_T, Awaitable[_T]]:
--> 514     return super().sync(
    515         func,
    516         *args,
    517         asynchronous=asynchronous,
    518         callback_timeout=callback_timeout,
    519         **kwargs,
    520     )

File ~/workspace/distributed/distributed/utils.py:339, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    337     return future
    338 else:
--> 339     return sync(
    340         self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    341     )

File ~/workspace/distributed/distributed/utils.py:406, in sync(loop, func, callback_timeout, *args, **kwargs)
    404 if error:
    405     typ, exc, tb = error
--> 406     raise exc.with_traceback(tb)
    407 else:
    408     return result

File ~/workspace/distributed/distributed/utils.py:379, in sync.<locals>.f()
    377         future = asyncio.wait_for(future, callback_timeout)
    378     future = asyncio.ensure_future(future)
--> 379     result = yield future
    380 except Exception:
    381     error = sys.exc_info()

File ~/mambaforge/envs/dask-distributed-310/lib/python3.10/site-packages/tornado/gen.py:762, in Runner.run(self)
    759 exc_info = None
    761 try:
--> 762     value = future.result()
    763 except Exception:
    764     exc_info = sys.exc_info()

File ~/mambaforge/envs/dask-distributed-310/lib/python3.10/site-packages/coiled/context.py:77, in track_context.<locals>.wrapper(*args, **kwargs)
     75 else:
     76     with operation_context(name=f"{func.__module__}.{func.__qualname__}"):
---> 77         return await func(*args, **kwargs)

File ~/mambaforge/envs/dask-distributed-310/lib/python3.10/site-packages/coiled/_beta/cluster.py:739, in ClusterBeta._start(self)
    732     if not self.package_sync:
    733         parse_identifier(
    734             self.software_environment,
    735             property_name="software_environment",
    736             can_have_revision=False,
    737         )
--> 739     self.cluster_id = await cloud._create_cluster(
    740         account=self.account,
    741         name=self.name,
    742         workers=self._start_n_workers,
    743         software_environment=self.software_environment,
    744         worker_class=self.worker_class,
    745         worker_options=self.worker_options,
    746         worker_cpu=self.worker_cpu,
    747         worker_memory=self.worker_memory,
    748         worker_disk_size=self.worker_disk_size,
    749         gcp_worker_gpu_type=self.worker_gpu_type,
    750         gcp_worker_gpu_count=self.worker_gpu_count,
    751         scheduler_class=self.scheduler_class,
    752         scheduler_options=self.scheduler_options,
    753         scheduler_cpu=self.scheduler_cpu,
    754         scheduler_memory=self.scheduler_memory,
    755         environ=self.environ,
    756         tags=self.tags,
    757         dask_config=self.frozen_dask_config,
    758         scheduler_vm_types=scheduler_vm_types_to_use
    759         or default_instance_types,
    760         worker_vm_types=worker_vm_types_to_use or default_instance_types,
    761         backend_options=self.backend_options,
    762         use_scheduler_public_ip=self.use_scheduler_public_ip,
    763         auto_env=env,
    764     )
    765 if not self.cluster_id:
    766     raise RuntimeError(f"Failed to find/create cluster {self.name}")

File ~/mambaforge/envs/dask-distributed-310/lib/python3.10/site-packages/coiled/context.py:77, in track_context.<locals>.wrapper(*args, **kwargs)
     75 else:
     76     with operation_context(name=f"{func.__module__}.{func.__qualname__}"):
---> 77         return await func(*args, **kwargs)

File ~/mambaforge/envs/dask-distributed-310/lib/python3.10/site-packages/coiled/_beta/core.py:678, in CloudBeta._create_cluster(self, name, software_environment, worker_class, worker_options, worker_cpu, worker_memory, scheduler_class, scheduler_options, scheduler_cpu, scheduler_memory, account, workers, environ, tags, dask_config, scheduler_vm_types, gcp_worker_gpu_type, gcp_worker_gpu_count, worker_vm_types, worker_disk_size, backend_options, use_scheduler_public_ip, auto_env)
    658 data = {
    659     "name": name,
    660     "workers": workers,
   (...)
    675     # "jupyter_on_scheduler": True,
    676 }
    677 if auto_env is not None:
--> 678     data["env_id"] = await self._create_software_environment_v2(
    679         account=account, senv=auto_env
    680     )
    681 if gcp_worker_gpu_type is not None:
    682     # for backwards compatibility with v1 options
    683     backend_options = backend_options if backend_options else {}

File ~/mambaforge/envs/dask-distributed-310/lib/python3.10/site-packages/coiled/context.py:77, in track_context.<locals>.wrapper(*args, **kwargs)
     75 else:
     76     with operation_context(name=f"{func.__module__}.{func.__qualname__}"):
---> 77         return await func(*args, **kwargs)

File ~/mambaforge/envs/dask-distributed-310/lib/python3.10/site-packages/coiled/_beta/core.py:574, in CloudBeta._create_software_environment_v2(self, senv, account)
    572 env: List[PackageSchema] = []
    573 for pkg in file_packages:
--> 574     file_id = await self._create_senv_package(pkg["sdist"], md5=pkg["md5"])
    575     env.append(
    576         {
    577             "name": pkg["name"],
   (...)
    585         }
    586     )
    587 for pkg in [pkg for pkg in senv if not pkg["sdist"]]:

File ~/mambaforge/envs/dask-distributed-310/lib/python3.10/site-packages/coiled/context.py:77, in track_context.<locals>.wrapper(*args, **kwargs)
     75 else:
     76     with operation_context(name=f"{func.__module__}.{func.__qualname__}"):
---> 77         return await func(*args, **kwargs)

File ~/mambaforge/envs/dask-distributed-310/lib/python3.10/site-packages/coiled/_beta/core.py:531, in CloudBeta._create_senv_package(self, package_file, md5, account)
    528 data = await response.json()
    529 # cant use the default session as it has coiled auth
    530 # headers
--> 531 if data["created"]:
    532     try:
    533         await self._put_package(
    534             url=data["upload_url"], package_data=package_data
    535         )

KeyError: 'created'

Upgrading to latest resolved the issue.

Are we sticking to semantic versioning?

dchudz commented 2 years ago

Sorry Florian. We were treating package sync as pretty "beta" for awhile.

@shughes-uk At this point I think we need to be more careful about requiring upgrades, right?

fjetter commented 2 years ago

No problem for me. Just wanted to raise awareness since others may run into it as well

fjetter commented 1 year ago

FYI This happened again

File ~/mambaforge/envs/dask-distributed-310-main/lib/python3.10/site-packages/coiled/_beta/core.py:535, in CloudBeta._create_senv_package(self, package_file, md5, account)
    528 response = await self._do_request(
    529     "POST",
    530     self.server
    531     + f"/api/v2/software-environment/account/{account}/package-upload",
    532     json={"name": Path(package_file.name).name, "md5": md5},
    533 )
    534 data = await response.json()
--> 535 if data["should_upload"]:
    536     try:
    537         await self._put_package(
    538             url=data["upload_url"], package_data=package_data
    539         )

I was running 0.2.42

From a UX perspective this is really bad. At the very least there should be a warning letting me know that there is a newer version available

dchudz commented 1 year ago
  1. Sorry Florian! We'll try to be better.
  2. 0.2.42 is from Nov 8. So I don't think this is a new example of the problem. (Of course a new example of it causing you pain is still relevant, and we do care.)
dchudz commented 1 year ago

Also, any particular reason you switched to a public issue? (https://github.com/coiled/feedback/issues/221)

There's some circumstances where engagement in the public feedback repo seems valuable, [sorry if this sounds grouchy] but an old bug that you hit again b/c you were still using an old version doesn't strike me as one.

dchudz commented 1 year ago

Ohh, I'm sorry. I guess this is a new instance of the issue?

dchudz commented 1 year ago

You upgraded from 0.2.38 to 0.2.42 to make things work a couple weeks ago, and then hit it again in 0.2.42 because of recent changes.

fjetter commented 1 year ago

Also, any particular reason you switched to a public issue? (https://github.com/coiled/feedback/issues/221)

This is also public. Didn't switch but I can open these things in the private tracker from now on

fjetter commented 1 year ago

You upgraded from 0.2.38 to 0.2.42 to make things work a couple weeks ago, and then hit it again in 0.2.42 because of recent changes.

I know this is not identical. I wanted to reduce spam for you. Will open a dedicated issue next time. Feel the pain!

dchudz commented 1 year ago

This is also public.

Ah indeed it is! :-) Sorry, confused about what repo I'm in. That's okay, not saying anything that shouldn't be public.

dchudz commented 1 year ago

Will open a dedicated issue next time. Feel the pain!

Either way is fine. Mostly I'm just sorry, and glad we're hearing from you.

ntabris commented 1 year ago

I suspect the error in https://github.com/coiled/feedback/issues/216#issuecomment-1333623555 isn't about backwards compatibility. I'd want to make sure client-side code is checking for non-200 response (maybe it is already, not sure).