Can you try again with ray-2.9.0 which was released a few hours ago?
Using ray 2.9.0 and grpcio:
- 1.59.3 - working
- 1.58.2 - broken
- 1.57.0 - working
- 1.56.0 - working
- 1.55.1 - broken
- 1.54.3 - broken
It seems like ray 2.9.0 is also broken in that it doesn't specify any minimum grpcio version pin; it should be `>=1.56,!=1.58`.
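Expressed as an environment spec, that constraint would look roughly like the following. This is a sketch only; the `ray-client` selection and the exact bounds are my reading of the version matrix above, not something the feedstock currently ships:

```yaml
# Sketch of a workaround environment that avoids the grpcio builds that hang.
# Package selection and bounds are assumptions derived from the results above.
name: ray-grpcio-workaround
channels:
  - conda-forge
dependencies:
  - python 3.11
  - ray-client 2.9.0
  - grpcio >=1.56,!=1.58.*
```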
Logs:
**1.59.3**
```
2024-01-16 15:14:19,038 INFO client_builder.py:243 -- Passing the following kwargs to ray.init() on the server: logging_level
2024-01-16 15:14:19,051 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.IDLE
2024-01-16 15:14:19,255 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.CONNECTING
2024-01-16 15:14:19,256 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.READY
2024-01-16 15:14:19,256 DEBUG worker.py:813 -- Pinging server.
hello
2024-01-16 15:14:21,552 DEBUG dataclient.py:294 -- Got unawaited response connection_cleanup {
}
2024-01-16 15:14:21,738 DEBUG dataclient.py:285 -- Shutting down data channel.
```

**1.58.2**
```
2024-01-16 15:03:31,393 INFO client_builder.py:243 -- Passing the following kwargs to ray.init() on the server: logging_level
2024-01-16 15:03:31,409 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.IDLE
2024-01-16 15:03:31,613 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.CONNECTING
2024-01-16 15:03:31,613 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.READY
2024-01-16 15:03:31,614 DEBUG worker.py:813 -- Pinging server.
```

**1.57.0**
```
2024-01-16 15:28:20,436 INFO client_builder.py:243 -- Passing the following kwargs to ray.init() on the server: logging_level
2024-01-16 15:28:20,452 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.IDLE
2024-01-16 15:28:20,656 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.CONNECTING
2024-01-16 15:28:20,658 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.READY
2024-01-16 15:28:20,658 DEBUG worker.py:813 -- Pinging server.
SIGTERM handler is not set because current thread is not the main thread.
hello
2024-01-16 15:28:23,003 DEBUG dataclient.py:294 -- Got unawaited response connection_cleanup {
}
2024-01-16 15:28:23,238 DEBUG dataclient.py:285 -- Shutting down data channel.
```

**1.56.0**
```
2024-01-16 15:06:18,841 INFO client_builder.py:243 -- Passing the following kwargs to ray.init() on the server: logging_level
2024-01-16 15:06:18,854 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.IDLE
2024-01-16 15:06:19,058 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.CONNECTING
2024-01-16 15:06:19,060 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.READY
2024-01-16 15:06:19,060 DEBUG worker.py:813 -- Pinging server.
hello
2024-01-16 15:06:21,459 DEBUG dataclient.py:294 -- Got unawaited response connection_cleanup {
}
2024-01-16 15:06:21,630 DEBUG dataclient.py:285 -- Shutting down data channel.
```

**1.55.1**
```
2024-01-16 15:19:04,521 INFO client_builder.py:243 -- Passing the following kwargs to ray.init() on the server: logging_level
2024-01-16 15:19:04,532 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.IDLE
2024-01-16 15:19:04,737 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.CONNECTING
2024-01-16 15:19:04,738 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.TRANSIENT_FAILURE
2024-01-16 15:19:09,534 DEBUG worker.py:226 -- Couldn't connect channel in 5 seconds, retrying
2024-01-16 15:19:09,534 DEBUG worker.py:237 -- Waiting for Ray to become ready on the server, retry in 5s...
```

**1.54.3**
```
2024-01-16 15:11:17,451 INFO client_builder.py:243 -- Passing the following kwargs to ray.init() on the server: logging_level
2024-01-16 15:11:17,465 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.IDLE
2024-01-16 15:11:17,669 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.CONNECTING
2024-01-16 15:11:17,671 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.TRANSIENT_FAILURE
2024-01-16 15:11:22,466 DEBUG worker.py:226 -- Couldn't connect channel in 5 seconds, retrying
2024-01-16 15:11:22,466 DEBUG worker.py:237 -- Waiting for Ray to become ready on the server, retry in 5s...
```
Hmm. I think the pin was removed in dcc2f630f83ed324fb507edcff06e2442275094c by @timkpaine. Was there a reason for that?
FWIW, the pin `>=1.56,!=1.58` comes from upstream, and worked for 2.8.0. In the first comment of this issue you are using 1.58.2, which is not supported.
> In the first comment of this issue you are using 1.58.2 which is not supported.

If it is not supported, then why does the environment solve?
The pin in 2.8.x is `>=1.50,<1.59`, which suggests 1.58 should work...
Additionally, the conda-forge pinning repo specifies 1.58, which is why my environment picked 1.58 - many other packages (correctly) pin to the grpcio version they built against.
Where is the upstream pin you speak of?
Really I think the problem here is twofold:

1. `ray-core` does not specify a build dependency on grpc that matches the vendored copy of grpc from upstream.
2. `grpcio` specifies a `run_exports`: https://github.com/conda-forge/grpc-cpp-feedstock/blob/main/recipe/meta.yaml#L45

If `ray-core` took it as a build dependency, then `ray-core` would pick the correct, compatible version of grpcio at runtime.
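Roughly the kind of recipe change I mean, as a hedged sketch rather than the actual feedstock recipe (the `libgrpc` name, the `1.57.*` version, and the section layout are assumptions):

```yaml
# Hypothetical fragment of the ray-core recipe (meta.yaml), not the real one.
# The idea: list the grpc library in host so its run_exports propagates a
# matching runtime constraint, keeping grpcio compatible with ray's vendored grpc.
requirements:
  host:
    - libgrpc 1.57.*   # assumed to match ray 2.9.0's vendored grpc
  run:
    - grpcio           # then resolves to a build linked against a compatible libgrpc
```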
1.57.1 is the upstream version - https://github.com/ray-project/ray/blob/ray-2.9.0/bazel/ray_deps_setup.bzl#L246
This is unfortunate as it is broken on conda-forge:
```
Encountered problems while solving:
  - package grpcio-1.57.1-py310h1b8f574_0 requires libgrpc 1.57.1 hd92f1f0_0, but none of the providers can be installed
```
1.57.0 works, but given the conda-forge wide pin it would be better to get 1.58 working: https://github.com/conda-forge/conda-forge-pinning-feedstock/blob/main/recipe/conda_build_config.yaml#L496-L497
Please address this upstream in https://github.com/ray-project/ray/blob/master/python/setup.py if pinning grpc is desired.
However, I don't think this has anything to do with ray: grpcio-tools is not available for the versions you show as broken, and the grpc version must equal the grpcio-tools version. We don't have a dependency on it; some other dependency must be bringing it in without a proper version constraint (e.g. your https://github.com/conda-forge/grpcio-tools-feedstock/issues/26).
This is definitely a conda-forge packaging issue.
Installing ray 2.8.1 and grpcio 1.58.0 using pip works:
```yaml
---
name: test
channels:
  - conda-forge
dependencies:
  - python 3.11
  - pip
  - pip:
    - ray==2.8.1
    - grpcio==1.58.0
```
```
$ env/bin/ray start --head
Usage stats collection is disabled.
Local node IP: 172.18.204.157
--------------------
Ray runtime started.
--------------------
Next steps
  To add another node to this Ray cluster, run
    ray start --address='172.18.204.157:6379'
  To connect to this Ray cluster:
    import ray
    ray.init()
  To terminate the Ray runtime, run
    ray stop
  To view the status of the cluster, use
    ray status

$ env/bin/python -c 'import ray; import logging; ray.init(address="ray://127.0.0.1:10001", logging_level=logging.DEBUG); print("hello")'
2024-01-16 16:57:51,816 INFO client_builder.py:243 -- Passing the following kwargs to ray.init() on the server: logging_level
2024-01-16 16:57:51,845 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.IDLE
2024-01-16 16:57:52,053 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.CONNECTING
2024-01-16 16:57:52,054 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.READY
2024-01-16 16:57:52,055 DEBUG worker.py:813 -- Pinging server.
hello
2024-01-16 16:57:54,209 DEBUG dataclient.py:294 -- Got unawaited response connection_cleanup {
}
2024-01-16 16:57:54,440 DEBUG dataclient.py:285 -- Shutting down data channel.
```
Installing ray 2.8.1 and grpcio 1.58.1 using conda-forge does not (unfortunately 1.58.0 is not available for a true 1:1 comparison, but it is close enough):
```yaml
---
name: test
channels:
  - conda-forge
dependencies:
  - python 3.11
  - ray-all 2.8.1
  - grpcio 1.58.1
```
```
$ env/bin/ray start --head
Usage stats collection is disabled.
Local node IP: 172.18.204.157
--------------------
Ray runtime started.
--------------------
Next steps
  To add another node to this Ray cluster, run
    ray start --address='172.18.204.157:6379'
  To connect to this Ray cluster:
    import ray
    ray.init()
  To submit a Ray job using the Ray Jobs CLI:
    RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py
  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
  for more information on submitting Ray jobs to the Ray cluster.
  To terminate the Ray runtime, run
    ray stop
  To view the status of the cluster, use
    ray status
  To monitor and debug Ray, view the dashboard at
    127.0.0.1:8265
  If connection to the dashboard fails, check your firewall settings and network configuration.

$ env/bin/python -c 'import ray; import logging; ray.init(address="ray://127.0.0.1:10001", logging_level=logging.DEBUG); print("hello")'
2024-01-16 17:26:39,711 INFO client_builder.py:243 -- Passing the following kwargs to ray.init() on the server: logging_level
2024-01-16 17:26:39,739 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.IDLE
2024-01-16 17:26:39,943 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.CONNECTING
2024-01-16 17:26:39,944 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.READY
2024-01-16 17:26:39,945 DEBUG worker.py:813 -- Pinging server.
<hangs forever>
```
As you can see, grpcio-tools is not installed in this environment:
```
$ ls env/conda-meta/ | grep grpc
grpcio-1.58.1-py311ha6695c7_2.json
libgrpc-1.58.1-he06187c_2.json
```
@timkpaine I don't understand why you closed this. Clearly there is a problem, see the comment above. That is why we had a pin for grpcio in this recipe. Now, it is true that the pin does not come from upstream, but it is needed for conda-forge.
I think there is something subtly wrong with the interaction between ray's vendored grpc and the grpcio build on conda-forge.
Also related to #90.
I don't understand the comments about grpcio-tools. Could you explain a bit more what you mean by

> grpc version must == grpcio-tools version
There is nothing wrong with this recipe, and any pinning of grpc needs to be done upstream in ray. We shouldn't be pinning things that upstream doesn't do, except out of necessity. This issue should be closed as there's nothing actionable here. If you want to force certain grpc versions for certain prior ray versions, use https://github.com/conda-forge/conda-forge-repodata-patches-feedstock.
> that is why we had a pin for grpcio in this recipe

The pin used was arbitrary and not actively maintained, which is a good reason to avoid it and fix the root cause instead.
It seems like perhaps the grpcio 1.58.x package itself is broken on conda-forge.
`env1.yml`:
```yaml
---
name: test
channels:
  - conda-forge
dependencies:
  - python 3.11
  - ray-all 2.9.0
  - grpcio 1.58.1
```

`env2.yml`:
```yaml
---
name: test
channels:
  - conda-forge
dependencies:
  - python 3.11
  - pip
  - pip:
    - ray==2.9.0
    - grpcio==1.58.0
```
`Makefile`:
```makefile
all: repro

env1/bin/python: env1.yml
	@rm -rf env1
	@mamba env create -f env1.yml -p env1

env2/bin/python: env2.yml
	@rm -rf env2
	@mamba env create -f env2.yml -p env2

solve: env1/bin/python env2/bin/python

repro: solve
	env1/bin/ray stop
	env1/bin/ray start --head
	env1/bin/python -c 'import ray; ray.init(address="ray://127.0.0.1:10001"); print("hello")'

env1/lib/python3.11/site-packages/grpc/_cython/cygrpc.cpython-311-x86_64-linux-gnu.so.bak:
	cp env1/lib/python3.11/site-packages/grpc/_cython/cygrpc.cpython-311-x86_64-linux-gnu.so env1/lib/python3.11/site-packages/grpc/_cython/cygrpc.cpython-311-x86_64-linux-gnu.so.bak

fix: env1/lib/python3.11/site-packages/grpc/_cython/cygrpc.cpython-311-x86_64-linux-gnu.so.bak
	env1/bin/ray stop
	cp env2/lib/python3.11/site-packages/grpc/_cython/cygrpc.cpython-311-x86_64-linux-gnu.so env1/lib/python3.11/site-packages/grpc/_cython/cygrpc.cpython-311-x86_64-linux-gnu.so

break:
	env1/bin/ray stop
	cp env1/lib/python3.11/site-packages/grpc/_cython/cygrpc.cpython-311-x86_64-linux-gnu.so.bak env1/lib/python3.11/site-packages/grpc/_cython/cygrpc.cpython-311-x86_64-linux-gnu.so
```
- Run `make repro` and observe it hanging forever.
- Run `make fix`, then `make repro`, and observe it working now.
- Run `make break`, then `make repro`, and observe it hanging forever again.
Specifically, copying the Cython extension from the PyPI release of grpcio 1.58.0 on top of the conda-forge build of grpcio 1.58.1 fixes this.
Not sure how to proceed here.
Probably worth reporting on the grpc-cpp feedstock. I also observed widespread problems with 1.58 on conda and went to 1.57, but I thought it was due to bad interactions with grpcio-tools, which you've shown not to be the case.
Again, if there's action to be taken, it needs to be in the repodata patches repo. If, e.g., you add a pin here for ray and increment the build number to 1, the solver can easily just pick the earlier build 0.
@timkpaine please do not close this issue until we have reached a resolution.
@mattip please stop reopening it; it is not resolvable in this repo. Ray works fine with grpc 1.58, as already demonstrated above. If there is a problem with conda-forge's grpc 1.58, it should not be resolved in this repo (it will need to be the grpc-cpp-feedstock, and either admin-requests if it is broken or repodata-patches if existing deps need to be adjusted). If there is an issue specifically with ray and its grpc dependency: if it is conda-forge only, it needs to be repodata-patches, and if it is upstream, it should be fixed concurrently in this repo and upstream. The issue opener seems to want to put a `!=1.58` pin in the feedstock, but that is not the right action to take, and the issue should be closed. Adding a pin in the feedstock will not affect existing builds; e.g. an install of `grpc=1.58 ray` will simply resolve to a build number without that pin, which is undesirable behavior.
I opened conda-forge/grpc-cpp-feedstock#343. I am not convinced that the problem lies elsewhere: the grpcio feedstock is in use by other projects and they did not report problems with 1.58 (from the feedstock). It may be some subtle interaction between the way conda-forge builds grpcio and the way bazel builds ray. I am glad we can suggest a solution, which is to avoid (ray 2.8.0, ray 2.9.0) + grpcio 1.58 (from conda-forge), even if we have no way to enforce it. Do we know if the same problem occurs with earlier versions of ray?

Edit: tried to make it clear that the 1.58 problems are only with the conda-forge build, not with a pip-installed version.
@mattip There is a way to enforce it, which is the main way: via repodata-patches. We can easily say "when you use ray on conda, don't use grpc 1.58" and it will be effective across all released versions, whereas e.g. putting a pin in this repo will not affect any of the existing released versions. I would also suspect a bad interaction given ray's vendoring of grpc, but it's strange that it only happens with certain grpc versions (and there are issues on the grpc feedstock suggesting other bad interactions, so I wouldn't write off the possibility that 1.58 is just subtly broken in general).
Here's a great example (by the OP!) https://github.com/conda-forge/conda-forge-repodata-patches-feedstock/pull/618
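For context, such a patch rewrites the dependency metadata of builds that are already published; the real change is Python code in that feedstock, but the intended effect is roughly the following sketch, with the record name and other entries as placeholders rather than real metadata:

```yaml
# Illustration only: the effective change a repodata patch would make to an
# existing ray-core record. Filename, build string, and other depends entries
# are placeholders, not copied from actual repodata.
"ray-core-2.9.0-py311_0.conda":
  depends:
    - python >=3.11,<3.12   # placeholder, unchanged
    - grpcio !=1.58.*       # constraint added by the patch
```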
TL;DR: @apmorton did you come across this problem when trying to install a certain combination of packages without pinning versions? Or is there a reason you are interested in grpcio<1.59?
I did a dependency analysis using `repoquery whoneeds` and `repoquery depends` on conda-forge to try to figure out which packages pin to older versions of grpcio. Disclaimer: this only checks the latest version of each package; if a previous version of a package pins grpcio<X, the resolution may be forced to an earlier version of grpcio.
```
$ for p in $(conda repoquery whoneeds -c conda-forge grpcio | cut -f2 -d" " | sort -u | grep -v ────); do echo $p; echo " " $(conda repoquery depends $p | grep "^ grpcio "); done
```
TL;DR: most packages have updated to 1.60. The packages that do not pin grpcio (like ray-core 2.9.0) are reported as depending on grpcio-1.14. Here are the ones that are left:
- flytekitplugins-modin: grpcio 1.46.3 py39hf176720_1 conda-forge linux-64
- pymilvus: grpcio 1.37.1 py39hff7568b_0 conda-forge linux-64
- tensorflow-base: grpcio 1.54.3 py39h227be39_0 conda-forge linux-64
Trying to install these packages together with ray-client:
```
$ conda create -n throwaway ray-client tensorflow-base        -> grpcio 1.59
$ conda create -n throwaway ray-client pymilvus               -> grpcio 1.33
$ conda create -n throwaway ray-client flytekitplugins-modin  -> fails to resolve
```
Those last two packages seem to be unmaintained and not very popular. Bottom line: are there packages that in practice resolve with ray to a bad version of grpcio?
I am going to close this for now. I think the problems with jax (mentioned elsewhere as a source of problems with ray's grpcio pin) transcend just ray, and prevent jax from being used with many other conda-forge packages that have also adopted the current conda-forge migration to grpcio 1.59.
Solution to issue cannot be found in the documentation.
Issue
The following works if you install `grpcio<1.56`.

Installed packages
Environment info
Related #90