Open FurcyPin opened 2 years ago
@TheNeuralBit FYI
Thanks @FurcyPin and @damccorm.
Note we already have #22659 for exercising these notebooks continuously. I don't think that would have caught this particular issue because this is a version conflict with google-colab. We might separately test that we don't have version conflicts with colab. Let's use this issue just to track this particular incompatibility; we can consider filing a follow-up for that testing.
For this particular incompatibility, the dependencies in question are from the interactive extra. @KevinGG can we expand the version ranges on these dependencies to be compatible with the google-colab deps?
Note these deps on colab-tools are very out of date. There are some issues filed against that project to address it, e.g. https://github.com/googlecolab/colabtools/issues/2230
I would +1 asking colab-tools to update their dependency, as ipykernel v6+ now supports the Jupyter native debugger.
And ipython v7 and v8 both work with ipykernel v5+. The only issue is that ipython v7 is not backward compatible with v6. I suggest they only need to upgrade their ipython dep, and then they can start using ipykernel v6 without any compatibility issue.
Is Jupyter native debugger support the only reason for the ipykernel>6 requirement? That seems like a dev requirement, not something we need to impose on our users.
It would be nice to get this fixed in google-colab since it will fix already released SDKs, but that issue has been open for a year now. It would be good to try to mitigate from our end.
Ah actually the ipykernel version conflict is non-blocking, apache-beam is still installed (we should still address this though since it's disconcerting, thanks for #23599).
The other, protobuf, issue is a hard blocker though. Here is the full stacktrace:
[/usr/local/lib/python3.7/dist-packages/apache_beam/runners/interactive/interactive_beam.py](https://localhost:8080/#) in <module>
49 from apache_beam.options.pipeline_options import FlinkRunnerOptions
50 from apache_beam.runners.interactive import interactive_environment as ie
---> 51 from apache_beam.runners.interactive.dataproc.dataproc_cluster_manager import DataprocClusterManager
52 from apache_beam.runners.interactive.dataproc.types import ClusterIdentifier
53 from apache_beam.runners.interactive.dataproc.types import ClusterMetadata
[/usr/local/lib/python3.7/dist-packages/apache_beam/runners/interactive/dataproc/dataproc_cluster_manager.py](https://localhost:8080/#) in <module>
30
31 try:
---> 32 from google.cloud import dataproc_v1
33 from apache_beam.io.gcp import gcsfilesystem #pylint: disable=ungrouped-imports
34 except ImportError:
[/usr/local/lib/python3.7/dist-packages/google/cloud/dataproc_v1/__init__.py](https://localhost:8080/#) in <module>
15 #
16
---> 17 from .services.autoscaling_policy_service import AutoscalingPolicyServiceClient
18 from .services.autoscaling_policy_service import AutoscalingPolicyServiceAsyncClient
19 from .services.batch_controller import BatchControllerClient
[/usr/local/lib/python3.7/dist-packages/google/cloud/dataproc_v1/services/autoscaling_policy_service/__init__.py](https://localhost:8080/#) in <module>
14 # limitations under the License.
15 #
---> 16 from .client import AutoscalingPolicyServiceClient
17 from .async_client import AutoscalingPolicyServiceAsyncClient
18
[/usr/local/lib/python3.7/dist-packages/google/cloud/dataproc_v1/services/autoscaling_policy_service/client.py](https://localhost:8080/#) in <module>
35 from google.cloud.dataproc_v1.services.autoscaling_policy_service import pagers
36 from google.cloud.dataproc_v1.types import autoscaling_policies
---> 37 from .transports.base import AutoscalingPolicyServiceTransport, DEFAULT_CLIENT_INFO
38 from .transports.grpc import AutoscalingPolicyServiceGrpcTransport
39 from .transports.grpc_asyncio import AutoscalingPolicyServiceGrpcAsyncIOTransport
[/usr/local/lib/python3.7/dist-packages/google/cloud/dataproc_v1/services/autoscaling_policy_service/transports/__init__.py](https://localhost:8080/#) in <module>
17 from typing import Dict, Type
18
---> 19 from .base import AutoscalingPolicyServiceTransport
20 from .grpc import AutoscalingPolicyServiceGrpcTransport
21 from .grpc_asyncio import AutoscalingPolicyServiceGrpcAsyncIOTransport
[/usr/local/lib/python3.7/dist-packages/google/cloud/dataproc_v1/services/autoscaling_policy_service/transports/base.py](https://localhost:8080/#) in <module>
31 try:
32 DEFAULT_CLIENT_INFO = gapic_v1.client_info.ClientInfo(
---> 33 gapic_version=pkg_resources.get_distribution("google-cloud-dataproc",).version,
34 )
35 except pkg_resources.DistributionNotFound:
[/usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py](https://localhost:8080/#) in get_distribution(dist)
464 dist = Requirement.parse(dist)
465 if isinstance(dist, Requirement):
--> 466 dist = get_provider(dist)
467 if not isinstance(dist, Distribution):
468 raise TypeError("Expected string, Requirement, or Distribution", dist)
[/usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py](https://localhost:8080/#) in get_provider(moduleOrReq)
340 """Return an IResourceProvider for the named module or requirement"""
341 if isinstance(moduleOrReq, Requirement):
--> 342 return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
343 try:
344 module = sys.modules[moduleOrReq]
[/usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py](https://localhost:8080/#) in require(self, *requirements)
884 included, even if they were already activated in this working set.
885 """
--> 886 needed = self.resolve(parse_requirements(requirements))
887
888 for dist in needed:
[/usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py](https://localhost:8080/#) in resolve(self, requirements, env, installer, replace_conflicting, extras)
775 # Oops, the "best" so far conflicts with a dependency
776 dependent_req = required_by[req]
--> 777 raise VersionConflict(dist, req).with_context(dependent_req)
778
779 # push the new requirements onto the stack
ContextualVersionConflict: (protobuf 3.17.3 (/usr/local/lib/python3.7/dist-packages), Requirement.parse('protobuf<5.0.0dev,>=3.19.0'), {'proto-plus'})
It looks like this is happening when importing DataprocClusterManager.
ContextualVersionConflict: (protobuf 3.17.3 (/usr/local/lib/python3.7/dist-packages), Requirement.parse('protobuf<5.0.0dev,>=3.19.0'), {'proto-plus'})
dataproc and apache-beam both use the same proto-plus dep, which transitively applies the same protobuf version range. The raised version conflict is probably irrelevant because the final protobuf installed is 3.20.3. 3.19.0 should satisfy all deps' version ranges; I don't know why pip picked 3.20.3 in the end.
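To make the numbers in the conflict concrete, here is a minimal sketch that checks a version string against the range quoted in the error (protobuf<5.0.0dev,>=3.19.0). The helper names `parse_version` and `in_range` are made up for illustration, and the comparison looks only at the numeric release components. That is a simplification of how pip and pkg_resources compare versions, so treat it as an approximation rather than the real resolver logic.

```python
import re


def parse_version(text, width=4):
    """Return the leading numeric release components, zero-padded to `width`.

    Hypothetical helper: suffixes such as "dev" are ignored, which is a
    simplification of full version-ordering rules.
    """
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no numeric version found in {text!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    return parts + (0,) * (width - len(parts))


def in_range(installed, minimum, upper_bound):
    """True if minimum <= installed < upper_bound on release numbers only."""
    return (
        parse_version(minimum)
        <= parse_version(installed)
        < parse_version(upper_bound)
    )


# The version named in the traceback falls below the lower bound:
print(in_range("3.17.3", "3.19.0", "5.0.0dev"))  # False
# The version the install reportedly ended with is inside the range:
print(in_range("3.20.3", "3.19.0", "5.0.0dev"))  # True
```

So the range itself is satisfiable by what pip installed; the open question in the thread is why 3.17.3 was the version seen at import time.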
pip's dependency resolver does not currently take into account all the packages that are installed
This is probably why.
So it's a problem with pip. There can be 2 solutions:
Do we know where 3.17.3 came from? The initial install says we have 3.20.3. How do we get 3.17.3 before the version conflict in the dataproc import?
I wonder if installing a newer google-cloud-dataproc will help. We are depending on 3.x, but 5.x is out now.
Alternatively, can we make the dataproc stuff separable? This example notebook is not doing anything with dataproc, but it has been broken by the introduction of DataprocClusterManager.
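One way to picture "separable" is a guarded, on-demand import. In the traceback above, the import of dataproc_v1 already sits inside a try/except ImportError, but ContextualVersionConflict is raised by pkg_resources and is not an ImportError, so it escapes that guard. The sketch below is not Beam's actual code or fix; `load_optional` is a hypothetical helper that treats any failure while importing an optional dependency as "feature unavailable".

```python
import importlib


def load_optional(module_name):
    """Import `module_name` on demand; return None if it cannot be loaded.

    Hypothetical helper. The except clause is deliberately broad: errors
    raised while an optional dependency initializes (for example a
    pkg_resources version conflict) are not ImportError subclasses.
    """
    try:
        return importlib.import_module(module_name)
    except Exception:
        return None


# Code paths that do not need the optional feature never trigger the import;
# code paths that do can check for None and report a clear error instead.
dataproc = load_optional("google.cloud.dataproc_v1")
if dataproc is None:
    print("Dataproc support unavailable; continuing without it.")
```

The trade-off is that a broad except can hide real bugs in the optional module, so logging the swallowed exception would be a sensible addition in practice.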
I know what went wrong. The example needs to be updated. Always use !pip install in Colab, not %pip install.
Note that %pip is a notebook magic (not a shell cmd) that only works on Jupyter notebook runtimes that allow multiple kernel selections. When you run %pip install, it installs the dependency into the venv of the connected IPython kernel.
This is required in products such as Beam Notebooks on Google Cloud, because the notebook runtime env is separated from the IPython kernel env.
Colab doesn't have this separation. Its runtime env is also its IPython kernel env. %pip didn't work with Colab in the past. I don't know why it's "working" now. But Colab as a notebook frontend doesn't integrate well with this magic or the Jupyter architecture, and it has messed up the package management.
Instead, use !pip, which is a shell cmd that always works in Colab, as it's a single-kernel notebook runtime.
I tried running the notebook on colab with !pip and got the same errors.
You are right, I tried it again. So the thing that fixed the example is restarting the runtime after installing the dependency.
In that case, we need to add a note in the example telling the user to restart the runtime the first time they execute the first cell, which installs apache-beam.
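A note like that could be backed by a small check in the notebook. The thread does not establish where 3.17.3 came from, so the following rests on an assumption: that the running kernel still holds a module loaded before the install, while the files on disk are newer. The helper names are hypothetical, and `importlib.metadata` requires Python 3.8 or later.

```python
import sys
from importlib import metadata


def loaded_and_installed(module_name, dist_name):
    """Return (version loaded in this process, version installed on disk).

    Either value is None when it cannot be determined: the module is not
    imported yet, has no __version__, or the distribution is not installed.
    """
    loaded = getattr(sys.modules.get(module_name), "__version__", None)
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        installed = None
    return loaded, installed


def restart_recommended(loaded, installed):
    """True when both versions are known and they differ."""
    return loaded is not None and installed is not None and loaded != installed


# Example for the package discussed in this issue:
loaded, installed = loaded_and_installed("google.protobuf", "protobuf")
if restart_recommended(loaded, installed):
    print(f"protobuf {loaded} is loaded but {installed} is installed; "
          "restart the runtime.")
```

This only detects a mismatch for modules that expose `__version__`; it does not inspect state cached by pkg_resources, which is where the conflict in the traceback was raised.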
Removing the --quiet parameter, the warning is displayed at the end of the installation:
What happened?
I tried running the DataFrame example in Google Colab, and after running the first cell:
I got the following error:
Trying to run the next cell:
also gave an error:
You should add a validation pipeline for your Google Colab examples (if you don't have one already), because those notebooks are directly referenced by the documentation, and live examples that crash may hurt user adoption.
Issue Priority
Priority: 1
Issue Component
Component: examples-python