apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: Google Colab DataFrame example crashes with dependency conflict #23599

Open FurcyPin opened 2 years ago

FurcyPin commented 2 years ago

What happened?

I tried running the DataFrame example in Google Colab, and after running the first cell:

%pip install --quiet apache-beam[interactive,dataframe]

I got the following error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.8.2+zzzcolab20220929150707 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.3 which is incompatible.
google-colab 1.0.0 requires ipykernel~=5.3.4, but you have ipykernel 6.16.0 which is incompatible.
google-colab 1.0.0 requires ipython~=7.9.0, but you have ipython 7.34.0 which is incompatible.
google-colab 1.0.0 requires tornado~=5.1.0, but you have tornado 6.2 which is incompatible.

Trying to run the next cell:

import apache_beam as beam
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

pipeline = beam.Pipeline(InteractiveRunner())

# Create a deferred Beam DataFrame with the contents of our csv file.
beam_df = pipeline | 'Read CSV' >> beam.dataframe.io.read_csv('solar_events.csv')

# We can use `ib.collect` to view the contents of a Beam DataFrame.
ib.collect(beam_df)

also gave an error:

ContextualVersionConflict                 Traceback (most recent call last)
[<ipython-input-4-df06c7cf7bd0>](https://localhost:8080/#) in <module>
      1 import apache_beam as beam
----> 2 import apache_beam.runners.interactive.interactive_beam as ib
      3 from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
      4 
      5 pipeline = beam.Pipeline(InteractiveRunner())

10 frames
[/usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py](https://localhost:8080/#) in resolve(self, requirements, env, installer, replace_conflicting, extras)
    775                 # Oops, the "best" so far conflicts with a dependency
    776                 dependent_req = required_by[req]
--> 777                 raise VersionConflict(dist, req).with_context(dependent_req)
    778 
    779             # push the new requirements onto the stack

ContextualVersionConflict: (protobuf 3.17.3 (/usr/local/lib/python3.7/dist-packages), Requirement.parse('protobuf<5.0.0dev,>=3.19.0'), {'proto-plus'})

You should add a validation pipeline for your Google Colab examples (if you don't have one already), because those notebooks are directly referenced by the documentation, and live examples that crash may hurt user adoption.

Issue Priority

Priority: 1

Issue Component

Component: examples-python

damccorm commented 2 years ago

@TheNeuralBit FYI

TheNeuralBit commented 2 years ago

Thanks @FurcyPin and @damccorm.

Note we already have #22659 for exercising these notebooks continuously. I don't think that would have caught this particular issue, because this is a version conflict with google-colab. We might separately test that we don't have version conflicts with colab. Let's use this issue just to track this particular incompatibility; we can consider filing a follow-up for that testing.

For this particular incompatibility, the dependencies in question are from the interactive extra. @KevinGG can we expand the version ranges on these dependencies to be compatible with the google-colab deps?
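
For reference, here is a rough sketch of the kind of relaxation being discussed. The extra name is real, but the packages listed and the ranges shown are illustrative assumptions chosen to overlap with google-colab's pins (ipykernel~=5.3.4, ipython~=7.9.0), not Beam's actual constraints:

# Hypothetical setup.py fragment, for illustration only -- not Beam's real extras_require.
extras_require = {
    'interactive': [
        # widened to include the ipykernel 5.x / ipython 7.9.x that google-colab pins
        'ipykernel>=5.3,<7',
        'ipython>=7.9,<8',
        # ... remaining interactive dependencies unchanged ...
    ],
}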

TheNeuralBit commented 2 years ago

Note that these colab-tools deps are very out of date. There are some issues filed against that project to address it, e.g. https://github.com/googlecolab/colabtools/issues/2230

nika-qubit commented 2 years ago

I would +1 asking colab-tools to update their dependencies, since ipykernel v6+ now supports the Jupyter native debugger.

ipython v7 and v8 both work with ipykernel v5+. The only issue is that ipython v7 is not backward compatible with v6. I suggest they only need to upgrade their ipython dep, and then they can start using ipykernel v6 without any compatibility issues.

TheNeuralBit commented 2 years ago

Is Jupyter native debugger support the only reason for the ipykernel>6 requirement? That seems like a dev requirement, not something we need to impose on our users.

It would be nice to get this fixed in google-colab since it will fix already released SDKs, but that issue has been open for a year now. It would be good to try to mitigate from our end.

TheNeuralBit commented 1 year ago

Ah, actually the ipykernel version conflict is non-blocking; apache-beam is still installed (we should still address this though, since it's disconcerting, thanks for #23599).

The other issue, with protobuf, is a hard blocker though. Here is the full stack trace:

[/usr/local/lib/python3.7/dist-packages/apache_beam/runners/interactive/interactive_beam.py](https://localhost:8080/#) in <module>
     49 from apache_beam.options.pipeline_options import FlinkRunnerOptions
     50 from apache_beam.runners.interactive import interactive_environment as ie
---> 51 from apache_beam.runners.interactive.dataproc.dataproc_cluster_manager import DataprocClusterManager
     52 from apache_beam.runners.interactive.dataproc.types import ClusterIdentifier
     53 from apache_beam.runners.interactive.dataproc.types import ClusterMetadata

[/usr/local/lib/python3.7/dist-packages/apache_beam/runners/interactive/dataproc/dataproc_cluster_manager.py](https://localhost:8080/#) in <module>
     30 
     31 try:
---> 32   from google.cloud import dataproc_v1
     33   from apache_beam.io.gcp import gcsfilesystem  #pylint: disable=ungrouped-imports
     34 except ImportError:

[/usr/local/lib/python3.7/dist-packages/google/cloud/dataproc_v1/__init__.py](https://localhost:8080/#) in <module>
     15 #
     16 
---> 17 from .services.autoscaling_policy_service import AutoscalingPolicyServiceClient
     18 from .services.autoscaling_policy_service import AutoscalingPolicyServiceAsyncClient
     19 from .services.batch_controller import BatchControllerClient

[/usr/local/lib/python3.7/dist-packages/google/cloud/dataproc_v1/services/autoscaling_policy_service/__init__.py](https://localhost:8080/#) in <module>
     14 # limitations under the License.
     15 #
---> 16 from .client import AutoscalingPolicyServiceClient
     17 from .async_client import AutoscalingPolicyServiceAsyncClient
     18 

[/usr/local/lib/python3.7/dist-packages/google/cloud/dataproc_v1/services/autoscaling_policy_service/client.py](https://localhost:8080/#) in <module>
     35 from google.cloud.dataproc_v1.services.autoscaling_policy_service import pagers
     36 from google.cloud.dataproc_v1.types import autoscaling_policies
---> 37 from .transports.base import AutoscalingPolicyServiceTransport, DEFAULT_CLIENT_INFO
     38 from .transports.grpc import AutoscalingPolicyServiceGrpcTransport
     39 from .transports.grpc_asyncio import AutoscalingPolicyServiceGrpcAsyncIOTransport

[/usr/local/lib/python3.7/dist-packages/google/cloud/dataproc_v1/services/autoscaling_policy_service/transports/__init__.py](https://localhost:8080/#) in <module>
     17 from typing import Dict, Type
     18 
---> 19 from .base import AutoscalingPolicyServiceTransport
     20 from .grpc import AutoscalingPolicyServiceGrpcTransport
     21 from .grpc_asyncio import AutoscalingPolicyServiceGrpcAsyncIOTransport

[/usr/local/lib/python3.7/dist-packages/google/cloud/dataproc_v1/services/autoscaling_policy_service/transports/base.py](https://localhost:8080/#) in <module>
     31 try:
     32     DEFAULT_CLIENT_INFO = gapic_v1.client_info.ClientInfo(
---> 33         gapic_version=pkg_resources.get_distribution("google-cloud-dataproc",).version,
     34     )
     35 except pkg_resources.DistributionNotFound:

[/usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py](https://localhost:8080/#) in get_distribution(dist)
    464         dist = Requirement.parse(dist)
    465     if isinstance(dist, Requirement):
--> 466         dist = get_provider(dist)
    467     if not isinstance(dist, Distribution):
    468         raise TypeError("Expected string, Requirement, or Distribution", dist)

[/usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py](https://localhost:8080/#) in get_provider(moduleOrReq)
    340     """Return an IResourceProvider for the named module or requirement"""
    341     if isinstance(moduleOrReq, Requirement):
--> 342         return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
    343     try:
    344         module = sys.modules[moduleOrReq]

[/usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py](https://localhost:8080/#) in require(self, *requirements)
    884         included, even if they were already activated in this working set.
    885         """
--> 886         needed = self.resolve(parse_requirements(requirements))
    887 
    888         for dist in needed:

[/usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py](https://localhost:8080/#) in resolve(self, requirements, env, installer, replace_conflicting, extras)
    775                 # Oops, the "best" so far conflicts with a dependency
    776                 dependent_req = required_by[req]
--> 777                 raise VersionConflict(dist, req).with_context(dependent_req)
    778 
    779             # push the new requirements onto the stack

ContextualVersionConflict: (protobuf 3.17.3 (/usr/local/lib/python3.7/dist-packages), Requirement.parse('protobuf<5.0.0dev,>=3.19.0'), {'proto-plus'})

It looks like this is happening when importing DataprocClusterManager.

nika-qubit commented 1 year ago

ContextualVersionConflict: (protobuf 3.17.3 (/usr/local/lib/python3.7/dist-packages), Requirement.parse('protobuf<5.0.0dev,>=3.19.0'), {'proto-plus'})

dataproc and apache-beam both use the same proto-plus dep, which transitively applies the same protobuf version range. The raised version conflict is probably irrelevant because the final protobuf installed is 3.20.3. 3.19.0 should satisfy all deps' version ranges; I don't know why pip picked 3.20.3 in the end.
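
A hedged diagnostic sketch (not part of the example notebook) to see, from inside the Colab kernel, which installed distributions constrain protobuf and which protobuf version is actually importable:

# List every installed distribution whose requirements mention protobuf,
# plus the protobuf version the running kernel actually imports.
import pkg_resources

for dist in pkg_resources.working_set:
    try:
        reqs = dist.requires()
    except Exception:
        continue  # skip distributions with unreadable metadata
    for req in reqs:
        if req.project_name.lower() == 'protobuf':
            print(f"{dist.project_name} {dist.version} requires {req}")

import google.protobuf
print("importable protobuf version:", google.protobuf.__version__)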

nika-qubit commented 1 year ago

pip's dependency resolver does not currently take into account all the packages that are installed

This is probably why.

So it's a problem with pip. There can be 2 solutions:

TheNeuralBit commented 1 year ago

Do we know where 3.17.3 came from? The initial install says we have 3.20.3. How do we get 3.17.3 before the version conflict in the dataproc import?
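
One hedged way to probe that from inside the notebook is to compare what the running kernel's pkg_resources snapshot reports, what the imported module reports, and what a fresh pip process reports (illustrative diagnostic only):

# Compare the three views of protobuf's version in the same Colab session.
import pkg_resources
print("pkg_resources snapshot:", pkg_resources.get_distribution("protobuf").version)

import google.protobuf
print("imported module:", google.protobuf.__version__)

import subprocess, sys
print(subprocess.run([sys.executable, "-m", "pip", "show", "protobuf"],
                     capture_output=True, text=True).stdout)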

I wonder if installing a newer google-cloud-dataproc will help. We are depending on 3.x, but 5.x is out now.

Alternatively, can we make the dataproc stuff separable? This example notebook is not doing anything with dataproc, but it has been broken by the introduction of DataprocClusterManager.
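
One observation from the traceback above: dataproc_cluster_manager.py already wraps the import in a try/except, but it only catches ImportError, while what pkg_resources raises here is a ContextualVersionConflict, so the guard does not help. A hedged sketch of the kind of mitigation that suggests (not Beam's actual code):

# Treat a version conflict like a missing optional dependency, so that interactive
# pipelines that never touch Dataproc keep working.
import pkg_resources

try:
    from google.cloud import dataproc_v1  # optional: only needed for Dataproc-backed clusters
except (ImportError, pkg_resources.VersionConflict):
    dataproc_v1 = None  # Dataproc support unavailable; the interactive runner itself is still usable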

nika-qubit commented 1 year ago

I know what went wrong. The example needs to be updated. Always use !pip install in Colab, not %pip install.

Note that %pip is a notebook magic (not a shell command) that only works on Jupyter notebook runtimes that allow multiple kernel selections. When you run %pip install, it installs the dependency into the venv of the connected IPython kernel.

This is required in products such as Beam Notebooks on Google Cloud, because there the notebook runtime env is separate from the IPython kernel env.

Colab doesn't have this separation: its runtime env is also its IPython kernel env. %pip didn't work with Colab in the past; I don't know why it's "working" now. But Colab as a notebook frontend doesn't integrate well with this magic or with the Jupyter architecture, and it messes up package management.

Instead, use !pip, which is a shell command that always works in Colab since it's a single-kernel notebook runtime.
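
Roughly what the two spellings do (a simplification; the exact behaviour depends on the IPython version):

# %pip install pkg  -> runs pip inside the connected kernel's own environment,
#                      approximately: sys.executable -m pip install pkg
# !pip install pkg  -> shells out to whatever `pip` is first on PATH, which may
#                      or may not be the environment the kernel is running in.
import sys
print(sys.executable)  # the interpreter the connected kernel is actually using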

TheNeuralBit commented 1 year ago

I tried running the notebook on Colab with !pip and got the same errors.

nika-qubit commented 1 year ago

You are right, I tried it again. So the thing that fixed the example was restarting the runtime after installing the dependency.

In that case, we need to add a note in the example telling the user to restart the runtime after the first execution of the cell that installs apache-beam.
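
A sketch of what the updated first cell could say (illustrative wording, not the shipped notebook):

# Install Beam with the interactive and dataframe extras.
!pip install --quiet apache-beam[interactive,dataframe]

# NOTE: the first time this runs in a Colab session, restart the runtime
# (Runtime > Restart runtime) before executing the `import apache_beam` cell,
# so the kernel picks up the newly installed protobuf and related packages.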

nika-qubit commented 1 year ago

With the --quiet parameter removed, the warning is displayed at the end of the installation (screenshot attached).