Building-ML-Pipelines / building-machine-learning-pipelines

Code repository for the O'Reilly publication "Building Machine Learning Pipelines" by Hannes Hapke & Catherine Nelson
MIT License

Data Validation - GCP Cloud DataFlow - No module named IPython #17

Status: Closed (mshearer0 closed this issue 4 years ago)

mshearer0 commented 4 years ago

I get the following error when trying to generate statistics using Dataflow:

File "/usr/local/lib/python3.7/site-packages/tensorflow_data_validation/utils/display_util.py", line 39, in <module>
    'tensorflow-data-validation[visualization]": {}'.format(e))
ImportError: To use visualization features, make sure ipython is installed, or install TFDV using "pip install tensorflow-data-validation[visualization]": No module named 'IPython'

pip list shows tensorflow==2.3.0, tensorflow-data-validation==0.23.0, and ipython==7.17.0, as per https://pypi.org/project/tensorflow-data-validation/

I'm using: tensorflow_data_validation-0.23.0-cp37-cp37m-manylinux2010_x86_64.whl

The same code works fine with the DirectRunner.
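Editor's note: the DirectRunner executes in the local Python environment, while Dataflow workers install their own set of packages, so a module that imports locally can still be missing on the workers. A minimal sketch for confirming the local side of that comparison, using only the standard library. The helper name `is_importable` is hypothetical and not part of TFDV or Beam.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found in the current environment.

    Hypothetical helper for illustration; it only inspects the local
    interpreter and says nothing about what is installed on remote workers.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Module names taken from the traceback and the pip list above.
    for name in ("IPython", "tensorflow_data_validation"):
        print(name, "importable locally:", is_importable(name))
```

If both modules report as importable locally but the Dataflow job still fails with the ImportError, the package is missing from the worker environment rather than from the local one.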

hanneshapke commented 4 years ago

@drcat101 Any suggestions?

mshearer0 commented 4 years ago

Resolved by adding the ipython wheel as an extra package:

setup_options.extra_packages = [
    './tensorflow_data_validation-0.23.0-cp37-cp37m-manylinux2010_x86_64.whl',
    'ipython-7.17.0-py3-none-any.whl']
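Editor's note: the entries in extra_packages above are local wheel files, so a typo in a path or a wheel that was never downloaded will only surface when the job is submitted. A small pre-flight check can catch that earlier. This is a sketch under stated assumptions: `missing_extra_packages` is a hypothetical helper, not a Beam API, and it only tests for local files (it does not handle gs:// paths).

```python
import os


def missing_extra_packages(paths):
    """Return the entries of `paths` that are not existing local files.

    Hypothetical helper for illustration; it does not validate file
    extensions or remote (e.g. gs://) locations.
    """
    return [p for p in paths if not os.path.isfile(p)]


if __name__ == "__main__":
    # File names copied from the comment above; adjust to the local layout.
    extra_packages = [
        "./tensorflow_data_validation-0.23.0-cp37-cp37m-manylinux2010_x86_64.whl",
        "ipython-7.17.0-py3-none-any.whl",
    ]
    missing = missing_extra_packages(extra_packages)
    if missing:
        print("Missing wheel files:", missing)
    else:
        print("All extra packages found.")
```

Running this before assigning the list to setup_options.extra_packages gives a quick signal that every wheel is in place.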

catherinenelson1 commented 4 years ago

@mshearer0 thank you for adding the solution.

hanneshapke commented 4 years ago

@mshearer0 Can you please share your statsgen setup? Are you visualizing the results from the components after the Dataflow Runner completes the execution?

mshearer0 commented 4 years ago

@hanneshapke As per Chapter 4:

import tensorflow_data_validation as tfdv

from apache_beam.options.pipeline_options import (
    PipelineOptions, GoogleCloudOptions, StandardOptions)

options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'xxx'
google_cloud_options.job_name = 'beamtfdv'
google_cloud_options.staging_location = 'gs://xxx/staging'
google_cloud_options.temp_location = 'gs://xxx/tmp'
google_cloud_options.region = 'europe-west1'
options.view_as(StandardOptions).runner = 'DataflowRunner'

from apache_beam.options.pipeline_options import SetupOptions

setup_options = options.view_as(SetupOptions)
setup_options.extra_packages = [
    'tensorflow_data_validation-0.23.0-cp37-cp37m-manylinux2010_x86_64.whl',
    'ipython-7.18.1-py3-none-any.whl']

data_set_path = 'gs://xxx/consumer-complaints.tfrecords'
output_path = 'gs://xxx/'
tfdv.generate_statistics_from_tfrecord(data_set_path,
                                       output_path=output_path,
                                       pipeline_options=options)

hanneshapke commented 4 years ago

Thank you, @mshearer0. I wonder if IPython was previously pre-installed on the Dataflow instances. I am glad the setup is working for you now.