GoogleCloudPlatform / python-docs-samples

Code samples used on cloud.google.com
Apache License 2.0

Python flex template Dockerfile #4307

Closed adsee42 closed 3 years ago

adsee42 commented 4 years ago

In which file did you encounter the issue?

https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/dataflow/flex-templates/streaming_beam/Dockerfile

Describe the issue

The last line of the above Dockerfile is RUN pip install -U -r ./requirements.txt.

Why is this line needed?

According to this document, shouldn't we create a setup.py file and set the environment variable FLEX_TEMPLATE_PYTHON_SETUP_FILE to specify all Python dependencies?
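
For context, the setup.py-based approach I'm referring to would, as I understand it, look roughly like this in the Dockerfile (the WORKDIR path is just an example, not the sample's actual layout):

ENV WORKDIR=/dataflow/template
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
# No RUN pip install -U -r ./requirements.txt here; dependencies would be
# declared in setup.py's install_requires instead.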

adsee42 commented 4 years ago

I removed that line from the Dockerfile, and now it complains that apache_beam is missing:

"Using launch args: [/dataflow/template/main.py --requirements_file=/dataflow/template/requirements.txt --setup_file=/dataflow/template/setup.py --job_name=test --region=asia-northeast1 --service_account_email=976673646273-compute@developer.gserviceaccount.com --runner=DataflowRunner --project=catchtherainbow --template_location=gs://active_data/2020/07/16/test/staging/template_launches/2020-07-16_07_08_11-11717246933709634581/job_object --temp_location=gs://active_data/2020/07/16/test/temp/ --staging_location=gs://active_data/2020/07/16/test/staging/ --input=gs://active_data/2020/07/16/test/source/test-*.txt --mecab-params=-Owakati -N2 --output=gs://active_data/2020/07/16/test/sink/]"

"Executing: python /dataflow/template/main.py --requirements_file=/dataflow/template/requirements.txt --setup_file=/dataflow/template/setup.py --job_name=test --region=asia-northeast1 --service_account_email=976673646273-compute@developer.gserviceaccount.com --runner=DataflowRunner --project=catchtherainbow --template_location=gs://active_data/2020/07/16/test/staging/template_launches/2020-07-16_07_08_11-11717246933709634581/job_object --temp_location=gs://active_data/2020/07/16/test/temp/ --staging_location=gs://active_data/2020/07/16/test/staging/ --input=gs://active_data/2020/07/16/test/source/test-*.txt --mecab-params=-Owakati -N2 --output=gs://active_data/2020/07/16/test/sink/"

"Traceback (most recent call last):"

" File "/dataflow/template/main.py", line 4, in <module>"

" import apache_beam as beam"

"ImportError: No module named 'apache_beam'"

"python failed with exit status 1"

"Template launch failed: exit status 1"

Does this mean that for a Flex Template I don't need to specify dependencies in setup.py, and can instead install all dependencies in the Dockerfile?

And can the FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE environment variable in the Dockerfile be removed too?

davidcavazos commented 4 years ago

Hi, this is a great question. I believe the process that launches the pipeline needs to have the requirements installed, which is why it runs pip install -U -r ./requirements.txt. The FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE might then be needed for setting up the workers.

@manavgarg is this correct? Could it be simplified so that we don't have to manually run pip install -U -r ./requirements.txt, since we already know the requirements.txt file location?

manavgarg commented 4 years ago

So, all three methods are equivalent: directly running pip install apache-beam[gcp]==2.22.0, or specifying it as part of requirements.txt or setup.py. However, Beam is already installed on the workers, so specifying it in requirements.txt leads to it being installed again on the workers later, which makes the job take more time. We have seen cases where jobs time out because the workers are trying to install Beam again (which is redundant).

What we are recommending is to install Beam directly using pip in the Dockerfile and, if there are any other dependencies, to use requirements.txt or setup.py for them.
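
A minimal Dockerfile sketch of that recommendation, assuming the launcher base image and file layout used by the sample (the image name, paths, and Beam version here are illustrative, not the exact sample contents):

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

ENV WORKDIR=/dataflow/template
WORKDIR ${WORKDIR}
COPY requirements.txt setup.py main.py ${WORKDIR}/

ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="${WORKDIR}/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"

# Install Beam directly so the launcher has it, without listing it in
# requirements.txt (the workers already ship with Beam preinstalled).
RUN pip install 'apache-beam[gcp]==2.22.0'
# Install the remaining (non-Beam) dependencies for the launcher process.
RUN pip install -U -r ./requirements.txt

With this layout, requirements.txt would contain only the non-Beam dependencies.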

adsee42 commented 4 years ago

Please check if I got this right:

  1. for Beam, install it directly with RUN pip install apache-beam[gcp] in the Dockerfile (required for launching the template, but already installed on the workers that actually execute the job),
  2. for other dependencies available from pip, specify them in requirements.txt (Beam excluded),
  3. for dependencies that require apt-get, or other files (.json, .log, ...), use setup.py

It would be very helpful if there were some documentation about which file actually does what and when each file is used.
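
My rough understanding of when each piece takes effect (please correct me if this is wrong), written as Dockerfile comments:

# RUN pip install ...                     -> installs into the launcher container,
#                                            used while the template is launched.
# FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE  -> passed to the pipeline as
#                                            --requirements_file and installed on
#                                            the workers when the job runs.
# FLEX_TEMPLATE_PYTHON_SETUP_FILE         -> passed as --setup_file; the package
#                                            is built and installed on the workers.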

manavgarg commented 4 years ago

Hi @adsee42, you are spot on; this is the right behavior. I will communicate this to the team so the documentation gets updated.

davidcavazos commented 4 years ago

Hi @rosetn, this might be worth clarifying in the docs as an informational note.

davidcavazos commented 4 years ago

@manavgarg in the case of this sample, requirements.txt includes apache-beam[gcp], which is installed via pip install -r requirements.txt in the Dockerfile, and it is also pointed to by FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE. Does this mean the sample should be updated to remove apache-beam from requirements.txt (maybe keeping just a comment with an explanation) and to install apache-beam via pip directly in the Dockerfile?

manavgarg commented 4 years ago

Hi @davidcavazos, like I mentioned, all three approaches are equivalent, and having it in requirements.txt or installing it directly in the Dockerfile should be similar in functionality. We do recommend installing it directly in the Dockerfile for performance reasons. I can remove it from requirements.txt and add it to the Dockerfile, although, for the purposes of the example, it might make sense to keep the requirements.txt structure. What do you think? (Also, keeping an empty requirements.txt might look a bit strange.)

adsee42 commented 4 years ago

According to Managing Python Pipeline Dependencies, we can even

get rid of the requirements.txt file and instead, add all packages contained in requirements.txt to the install_requires field of the setup call

However, I created both files and specified their paths in the Dockerfile:

ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="${WORKDIR}/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"

The dependency I added is mecab-python3, which I added to both files. This Python library depends on packages that can be installed with apt-get install mecab, and it is imported with import MeCab (upper-case M and C).

Then I created a job from the Dataflow console, and it complains about no module named MeCab.

Log message before the above error (I added line breaks and removed some information to make it easier to read):

"Executing: 
python /mecab/main.py 
--requirements_file=/mecab/requirements.txt 
--setup_file=/mecab/setup.py 
--service_account_email=*** 
--output=*** 
--job_name=mecab-test-072101 
--region=asia-northeast1 
--template_location=gs://active_data/2020/07/16/test/staging/template_launches/2020-07-21_01_46_14-10867249433501164576/job_object 
--temp_location=*** --staging_location=*** 
--input=*** 
--input-headers=*** 
--mecab-params=*** 
--runner=DataflowRunner 
--project=***"

The setup.py is based on this file

adsee42 commented 4 years ago

I've tried every pattern, and my conclusions are:

  1. in the Dockerfile, every requirement needs to be installed, otherwise there is an import error.
  2. requirements.txt can be used in the Dockerfile to install dependencies.
  3. but setup.py may be a better solution than requirements.txt, since there can be non-Python dependencies that need to be installed with apt-get.
  4. the FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE environment variable can be removed, since I use setup.py to manage dependencies.

I wonder which one gets used first, requirements.txt or setup.py?
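
For reference, a Dockerfile along the lines of this conclusion might look as follows. This is only a sketch: the base image, apt package names, and the assumption that the base image supports apt-get are mine, and mecab-python3 is just the example dependency from my earlier comment; paths follow the /mecab layout from my logs:

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

ENV WORKDIR=/mecab
WORKDIR ${WORKDIR}
COPY setup.py main.py ${WORKDIR}/

ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
# FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE is omitted; setup.py manages dependencies.

# Non-Python dependency for the launcher container (assumed apt package names).
RUN apt-get update && apt-get install -y mecab libmecab-dev \
    && rm -rf /var/lib/apt/lists/*

# Every Python requirement the launcher needs, Beam included.
RUN pip install 'apache-beam[gcp]==2.22.0' mecab-python3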

rosetn commented 4 years ago

If we decide to make documentation changes, tag me in any updates to the tutorial Dockerfile.

https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/dataflow/flex-templates/streaming_beam/Dockerfile

fhinkel commented 3 years ago

Greetings, we're closing this. Looks like the issue got resolved. Please let us know if the issue needs to be reopened.