adsee42 closed this issue 3 years ago
I removed that line from the Dockerfile, and now it complains that there is no apache_beam module:
"Using launch args: [/dataflow/template/main.py --requirements_file=/dataflow/template/requirements.txt --setup_file=/dataflow/template/setup.py --job_name=test --region=asia-northeast1 --service_account_email=976673646273-compute@developer.gserviceaccount.com --runner=DataflowRunner --project=catchtherainbow --template_location=gs://active_data/2020/07/16/test/staging/template_launches/2020-07-16_07_08_11-11717246933709634581/job_object --temp_location=gs://active_data/2020/07/16/test/temp/ --staging_location=gs://active_data/2020/07/16/test/staging/ --input=gs://active_data/2020/07/16/test/source/test-*.txt --mecab-params=-Owakati -N2 --output=gs://active_data/2020/07/16/test/sink/]"
"Executing: python /dataflow/template/main.py --requirements_file=/dataflow/template/requirements.txt --setup_file=/dataflow/template/setup.py --job_name=test --region=asia-northeast1 --service_account_email=976673646273-compute@developer.gserviceaccount.com --runner=DataflowRunner --project=catchtherainbow --template_location=gs://active_data/2020/07/16/test/staging/template_launches/2020-07-16_07_08_11-11717246933709634581/job_object --temp_location=gs://active_data/2020/07/16/test/temp/ --staging_location=gs://active_data/2020/07/16/test/staging/ --input=gs://active_data/2020/07/16/test/source/test-*.txt --mecab-params=-Owakati -N2 --output=gs://active_data/2020/07/16/test/sink/"
"Traceback (most recent call last):"
" File "/dataflow/template/main.py", line 4, in <module>"
" import apache_beam as beam"
"ImportError: No module named 'apache_beam'"
"python failed with exit status 1"
"Template launch failed: exit status 1"
Does this mean that, for Flex Templates, I don't need to specify dependencies in setup.py, and can instead install all dependencies with the Dockerfile?
And can the environment variable FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE in the Dockerfile be removed too?
Hi, this is a great question. I believe the process that launches the pipeline needs to have the requirements installed, and that's why it runs pip install -U -r ./requirements.txt. The FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE variable might then be needed for setting up the workers.
@manavgarg is this correct? Could it be simplified so that we don't have to manually run pip install -U -r ./requirements.txt, since we already know the requirements.txt file location?
So, all three methods are equivalent: running "pip install apache-beam[gcp]==2.22.0" directly, or specifying it in requirements.txt or setup.py. However, Beam is already installed on the workers, so specifying it in requirements.txt causes it to be installed again on the workers later, which makes the job take more time. We have seen cases where jobs time out because the workers are trying to install Beam again (which is redundant).
What we recommend is to install Beam directly with pip in the Dockerfile and, if there are any other dependencies, to use requirements.txt or setup.py for them.
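As a rough illustration of that recommendation, the sketch below separates any apache-beam entry from the remaining lines of a requirements file, so Beam can be installed once in the Dockerfile while the rest stays in requirements.txt. The function name `split_beam_requirements` and the sample lines are hypothetical; this is only a convenience check, not part of the Flex Template tooling.

```python
import re


def split_beam_requirements(lines):
    """Separate apache-beam entries from the other requirement lines.

    Hypothetical helper: blank lines and comments are skipped.
    """
    beam, others = [], []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # The distribution name ends at the first extras, version, or marker character.
        name = re.split(r"[\[<>=!~; ]", line, maxsplit=1)[0].lower().replace("_", "-")
        if name == "apache-beam":
            beam.append(line)
        else:
            others.append(line)
    return beam, others


# Sample input (assumed contents, for illustration only).
beam, others = split_beam_requirements(
    ["apache-beam[gcp]==2.22.0", "# tokenizer", "mecab-python3", ""]
)
print("install in Dockerfile:", beam)
print("keep in requirements.txt:", others)
```

Anything returned in the first list would move to a pip install step in the Dockerfile; the second list is what the workers still need to install.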
Please check if I got this right:
RUN pip install apache-beam[gcp] in the Dockerfile (required for building the template, but already installed on the workers that actually execute the job from the template), requirements.txt (beam excluded), apt-get, or other files (.json, .log, ...), using setup.py
requirements.txt and setup.py are not used, but all files should be ADDed.
requirements.txt and setup.py are used to install dependencies on the workers
It would be very helpful if there were some documentation about what each file actually does and when it is used.
Hi @adsee42, you are spot on. This is the right behavior. I will ask the team to update the documentation.
Hi @rosetn, this might be worth clarifying in the docs as an informational note.
@manavgarg in the case of this sample, the requirements.txt includes apache-beam[gcp], which is installed via pip install -r requirements.txt in the Dockerfile, and FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE points to it as well. Does this mean the sample should be updated to remove apache-beam from the requirements.txt (maybe just leaving a comment with an explanation) and to install apache-beam via pip directly in the Dockerfile?
Hi @davidcavazos, as I mentioned, all three approaches are equivalent: having it in requirements.txt or installing it directly in the Dockerfile should be similar in functionality. We do recommend installing it directly in the Dockerfile for performance reasons. I can remove it from requirements.txt and add it to the Dockerfile. Although, for the purpose of the example, it might make sense to keep the requirements.txt structure. What do you think? (Also, keeping an empty requirements.txt might look a bit strange.)
According to Managing Python Pipeline Dependencies, we can even "get rid of the requirements.txt file and instead, add all packages contained in requirements.txt to the install_requires field of the setup call".
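For readers unfamiliar with that approach, a minimal setup.py along those lines might look like the sketch below. The package name `mecab_pipeline`, the version, and the contents of `REQUIRED_PACKAGES` are assumptions for illustration; apache-beam is left out on the assumption that it is installed separately in the Dockerfile, as recommended earlier in this thread.

```python
import sys

# Assumed dependency list: whatever would otherwise go in requirements.txt,
# minus apache-beam (assumed to be installed directly in the Dockerfile).
REQUIRED_PACKAGES = ["mecab-python3"]


def setup_kwargs():
    """Arguments for setuptools.setup(); name and version are hypothetical."""
    return {
        "name": "mecab_pipeline",
        "version": "0.0.1",
        "install_requires": REQUIRED_PACKAGES,
    }


def main():
    # Imported here so the file can be inspected without setuptools installed.
    import setuptools

    setuptools.setup(packages=setuptools.find_packages(), **setup_kwargs())


# Only invoke setuptools when a command (for example "sdist") is given.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

The path to a file like this is what FLEX_TEMPLATE_PYTHON_SETUP_FILE would point at.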
However, I created both files, and specified their path in Dockerfile:
ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="${WORKDIR}/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
The dependency I added is mecab-python3, and I added it to both files. This Python library depends on packages that can be installed with apt-get install mecab. I can use the library with import MeCab (upper case M and C).
Then I created a job from the Dataflow console, and it complains that there is no module named MeCab.
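One way to narrow down an error like this is to check, inside the container image, whether the import names the pipeline needs are actually importable. The sketch below is only a diagnostic idea, not something the Flex Template launcher provides; `missing_modules` is a hypothetical helper. Note that the import name (MeCab) differs from the pip package name (mecab-python3), so the check uses import names.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Top-level import names used by the pipeline in this thread.
missing = missing_modules(["apache_beam", "MeCab"])
if missing:
    print("Not importable here:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Running this in the image built from the Dockerfile shows only what the launcher environment can import; the workers install their dependencies separately, so a clean result here does not guarantee the same on the workers.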
Log message before the above error (I added line breaks and removed some information to make it easier to read):
"Executing:
python /mecab/main.py
--requirements_file=/mecab/requirements.txt
--setup_file=/mecab/setup.py
--service_account_email=***
--output=***
--job_name=mecab-test-072101
--region=asia-northeast1
--template_location=gs://active_data/2020/07/16/test/staging/template_launches/2020-07-21_01_46_14-10867249433501164576/job_object
--temp_location=*** --staging_location=***
--input=***
--input-headers=***
--mecab-params=***
--runner=DataflowRunner
--project=***"
The setup.py is based on this file
I've tried every pattern and my conclusions are:
requirements.txt can be used in the Dockerfile to install dependencies.
setup.py may be a better solution than requirements.txt, because there could be some non-Python dependencies that need to be installed with apt-get.
The FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE environment variable could be removed, since I use setup.py to manage dependencies.
I wonder which one gets used first, requirements.txt or setup.py?
If we decide to make documentation changes, tag me in any updates to the tutorial Dockerfile.
Greetings, we're closing this. Looks like the issue got resolved. Please let us know if the issue needs to be reopened.
In which file did you encounter the issue?
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/dataflow/flex-templates/streaming_beam/Dockerfile
Describe the issue
The last line of the above Dockerfile is RUN pip install -U -r ./requirements.txt. Why is this line needed?
According to this document, shouldn't we create a setup.py file and set the environment variable FLEX_TEMPLATE_PYTHON_SETUP_FILE to specify all Python dependencies?