bitnami / charts

Bitnami Helm Charts
https://bitnami.com
Other
8.87k stars 9.16k forks source link

[bitnami/airflow] module not found from plugins or nested modules #5647

Closed zaistev closed 3 years ago

zaistev commented 3 years ago

Which chart: chart: airflow-8.0.3
app version 2.0.1

Describe the bug I am currently using git-sync to fetch dags and plugins, It does fetch the code from a private git repository. However when I try to import a module from nested module it doesnt work. Ive checked this issue 2305, so I added init.py file to each sub folder to be recognized as python module, and changed the import structure to src.dags.plugins. it didnt work either.

it still throws an error. Is the current folder structure related to this issue? I am trying to figure out how to import from plugins (other git-repo) and from nested modules. I am missing sth in Dockerfile (currently empty).

Error

Broken DAG: [/opt/bitnami/airflow/dags/git-dataflow/data_pipeline_dag.py] Traceback (most recent call last): File "", line 219, in _call_with_frames_removed File "/opt/bitnami/airflow/dags/git-dataflow/data_pipeline_dag.py", line 9, in import dags.plugins as plugins ModuleNotFoundError: No module named 'dags'

values.yml

dags:

Enable in order to download DAG files from git repositories.

##

enabled: true repositories:

my current folder structure . ├── Dockerfile ├── README.md ├── init.py ├── requirements.txt ├── src │   ├── init.py │   └── dags │   ├── init.py │   ├── data_pipeline_dag.py │   └── plugins │   ├── init.py │   ├── operators │   │   └── init.py │   ├── tasks.py │   └── utils.py └── tests `

Expected behavior

Version of Helm and Kubernetes:

version.BuildInfo{Version:"v3.5.2", GitCommit:"167aac70832d3a384f65f9745335e9fb40169dc2", GitTreeState:"dirty", GoVersion:"go1.15.7"}

Additional context using KubernetesExecutor

javsalgar commented 3 years ago

Hi,

Are your dags located in a public repository that we could use to reproduce the issue?

zaistev commented 3 years ago

hi @javsalgar thanks for the reply. I published this repo with same structure https://github.com/zaistev/example-dags/tree/master/src/dags

dags:

Enable in order to download DAG files from git repositories.

##
enabled: true
repositories:
  - repository: https://github.com/zaistev/example-dags.git
    branch: master
    **name: dags**
    path: "src/dags"

Broken DAG: [/opt/bitnami/airflow/dags/git-dags/data_pipeline_dag.py] Traceback (most recent call last): File "", line 219, in _call_with_frames_removed File "/opt/bitnami/airflow/dags/git-dags/data_pipeline_dag.py", line 8, in from plugins.tasks import (fetch_historical_to_s3, process_data, store_data, ModuleNotFoundError: No module named 'plugins'

unfortunately, it still throws an error. if I try to do what #2305 says in terms of module naming logic, then i should use the "git-dags" name.

so I did the following dags = importlib.import_module("git-dags")

then to import the right func, my change led to. dags.plugins.tasks.fetch_historical_to_s3

which also throws an error

Broken DAG: [/opt/bitnami/airflow/dags/git-dags/data_pipeline_dag.py] Traceback (most recent call last): File "", line 219, in _call_with_frames_removed File "/opt/bitnami/airflow/dags/git-dags/data_pipeline_dag.py", line 31, in python_callable=dags.plugins.tasks.fetch_historical_to_s3, AttributeError: module 'git-dags' has no attribute 'plugins'

my assumptions are that cant be folder structure, otherwise wouldnt read at all. same for naming conventions or executor, I tried celery and k8s and got same error. perhaps definition of pythonpath.

zaistev commented 3 years ago

there is a workaround to this problem in the dag file, just manually import the path, ugly af but works.

import sys sys.path sys.path.append('/opt/bitnami/airflow/git-<git.plugins.repositories[0].name>')

javsalgar commented 3 years ago

Hi,

Wow, this is strange. So, if I understand correctly, if in the chart you specify the extraEnvVars values (there is one for each component) with PYTHONPATH (and adding the proper dirs), then it should work, right? Just to confirm that this is enough and with that you don't have to use the sys.path.append function.

zaistev commented 3 years ago

@javsalgar in my dag file "data_pipeline_dag.py" I added these lines:

import sys sys.path sys.path.append('/opt/bitnami/airflow/git-dataflow-plugins')

"git-dataflow-plugins" is a reference to the name definition in the yml file for git.plugins.repositories[0].name

plugins:

Enable in order to download Plugins files from git repositories.

##
enabled: true
repositories: 
  - repository: ...
    branch: ...
    name: **dataflow-plugins**
    path: "src/plugins"

from your comments, instead of doing the manual import and appending the dir. is it possible to do it in the extraEnvVars? shouldnt be automatic?

Cheers

javsalgar commented 3 years ago

Hi,

I'm not sure if the structure of this DAG is different (in our tests we import a DAG have no issues) and that's why it requires extra changes like the one you were doing. The first to do is to confirm that this works

extraEnvVars:
  - name: PYTHONPATH
    value: "/opt/bitnami/airflow/git-dataflow-plugins"
jvdgoltz commented 3 years ago

I have a similar error for a plugin ModuleNotFoundError: No module named 'PLUGIN_NAME', the PLUGIN_NAME.py is inside REPO_NAME/airflow/plugins in our repository.

After describing and sshing into the airflow webserver pod I think the reason is that the plugins folder is wrongly mounted: kubectl describe pod airflow-web-YADATADA -n airflow gives me under volume mounts:

So I will circumvent the issue this by adjusting AIRFLOW__CORE__PLUGINS_FOLDER but that is obviously only fighting a symptom. But the proper fix should be easy to do. Looking forward to hear an update :)

park-peter commented 3 years ago

Setting the PYTHONPATH in the extraEnvVars won't work. I found that the issue is mitigated when you SSH into the deployment and run airflow db upgrade. Since airflow db upgrade or airflow upgradedb is idempotent, can we somehow run the command everytime the git is pulled using git-sync? Not sure if the problem underlies with git-sync not being part of initContainer and the plugins folder is being initialized before the git-sync first runs? Just throwing bunch of ideas.

javsalgar commented 3 years ago

Hi,

There's something I don't understand about it. I see that there is a first git clone so the git sync should not be related, right? Not sure If I'm missing anything. By the way, @miguelaeh is working on transitioning the airflow containers to bash, so I will ping him so he can check if the new version of the container fixes the issue.

park-peter commented 3 years ago

Thank you. Running airflow db upgrade was not appropriate as a long term solution as we may be doing multiple helm upgrades or docker image update and the plugins would fall off everytime. I fixed it by including git-sync within the initContainer of the deployments.

jonathan-foxx commented 3 years ago

Hi. I am facing similar issues. @spidman Can you please share how did you go about resolving this.

alebelucio commented 3 years ago

Hi,

In our environment we have dags and plugins from git enabled to make it work we'd to add the lines in PYTHONPATH extra environment variables like below:

  dags:
    enabled: true
    repositories:
      - repository: https://<git-repo>
        branch: master
        name: dags
        path: airflow/dags

  plugins:
    enabled: true
    repositories:
      - repository: https://<git-repo>
        branch: master
        name: plugins
        path: airflow/plugins

extraEnvVars:
  - name: PYTHONPATH
    value: "/opt/bitnami/airflow/dags/git-dags:/opt/bitnami/airflow/git-plugins"
  - name: PYSPARK_PYTHON
    value: "/opt/bitnami/airflow/venv/bin/python"
javsalgar commented 3 years ago

Hi,

I think we should consider setting the PYTHONPATH variable by default. Any thoughts on this?

lukasstankiewicz commented 3 years ago

@javsalgar please do; as you need do it manually for dags & plugins; or just add it into documentation that is needed

javsalgar commented 3 years ago

Ok, I will forward this to the engineering team

pgrandjean commented 3 years ago

same problem, even with PYTHONPATH set as extraEnvVars

javsalgar commented 3 years ago

Could there be another missing path to load? Which is the exact error?

ShawnMcGough commented 3 years ago

Experiencing this bug as well. The path is missing the plugins segment, so Airflow can't find the plugins that are being sync'ed.

Should be: /opt/bitnami/airflow/plugins/git_{deployname} But Bitnami is putting it here (one directory too shallow): /opt/bitnami/airflow/git_{deployname}

For the git DAG sync, the path is correct: /opt/bitnami/airflow/dags/git_{deployname}

I tried creating a symlink to fix, but Airflow did not like that at all (didn't follow the symlink once loaded).

This next thing didn't work for me - since the path is still incorrect:

extraEnvVars:
  - name: PYTHONPATH
    value: "/opt/bitnami/airflow/dags/git-dags:/opt/bitnami/airflow/git-plugins"

perhaps this would work?

extraEnvVars:
  - name: PYTHONPATH
    value: "/opt/bitnami/airflow/dags/git-{deployname}:/opt/bitnami/airflow/git-{deployname}"

I just created a custom image for now with the plugin I needed to proceed. It would be nice to have a fix to enable git sync again.

ShawnMcGough commented 3 years ago

@javsalgar I've located the bug in the git template file _git_helpers.tpl. Please forward to engineering team - should be quick fix.

https://github.com/bitnami/charts/blob/341547ae17191b577353fa254f1743f34447be25/bitnami/airflow/templates/_git_helpers.tpl#L45

Should be:

  mountPath: /opt/bitnami/airflow/plugins/git_{{ include "airflow.git.repository.name" . }}
javsalgar commented 3 years ago

Thanks for letting us know! As you spotted the issue, would you like to contribute with a PR?