MattTriano / analytics_data_where_house

An analytics engineering sandbox focusing on real estate prices in Cook County, IL
https://docs.analytics-data-where-house.dev/
GNU Affero General Public License v3.0

Fix great_expectations workflow to be run from the airflow_scheduler container #124

Closed · MattTriano closed this 1 year ago

MattTriano commented 1 year ago

Earlier in development, great_expectations expectation and checkpoint development was done in a separate container/service (py-utils), which was also in charge of some initial setup tasks. That was cumbersome and clunky, so when I refactored the startup process to just work with a venv, I eliminated that container. It looks like I made some motions toward serving jupyterlab from the airflow-scheduler service (namely installing the relevant packages in the airflow-related image and setting the GE_JUPYTER_CMD env-var that was previously used in the py-utils service), but if I figured out how to get GE running in the airflow-scheduler service's container back then, I should have documented and memorialized the process, because permissions errors (related to the airflow Dockerfile changing the user from root to airflow) now prevent the container from serving jupyter notebook/lab.

user@host_in_container:~$ jupyter lab --no-browser --port=18888
...
notebook_shim | error linking extension: [Errno 13] Permission denied: '/home/airflow/.local/share/jupyter/runtime'
    Traceback (most recent call last):
      File "/home/airflow/.local/lib/python3.9/site-packages/traitlets/traitlets.py", line 656, in get
        value = obj._trait_values[self.name]
    KeyError: 'browser_open_file'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/airflow/.local/lib/python3.9/site-packages/traitlets/traitlets.py", line 656, in get
        value = obj._trait_values[self.name]
    KeyError: 'runtime_dir'
...
PermissionError: [Errno 13] Permission denied: '/home/airflow/.local/share/jupyter/runtime'
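
For context, the user/ownership mismatch is easy to confirm from the host by checking the effective user and the ownership of the directory jupyter is trying to write into (just a diagnostic sketch; the exact uid/gid values depend on how the airflow image was built):

    docker compose exec airflow-scheduler /bin/bash -c "id && ls -ld /home/airflow/.local/share/jupyter"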

From researching this, I've found the default paths jupyter uses to run a notebook/lab server, as well as the names of the env-vars that can point it at different locations.

user@host_in_container:~$ jupyter --paths
config:
    /home/airflow/.jupyter
    /home/airflow/.local/etc/jupyter
    /usr/local/etc/jupyter
    /etc/jupyter
data:
    /home/airflow/.local/share/jupyter
    /usr/local/share/jupyter
    /usr/share/jupyter
runtime:
    /home/airflow/.local/share/jupyter/runtime
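
For reference, jupyter respects the JUPYTER_CONFIG_DIR, JUPYTER_DATA_DIR, and JUPYTER_RUNTIME_DIR env-vars for relocating those paths, and an override can be sanity-checked inside the container before committing to anything (a sketch; /opt/airflow/.jupyter is just the writable prefix I settled on below):

    user@host_in_container:~$ JUPYTER_CONFIG_DIR=/opt/airflow/.jupyter \
        JUPYTER_RUNTIME_DIR=/opt/airflow/.jupyter/share/jupyter/runtime \
        jupyter --paths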

I've hacked together a strategy for sorting this out, but it feels a bit fragile. It involves setting the following env-vars in the .env file

JUPYTER_CONFIG_DIR="/opt/airflow/.jupyter"
JUPYTER_RUNTIME_DIR="/opt/airflow/.jupyter/share/jupyter/runtime"

and adding a makefile recipe that creates those dirs before starting up a jupyterlab server

serve_great_expectations_jupyterlab:
    docker compose exec airflow-scheduler /bin/bash -c \
        "mkdir -p /opt/airflow/.jupyter/share/jupyter/runtime &&\
        cd /opt/airflow/great_expectations/ &&\
        jupyter lab --ip 0.0.0.0 --port 18888"

but it feels really hacky to make the dirs at runtime this way rather than creating them in the Dockerfile. Still, I did try making those dirs in the Dockerfile, and despite many chmodding experiments, my attempts failed because the airflow user is not the same as the default user I get when I open a shell in the container.
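
Another option, rather than relocating the dirs via env-vars, might be to repair the ownership of the default runtime path by exec-ing into the container as root (a sketch I haven't adopted; it assumes `docker compose exec -u root` works against the scheduler image, and the fix would only last until the container is recreated):

    docker compose exec -u root airflow-scheduler /bin/bash -c \
        "mkdir -p /home/airflow/.local/share/jupyter/runtime &&\
        chown -R airflow /home/airflow/.local/share/jupyter"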

MattTriano commented 1 year ago

At present, running that makefile recipe starts up a jupyterlab server in the /opt/airflow/great_expectations/ dir (which is mounted as a volume), but you still have to manually copy the server's URL and paste it into a browser.
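
If it helps, the server's URL (token included) can usually be pulled back out with jupyter's own listing command rather than scrolling the startup logs (a sketch; this assumes the jupyter_server-based CLI that newer jupyterlab versions ship with):

    docker compose exec airflow-scheduler jupyter server list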

MattTriano commented 1 year ago

Looks like I also need to set the JUPYTER_DATA_DIR env-var if I want to be able to save notebooks. I set it to /opt/airflow/.jupyter/share/jupyter.
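
So the full set of jupyter-related entries in the .env file ends up looking like this (just consolidating the values from above):

JUPYTER_CONFIG_DIR="/opt/airflow/.jupyter"
JUPYTER_DATA_DIR="/opt/airflow/.jupyter/share/jupyter"
JUPYTER_RUNTIME_DIR="/opt/airflow/.jupyter/share/jupyter/runtime"

and, assuming the compose file forwards these .env values to the airflow services, whether they actually reach the scheduler container can be checked with

docker compose exec airflow-scheduler env | grep JUPYTER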