Best practices for writing ETL and ELT pipelines

This repository contains the code for the webinar demo shown in: Best practices for writing ETL and ELT pipelines.

Watch the webinar here for free!

This repository is configured to spin up 6 Docker containers when you run astro dev start (see Install the Astro CLI).

The containers are:

To connect Airflow to both the Postgres database and MinIO, create a .env file in the root directory of the project with the exact contents of the .env.example file. Note that you need to restart the Airflow instance with astro dev restart after creating the .env file for the changes to take effect.

All the DAGs run without any further setup or tools needed!

Content

This repository contains:

All supporting SQL code is stored in the include folder.

The SQL code is deliberately repetitive for demo purposes, so you can modify the code for one DAG without affecting the others. In a real-world scenario, you would likely modularize the SQL code further to avoid repetition.

How to run the demo

  1. Fork and clone this repository.
  2. Make sure you have the Astro CLI installed and that Docker is running.
  3. Copy the .env.example file to a new file called .env. If you want to use a custom XCom backend with MinIO, uncomment the last 4 lines in the .env file.
  4. Run astro dev start to start the Airflow instance. The webserver with the Airflow UI will be available at localhost:8080. Log in with the credentials admin:admin.
  5. Run any DAG. They are all independent of each other.
  6. Use the query_tables DAG to check the number of records in the tables.
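Steps 3 and 4 can also be scripted. The following is a minimal sketch, assuming it is run from the root directory of the cloned repository; the function name setup_env is hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of the repository):
# create .env from .env.example without overwriting an existing .env.
setup_env() {
  if [ ! -f .env.example ]; then
    echo "no .env.example found; run this from the repository root" >&2
    return 1
  fi
  if [ -f .env ]; then
    # Keep any local edits, such as an uncommented XCom backend section.
    echo ".env already exists, leaving it untouched"
  else
    cp .env.example .env
    echo "created .env from .env.example"
  fi
}
```

A typical call would be setup_env && astro dev start, or setup_env && astro dev restart if the Airflow instance is already running.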

If you'd like to directly interact with the Postgres database, you can use the following commands to connect to the database:

docker ps

This command will list all the running containers. Look for the container with the image postgres:15.4-alpine. Copy the container ID (in the format 30cfd7660be9) and run the following command:

docker exec -it <container_id> psql -U postgres

You are now in a psql session connected to the Postgres database. You can list the tables with the command \dt and query the tables with SELECT * FROM <table_name>;.
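The container lookup and the psql call can be combined so that the container ID does not have to be copied by hand. This is a rough sketch, assuming exactly one running container uses the postgres:15.4-alpine image mentioned above; the function names pg_container_id and pg_run are hypothetical.

```shell
# Hypothetical helper: print the ID of the first running container
# started from the postgres:15.4-alpine image.
pg_container_id() {
  docker ps --filter "ancestor=postgres:15.4-alpine" --format '{{.ID}}' | head -n 1
}

# Hypothetical helper: run a single SQL statement or psql meta-command
# in that container and print the result.
pg_run() {
  container_id="$(pg_container_id)"
  if [ -z "$container_id" ]; then
    echo "no running postgres:15.4-alpine container found" >&2
    return 1
  fi
  # -c executes one command and exits, so no interactive terminal is needed.
  docker exec "$container_id" psql -U postgres -c "$1"
}
```

For example, pg_run '\dt' lists the tables, and pg_run 'SELECT COUNT(*) FROM <table_name>;' counts the records in one table.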

Resources