DataExpert-io / data-engineer-handbook

This is a repo with links to everything you'd ever want to learn about data engineering
20.25k stars 3k forks source link

[1-dimensional-data-modeling] env_file section should not have example.env? #137

Open gitgithan opened 3 days ago

gitgithan commented 3 days ago

This issue is both a suggestion and contains questions.

I was wondering why there were 2 items under env_file. I was expecting example.env to be a template to be copied and tweaked, and not actually used by a running docker. Another issue is having example.env makes users think docker-compose can interpolate docker-compose.yml, but my experiments show otherwise.

I investigated how precedence works, and what happens if multiple env_file contain the same variables. I found https://github.com/docker/compose/blob/v1/docs/Compose%20file%20reference%20(legacy)/version-3.md (this is version-3 readme, not sure what version our docker-compose.yml is) saying

Keep in mind that the order of files in the list is significant in determining the value assigned to a variable that shows up more than once. The files in the list are processed from the top down. For the same variable specified in file a.env and assigned a different value in file b.env, if b.env is listed below (after), then the value from b.env stands.

This description does not match the behavior in our setup. (see 2nd photo)

I checked that env | sort locally does not contain HOST_PORT to make the experiment accurate.

In 1st photo, I defined HOST_PORT only in example.env. I expected docker-compose to pick it up, but you can see the terminal output of 1st run saying HOST_PORT is not set and a random port 4041 was assigned locally.

I had to run a 2nd command with additional --env-file example.env for it to read the HOST_POST from example.env, so this goes against the docs above.

Also i'm confused why docker-compose config shows that without --env-file example.env it actually recognizes 5434 in example.env. I thought env_file section in docker-compose.yml and --env-file on cli are the same.

image

If i uncomment HOST_PORT in .env to make variables from multiple env files clash:

image

Notice config output stays at 5434 from example.env but the real port started is on 5433 following .env. Also this command was started without --env-file example.env.

Summary of what's confusing.

  1. docker-compose config and docker ps showing different ports (2nd photo). docker-compose config matches docs saying later file in list of env_file is read.
  2. HOST_PORT variable not set warning when it's already set in example.env, but can run properly when adding --env-file example.env on CLI (photo 1).

Related issue may be whether variables defined in env files can be interpolated at docker-compose.yml parsing time, or only later during container runtime. I can't find explicit mention whether files under env_file will be interpolated into docker-compose.yml. According to https://docs.docker.com/compose/how-tos/environment-variables/variable-interpolation/#ways-to-set-variables-with-interpolation, it works only if the variable is in .env, and env_file in yml is completely ignored. https://www.reddit.com/r/docker/comments/17ekhqy/comment/k658k18/ PaintDrinkingPete here also says the same.

In this case we want variables for both docker-compose.yml interpolation, and for container runtime variables (those under environment key). example.env is not helping with docker-compose.yml interpolation, thus suggest removing it from env_file section

gerrykou commented 1 day ago

I think this is an overcomplication of things.
This is a non production setup, so the easiest would be to just keep the example.env file there. If we want to highlight that .env files should not be exposed in github etc, we should keep only the .env file in the docker-compose.yaml. The Makefile will need to be updated too, to adapt to this change.

gitgithan commented 1 day ago

@gerrykou Thanks for raising the github exposure point. That clarifies that there are 2 separate questions.

  1. Should env files be in coded in docker-compose.yml
  2. Should env files be committed to git

Question 1 Both env files do not need to be there, in fact even the env_file key can be deleted too (I tested this) because .env is automatically searched for. (my suggestion in last paragraph of 1st message could be improved)

https://docs.docker.com/compose/how-tos/environment-variables/set-environment-variables/#use-the-env_file-attribute

A container's environment can also be set using .env files along with the env_file attribute.

This language implies that env_file is only needed if extra env files not called .env are used, otherwise, docker can automatically search .env placed at the root of the project directory next to the docker-compose.yml.

One argument for env_file section being there is for it to be self-documenting, like how EXPOSE doesn't actually bind ports (-p 8080:80 does the job) but documents what is exposed. However, I don't see any examples on docker docs listing .env under env_file section.

Thus, I would prefer the entire env_file section to be removed, and let docker automatically search .env. (PR https://github.com/DataExpert-io/data-engineer-handbook/pull/150 still has env_file section). This teaches students that "automatic discovery" is a thing, which helps them debug issues with pytest automatically discovering test folders by observing locations of __init__.py in future.

Question 2 In real work .env would not be on github, and example.env would be on github but with values removed. It was good the README required students to create .env from example.env to express the concept that example.env is a template, and I now realize it's just easier for bootcamp scaling/self-service purposes to prepopulate the values of example.env already.


Further improvements possible but not applicable to bootcamp

This bootcamp has only 1 env file (excluding template). If there are ever 2 env files (excluding template), then further specifications on optionality can be done: https://docs.docker.com/compose/how-tos/environment-variables/set-environment-variables/#additional-information-1

Would love to learn about how teachers can set things up correctly (demonstrating not just the end state, but real workflows) when teaching infrastructure topics, and similar problems faced.

Since this bootcamp aims to train engineers who can get jobs, and interviews would have rounds of code review, a student may fail in the operational details when asked to explain code, even if he gets all the concepts in the bootcamp, thus the emphasis on having a reason for every line of code.

gerrykou commented 1 day ago
  1. Yeah agree, as you said, i would keep .env file in the docker compose file for self-documenting.
  2. example.env file should be committed to git for documentation too, having dummy-placeholder or development values only.

So my suggestion on PR #150 is how i have done it and i would do it in my workflow.