Build fully featured Docker container

szahedian commented 2 years ago

Following https://github.com/gentzkow/template/issues/54#issuecomment-1145038002, in this issue we will:

Compile a list of all non-conda dependencies that could be required for current or future projects. Off the top of my head, this would include Stata and LyX.

Write a Dockerfile that builds an environment supporting conda and non-conda dependencies. We have experience with this in https://github.com/gentzkow/template/issues/43.

The main improvement over #43 is the addition of Stata, using ideas from AEA Stata for Docker.

szahedian commented 2 years ago

Pushed first draft of Docker image. The image can run Stata and has a conda environment with dependencies installed.

To run this, a new user may have to make a few tweaks to their Docker configuration settings, namely:

Increasing Docker's allowable memory usage.
Allowing Docker access to /Applications and ~/Documents, i.e., wherever stata.lic lives and wherever the replication files live.

In its current state, this container has much of what we need. Next steps:

Add R conda dependencies to separate environment file and install those into conda after creating environment.
Investigate why AEA Stata template sets user to statauser:stata (see here).
Check that there are no additional dependencies that ought to be included.
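
A minimal sketch of the first next step above (keeping R dependencies in a separate environment file and installing them after the environment is created). It only assembles the two conda invocations as argument lists and does not run them. The environment name `template` and the file name `conda_env_r.yaml` are hypothetical; `conda_env.yaml` is the file name mentioned later in this thread.

```python
from typing import List


def conda_commands(env_name: str, base_file: str, r_file: str) -> List[List[str]]:
    """Return two conda invocations: create the environment from the base
    file, then update the same environment with the R dependencies file."""
    create = ["conda", "env", "create", "--name", env_name, "--file", base_file]
    update = ["conda", "env", "update", "--name", env_name, "--file", r_file]
    return [create, update]


if __name__ == "__main__":
    # Hypothetical names, used only for illustration.
    for cmd in conda_commands("template", "conda_env.yaml", "conda_env_r.yaml"):
        print(" ".join(cmd))
```

Whether a two-step install is faster than solving a single combined file would need to be timed; the sketch does not settle that question.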

jc-cisneros commented 2 years ago

I managed to replicate @szahedian's latest results. I did need to increase Docker's allowable memory usage to 10 GB and allow access to /Applications to successfully build the image.

Some comments on the next steps:

I tested adding R and the R requirements to the conda_env.yaml file and building the container image with that configuration. This takes a long time to run (it had not finished after 25 minutes). Would it make a significant difference in runtime if we install them while building the container image vs. after the image is already built?
Setting "USER root" gives the root access required to run apt-get on Debian Linux (which is used to install LyX and Miniconda). The AEA template's "USER statauser:stata" is probably an instance of the generic Docker form "USER userid:groupid", which the Docker daemon tracks, granting permissions accordingly (see here).
We need to add git-lfs. This can be added to the "install preliminaries" layer in the Dockerfile.

Proposed next steps:

Add R dependencies to conda environment. Check if updating the conda environment once the container has already been built is a better option than simply running build_docker.sh on an extended conda_env.yaml file.
Extend the Docker container to work in any High Performance Computing infrastructure (e.g. Sherlock). This would require using Singularity and adapting the code so that it does not require root access (see Sherlock Documentation).

jc-cisneros commented 2 years ago

@gentzkow @szahedian

@snairdesai and I made progress on this issue. Following #59, we updated the Dockerfile to support both Python and R dependencies. Similar to what was reported in the local macOS case, the computation times improved considerably (the environment was created in roughly 6 minutes when building the Docker container). We also added the installation of git-lfs to its respective layer in the Dockerfile. I opened PR https://github.com/gentzkow/template/pull/60 to bring these changes to the development branch (@snairdesai and I are not collaborators on gentzkow/template).

Next steps:

Work on a draft that includes all the instructions to set up and use the Docker container. These instructions should be contained within an additional section in gentzkow/template. We are planning to work on this in the Colab file Proposed Changes Markdown.ipynb from /gslab-econ/ra-manual #18, but we can also open a PR here in gentzkow/template if we are added as collaborators.

gentzkow commented 2 years ago

Thanks! Next steps sound good. I added @jc-cisneros and @snairdesai as collaborators.

snairdesai commented 2 years ago

@gentzkow @szahedian

Noting that @jc-cisneros and I met with the Hunt RAs this morning to discuss Docker integration into the standard workflow. They are now getting familiar with the template process within Docker, while we address their issues with the standard process (in #62).

We are also in the final stages of debugging Docker usage on Windows, and hope to have an updated proposal for these users shortly.

jc-cisneros commented 2 years ago

@gentzkow @szahedian

I am happy to announce that the container runs successfully on both macOS and Windows. I will describe the latest updates and list the next steps.

Updates:

The image is now built to be used by a non-root user. Within the container, users can only write to folders they own or where they have write permissions (this is configured in the Dockerfile). The user has their own home directory, and the working directory is currently a mount of the project folder (e.g., the 'template' folder on your local computer that is created after running git clone <repo on GitHub>). This new configuration makes the image compatible with Singularity (which is used on Sherlock).
Changed the LyX version installed (from Ubuntu to Debian).
Run scripts now load SSH keys, the Stata license, and the local Dropbox folders as read-only volumes. We don't want the user to modify any of those files.
Conda is already initialized (and the shell restarted). Users would only need to conda activate their project's environment.
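
A minimal sketch of how a run script could assemble such a `docker run` call: the project folder is mounted read-write as the working directory, and everything else is mounted with the `:ro` suffix so the container cannot modify it. This is not the contents of the actual run script; the function name, the container paths, and the default working directory are all hypothetical.

```python
from typing import Dict, List


def docker_run_command(
    image: str,
    project_dir: str,
    read_only_mounts: Dict[str, str],
    workdir: str = "/home/user/template",  # hypothetical container path
) -> List[str]:
    """Build a `docker run` argument list with a writable project mount
    and a set of read-only mounts (host path -> container path)."""
    cmd = ["docker", "run", "--rm", "-it"]
    # Project folder: writable, and used as the working directory.
    cmd += ["--volume", f"{project_dir}:{workdir}"]
    # SSH keys, Stata license, Dropbox folders, etc.: read-only.
    for host_path, container_path in read_only_mounts.items():
        cmd += ["--volume", f"{host_path}:{container_path}:ro"]
    cmd += ["--workdir", workdir, image]
    return cmd


if __name__ == "__main__":
    # All paths and the image name below are placeholders.
    mounts = {
        "/Applications/Stata/stata.lic": "/usr/local/stata/stata.lic",
        "/Users/me/.ssh": "/home/user/.ssh",
    }
    print(" ".join(docker_run_command("example/template", "/Users/me/template", mounts)))
```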

Next steps:

Test new developments with Hunt's RAs. Conveniently, they have both macOS and Windows laptops to robustly check the container.
@snairdesai and I have already drafted instructions for using Docker (both for general users and for lab members who would also need to build images for new projects). We also have a proposed "Docker Extension" for the Practice Tasks. We will polish this draft with feedback from Hunt's RAs and other SIEPR Predocs.

gentzkow commented 2 years ago

@jc-cisneros Thanks! This is exciting.

A question I'd be curious to get your thoughts on as we think about whether we want to use Docker as our default dev environment vs. only for releasing replication archives etc.:

To what extent would you expect that this will add another layer of failure points / debugging? Something we've run into again and again is that changes that are intended to make life easier can also create lots of new problems. The business with solving Conda environments is a good example of this. Problems are especially costly if only a small share of users know how to debug them. So I'm wondering how often we're going to see cases where some RA somewhere is working away and gets a Docker related error and then we have a very hard time figuring out how to debug that.

snairdesai commented 2 years ago

@gentzkow @jc-cisneros

(i) We think Docker will make reproducibility much easier for individuals external to the lab, both because of ease of use/accessibility, and because of additional features allowing for cross-platform compatibility. (ii) We acknowledge that for internal lab use (development), the process of building and pushing images will be challenging, and may increase the burden on RAs. We would need the majority of RAs (if not all) to become familiar with this process, as each project will require its own image to be built. (iii) We have developed an approach for testing the robustness of Docker across devices and users, meant to ensure both clarity and generalizability and to address some of your concerns above (outlined below).

We also think that regardless of whether we decide to use Docker for default development or for replication archives, it would be useful for lab members to become familiar with the environment.

We anticipate that within this development pipeline, building images in Docker has the highest probability of creating issues for RAs. Once the image is built, pulling and running containers is very straightforward - even for new users, or those on other operating systems (e.g., Windows) who have struggled with the standard template.

The process of building images does have a learning curve and is not straightforward. We hope that the skeleton we have built in template (i.e., the Dockerfile and the build_container.sh and run_container.sh scripts) will generalize to other projects. However, as you note, currently only @jc-cisneros, @szahedian, and I understand how to build and push these images to DockerHub. In particular, if we need to add new software (e.g., Mathematica) to an image, especially software that is niche or non-standard, we may run into issues.

We've been considering how to step through these issues. Here's what we are planning:

When it comes to running containers from built images (relevant for public release):

[x] We will shortly walk test subject(s) (e.g., a SIEPR RA) who hope to use template going forward through the approach in detail. The goal here is for us to be able to clearly communicate, verbally, the purpose and features of Docker to someone entirely unfamiliar with the setup (clarity).
[ ] We will then give the proposed revisions to the template instructions to other test subject(s), with no instructions beyond what is written in the revised template proposal. The goal here is for the revised template instructions to be detailed and directive enough to follow without raising any flags for the users (clarity).

When it comes to building images (relevant for internal lab development):

[x] The two of us have been re-installing and testing Docker on multiple Mac/Windows local machines. This has included rebuilding images multiple times with numerous edits (generalizability).
[ ] We are now at a stage where we can work with a more advanced set of users (i.e., Hunt RAs; potentially Jesse RAs) to build new images and test workflow across a variety of distinct projects. We want to ensure these users can build images for numerous projects (generalizability), and do so using the skeleton files in template (clarity).
[ ] Following this, we hope to discuss how these experiments went with @gentzkow/Ana/Saam/Benji/Hannah, and jointly determine whether to use Docker solely for public release, or for both release and local development.

Happy to add/remove any steps from this testing pipeline if helpful.

gentzkow commented 2 years ago

Excellent. Thanks!


snairdesai commented 2 years ago

Update: Today we verbally described the purpose/features of Docker to a Heidi RA in the lab, and had them run the template container. It ran as expected, and took ~7 minutes to run template in its entirety. Seems like the process was straightforward and they found it valuable, so a promising start!

gentzkow commented 2 years ago

Thanks. Can you give me an idea of how the template could take 7 minutes to run? I would have thought all the scripts should be basically instantaneous.

snairdesai commented 2 years ago

@gentzkow

Apologies, that message was unclear - we are including the time from when a first-time user sits down with the Docker instructions until they successfully run template, including all initial setup steps. The run_all.py script is still nearly instantaneous.

The benefit of Docker is that individuals do not have to download any of our standard application requirements (i.e., Python, git-lfs, LyX, R, Stata), download/install conda, or set up command line usage. This will save a substantial amount of time for first-time users.

When using Docker, template setup takes around 7 minutes for new users because of the time it takes (1) to pull the Docker image we created with the relevant application installations from above and (2) to build the conda environment within the Docker container. Once the user has run template once within our Docker container, all of the applications will be cached and the conda environment will already be built, so the runtime will be substantially shorter.

We just timed how long it takes for users who have already built the image once. The entire process (starting from cloning template from GitHub through running run_all.py in the Docker container) took 2 minutes and 20 seconds. The run_all.py script takes ~30 seconds to run.

gentzkow commented 2 years ago

Got it. Thanks!

Is the ~30 seconds for run_all.py different from the time it takes when we're running it outside of Docker?

snairdesai commented 2 years ago

It's roughly the same as the time it takes in the standard template process. It differs for everyone depending on their local computing power, but Docker likely will not make the scripts any more/less efficient to run.
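
For anyone who wants to check this on their own machine, here is a minimal sketch of a wall-clock timer that can be run once inside the container and once outside it. The helper name `time_command` is hypothetical, and the example times a trivial Python command; substituting the call to run_all.py is left to the user.

```python
import subprocess
import sys
import time
from typing import List


def time_command(cmd: List[str]) -> float:
    """Run a command to completion and return the elapsed wall-clock seconds.
    Raises CalledProcessError if the command exits with a non-zero status."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder command; replace with e.g. [sys.executable, "run_all.py"].
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"elapsed: {elapsed:.2f} s")
```

Running the same command a few times in each environment and comparing the spread gives a fairer picture than a single measurement.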

snairdesai commented 1 year ago

@gentzkow: I'm closing out this issue for the sake of housekeeping. @jc-cisneros and I do hope to revisit Docker, and also to experiment with GitHub Codespaces as an alternative development environment. We'll keep you posted on our progress here, and open new issues when needed.

snairdesai commented 1 year ago

Summary + Deliverables

In this issue (#56), @szahedian, @jc-cisneros, and I experimented with Docker as an alternative environment for the lab's development workflow. Docker has promising features that might support its future use in lab projects, which @jc-cisneros and I will explore further. The advent of GitHub Codespaces and its integration with Docker is also an exciting prospect for future workflow.

A summary comment of the issue's original purpose can be found here. This Colab file provides an overview of some of the proposed changes @jc-cisneros and I envision if we choose to integrate Docker into our workflow at a later date. In particular, see the sections "Working with Docker containers" and "Docker development workflow". A comment outlining some of the benefits of Docker for our workflow can be found here.

The commits in this branch (issue56-full_docker) include a minimally featured, but executable, Docker repository. We will leave this branch alive pending further work.

Stable link to issue branch here.
