Pushed first draft of Docker image. The image can run Stata and has a conda environment with dependencies installed.
To run this, a new user may have to make a few tweaks to their Docker configuration settings. Namely /Applications and ~/Documents, i.e. wherever stata.lic lives and wherever the replication files live.
In its current state, this container has much of what we need. Next steps:
statauser:stata (see here).
I managed to replicate @szahedian's latest results. I effectively needed to increase Docker's allowable memory usage to 10 GB and allow access to /Applications to successfully build the image.
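For anyone setting this up, the file-sharing requirement amounts to bind-mounting the Stata license and the replication files into the container. Below is a minimal Python sketch that only assembles the `docker run` command without executing it. The image name, both host paths, the in-container license path, and the `/project` mount point are assumptions for illustration, not the values used in this repository.

```python
import os
import shlex


def docker_run_command(license_file, project_dir, image):
    """Assemble (but do not execute) a `docker run` command line.

    The container-side paths below are illustrative assumptions.
    """
    license_file = os.path.expanduser(license_file)
    project_dir = os.path.expanduser(project_dir)
    return [
        "docker", "run", "--rm", "-it",
        # Bind-mount the Stata license (hypothetical container path).
        "-v", f"{license_file}:/usr/local/stata/stata.lic",
        # Bind-mount the replication files (hypothetical mount point).
        "-v", f"{project_dir}:/project",
        image,
    ]


cmd = docker_run_command(
    "/Applications/Stata/stata.lic",  # hypothetical license location
    "~/Documents/template",           # hypothetical project location
    "example/template:latest",        # hypothetical image name
)
print(shlex.join(cmd))
```

Both host directories still have to be shared with Docker in its settings before the mounts will work.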
Some comments on the next steps:
apt-get on Debian Linux (which is used to install LyX and Miniconda). The AEA template sets "USER statauser:stata", probably meaning a generic Docker combination "USER userid:groupid" that is tracked by the Docker daemon, with permissions granted accordingly (see here).
Proposed next steps:
build_docker.sh on an extended conda_env.yaml file.
@gentzkow @szahedian
@snairdesai and I made progress on this issue. Following #59, we updated the Dockerfile to support both Python and R dependencies. As was reported for the local macOS case, computation times improved considerably (the environment was created in roughly 6 minutes when building the Docker container). We also added the installation of git-lfs in its own layer in the Dockerfile. I opened PR https://github.com/gentzkow/template/pull/60 to bring these changes to the development branch (@snairdesai and I are not collaborators in gentzkow/template).
Next steps:
gentzkow/template. We are planning to work on this in the Colab file Proposed Changes Markdown.ipynb from /gslab-econ/ra-manual #18, but we can also open a PR here in gentzkow/template if we are added as collaborators.
Thanks! Next steps sound good. I added @jc-cisneros and @snairdesai as collaborators.
@gentzkow @szahedian
Noting that @jc-cisneros and I met with the Hunt RAs this morning to discuss Docker integration into the standard workflow. They are now getting familiar with the template process within Docker, while we address their issues with the standard process (in #62).
We are also in the final stages of debugging Docker usage on Windows, and hope to have an updated proposal for these users shortly.
@gentzkow @szahedian
I am happy to announce that the container runs successfully on both macOS and Windows. I will describe the latest updates and list the next steps.
Updates:
git clone <repo on GitHub>). This new configuration makes the image compatible with Singularity (which is used on Sherlock).
conda activate their project's environment.
Next steps:
@jc-cisneros Thanks! This is exciting.
A question I'd be curious to get your thoughts on as we think about whether we want to use Docker as our default dev environment vs. only for releasing replication archives etc.:
To what extent would you expect that this will add another layer of failure points / debugging? Something we've run into again and again is that changes that are intended to make life easier can also create lots of new problems. The business with solving Conda environments is a good example of this. Problems are especially costly if only a small share of users know how to debug them. So I'm wondering how often we're going to see cases where some RA somewhere is working away and gets a Docker-related error and then we have a very hard time figuring out how to debug that.
@gentzkow @jc-cisneros
(i) We think Docker will make reproducibility much easier for individuals external to the lab, both because of ease of use/accessibility, and because of additional features allowing for cross-platform compatibility.
(ii) We acknowledge that for internal lab use (development), the process of building and pushing images will be challenging, and may increase the burden on RAs. We would need the majority of (if not all) RAs to become familiar with this process, as each individual project will require its own image to be built.
(iii) We have developed a testing approach for robustness of Docker across devices and users which is meant to ensure both clarity and generalizability to address some of your concerns above (outlined below).
We also think that regardless of whether we decide to use Docker for default development or for replication archives, it would be useful for lab members to become familiar with the environment.
We anticipate that within this development pipeline, building images in Docker has the highest probability of creating issues for RAs. Once the image is built, pulling and running containers is very straightforward, even for new users or those on other operating systems (e.g., Windows) who have struggled with the standard template.
The process of building images does have a learning curve, and is not straightforward. We hope that the skeleton we have built in template (i.e., the addition of the Dockerfile, build_container.sh, and run_container.sh scripts) will be generalizable to other projects. However, as you note, currently only @jc-cisneros, @szahedian, and I understand how to build and push these images to DockerHub. In particular, if we need to add new software (e.g., Mathematica) to an image (especially software that is niche or non-standard) we may run into issues.
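To give a sense of the two steps involved in building and pushing, here is a minimal Python sketch that only assembles the corresponding Docker commands. It does not reproduce build_container.sh, and the repository name and tag are hypothetical.

```python
import shlex


def build_and_push_commands(repository, tag, context="."):
    """Return the `docker build` and `docker push` command lines for one image."""
    image = f"{repository}:{tag}"
    return [
        # Build the image from the Dockerfile in `context` and tag it.
        ["docker", "build", "--tag", image, context],
        # Upload the tagged image to the registry (e.g. DockerHub).
        ["docker", "push", image],
    ]


# Hypothetical repository name and tag, for illustration only.
for command in build_and_push_commands("example/template", "v0.1"):
    print(shlex.join(command))
```

Pushing also requires being logged in to the registry with write access to the repository, which is part of what each RA would need to set up.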
We've been considering how to step through these issues. Here's what we are planning:
When it comes to running containers from built images (relevant for public release):
template proposal. The goal here is for the revised template instructions to be detailed and directive enough to operate without raising any flags for the users (clarity).
When it comes to building images (relevant for internal lab development):
template (clarity).
Happy to add/remove any steps from this testing pipeline if helpful.
Excellent. Thanks!
Update: Today we verbally described the purpose/features of Docker to a Heidi RA in the lab, and had them run the template container. It ran as expected, and took ~7 minutes to run template in its entirety. Seems like the process was straightforward and they found it valuable, so a promising start!
Thanks. Can you give me an idea of how the template could take 7 minutes to run? I would have thought all the scripts should be basically instantaneous.
@gentzkow
Apologies, that message was unclear. We are counting the time from when a first-time user sits down with the Docker instructions until they successfully run template, including all initial setup steps. The run_all.py script is still nearly instantaneous.
The benefit of Docker is that individuals do not have to download any of our standard application requirements (i.e., Python, git-lfs, LyX, R, Stata), download/install conda, or set up command line usage. This will save a substantial amount of time for first-time users.
When using Docker, template setup takes around 7 minutes for new users because of the time it takes to (1) pull the Docker image we created with the application installations listed above and (2) build the conda environment within the Docker container. Once the user has run template once within our Docker container, all of the applications will be cached and the conda environment will already be built, so the runtime will be substantially shorter.
We just timed how long it takes for users who have already built the image once. The entire process (starting from cloning template from GitHub through running run_all.py in the Docker container) took 2 minutes and 20 seconds. The run_all.py script takes ~30 seconds to run.
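For anyone who wants to reproduce a breakdown like this, a small timing harness along the following lines would do. It is a sketch, not the script we used: the single step shown is a stand-in so that the code runs anywhere, and the real steps (pulling the image, building the conda environment, running run_all.py) would need to be substituted in.

```python
import subprocess
import sys
import time


def time_step(label, command):
    """Run one command and return (label, elapsed seconds)."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return label, time.perf_counter() - start


# Stand-in step so the sketch runs anywhere; replace with the real steps,
# e.g. ["docker", "pull", "<image>"] or ["python", "run_all.py"].
steps = [("placeholder step", [sys.executable, "-c", "pass"])]

timings = [time_step(label, command) for label, command in steps]
for label, seconds in timings:
    print(f"{label}: {seconds:.1f} s")
print(f"total: {sum(seconds for _, seconds in timings):.1f} s")
```

Running the same list of steps once on a fresh machine and once after the image is cached separates the one-time setup cost from the recurring run time.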
Got it. Thanks!
Is the ~30 seconds for run_all.py different from the time it takes when we're running it outside of Docker?
It's roughly the same as the time it takes in the standard template process. It differs for everyone depending on their local computing power, but Docker likely will not make the scripts any more/less efficient to run.
@gentzkow: I'm closing out this issue for the sake of housekeeping. @jc-cisneros and I do hope to revisit Docker, and also play around with GitHub Codespaces as an alternative development environment. We'll keep you posted on our progress here, and open new issues when needed.
In this issue (#56) @szahedian, @jc-cisneros, and I experimented with Docker as an alternative environment for the lab's development workflow. There are promising features of Docker which might support its future use in lab projects, which @jc-cisneros and I will explore further. The advent of GitHub Codespaces and its integration with Docker is also an exciting prospect for future workflow.
A summary comment of the issue's original purpose can be found here. This Colab file provides an overview of some of the proposed changes @jc-cisneros and I are envisioning if we choose to integrate Docker into our workflow at a later date. In particular, reference the sections: "Working with Docker containers" and "Docker development workflow". A comment outlining some of the benefits of Docker for our workflow can be found here.
The commits in this branch (issue56-full_docker) include a minimally featured, but executable, Docker repository. We will leave this branch alive pending further work.
Following https://github.com/gentzkow/template/issues/54#issuecomment-1145038002, in this issue we will
The main improvement over #43 is the addition of Stata, using ideas from AEA Stata for Docker.