gentzkow / lab-manual-archive

2 stars 2 forks source link

Review Sherlock Extension on Practice Task #18

Closed jc-cisneros closed 2 years ago

jc-cisneros commented 2 years ago

The goal of this issue is to capture feedback from lab members testing the Sherlock Extension on the Practice Task.

jc-cisneros commented 2 years ago

@gentzkow, @anacarolinapq, @snairdesai let me know what you think. Also @anacarolinapq could you please tag other lab members who might be trying out the Sherlock Extension task?

I think the purpose of the task is to give lab members experience with using Sherlock within their workflow. Thus, I think the current task should be modified to fit into the GitHub workflow. That would entail creating a new issue, creating a new branch, cloning the branch on Sherlock, editing a script in a module, submitting a job on Sherlock, committing the change, pushing it to the branch, and creating the pull request.

My suggestion for the extra task would be to cluster the standard errors at the county level on both regressions and adding a note on the tables mentioning the fixed effects the model uses and that the std. errors are clustered at the county level.

Below you can find my comments for each bullet on Sherlock Extension

—————————————————————

This is helpful and clear

—————————————————————

Some comments on the “Installing and loading software” subsection on the referenced RA Manual section:

“Setting up GitHub” was to me the most confusing instruction as there is no guidance on the cited documentation. In essence, what “setting up GitHub” means is to have authorization to perform git commands on your repositories (repositories for ongoing projects are usually private). To do this, it would be useful to point lab members to this page in the GitHub Docs.

—————————————————————

If the lab member successfully connected to GitHub (use ssh -T git@github.com to test ssh connection to GitHub), then this simply requires running git clone <repository SSH link> on the $OAK/<userid> directory.

—————————————————————

The first step is to create a config_user.yaml file in the root directory. To use the config_user.yaml template, you can run the following command from the setup folder: cp config_user_template.yaml /oak/stanford/groups/gentzkow/<user id>/<repo directory>/config_user.yaml

Usually this step can be skipped if we plan to use the default file created by fetching the gslab_make submodule files. Nevertheless, fetching the gslab_make submodules files did not create a config_user.yaml file for me. We should confirm if this happens to other lab members and try to find out why.

The second step is to install miniconda and oracle-idk. As mentioned above, Homebrew is not available on Sherlock. For practice sake, lab users could install both of these packages in their home/users/<user id> directory in Sherlock.

—————————————————————

Before this bullet, we could include checking if the setup process was done correctly as a step. If you load all the required modules in Sherlock, running check_setup.py will give you a “SUCCESS! Setup complete.” message. After that is done, running the data module is done exactly as in the previous task (i.e. running the make.py script on the data module).

—————————————————————

Here I would add a reference to the Research clusters section in the RA Manual. I would even go further to make everyone’s life easier and provide one (or several) bash script templates for use in the lab. The submit.sh bash script I used for the task was:

#!/bin/bash
#
#SBATCH --job-name=run_all
#SBATCH --time=2:00:00
#SBATCH --mem=50GB
#SBATCH --partition=gentzkow
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<sunet id>@stanford.edu
#SBATCH --ntasks=1

ml python/3.6.1
srun python run_all.py

—————————————————————

Additional comments

If we decide to include the extra task (clustering std. errors at county level), it would be useful to add guidance on SSHFS. I provided some info related to SSHFS below.

The task could be done directly on the terminal (i.e. editing the Python script using nano), using SSHFS to edit the script on your preferred IDE, or using the longer process described below:

1) Pull the issue branch to your local machine 2) Edit the python script on an IDE in your local machine 3) Commit and push the changes on the script 4) Pull the modified issue branch to Sherlock 5) Submit a job to run the full repository 6) Commit and push the changes

snairdesai commented 2 years ago

Following up with a few additional comments.

—————————————————————

  • Set up your personal directory under $OAK. You can review the Research Cluster section of the ra-manual if needed.

—————————————————————

  • Set up Dropbox, GitHub and Rclone on Sherlock. You can follow the instructions here and here.

—————————————————————

  • Set up the repository following the instructions in the ReadMe.

—————————————————————

  • Submit a job to run the full repository, following the instructions here.
gentzkow commented 2 years ago

Thanks @jc-cisneros @snairdesai for all the great work here. Having your eyes on how to improve this task is very helpful. I'm happy to defer to you on the details, but the changes you outline above look broadly fine.

One note: Let's keep a clear distinction between instructions that are specific to the practice task and instructions that apply more broadly to setting up and using Sherlock. The latter should live in an appropriate place in the RA manual separate from the practice task. The practice task instructions can then link to them.

snairdesai commented 2 years ago

@jc-cisneros and I made a revised Markdown which can be integrated into the RA manual if helpful. We edited both the Practice Task section and the Research Clusters section. You can find our script here.

Tagging here for edits/feedback: @gentzkow @anacarolinapq.

gentzkow commented 2 years ago

Thanks @snairdesai. This looks good to me on a quick review.

One thing to flag for further discussion: We've run into various points at which the use of Conda for the template causes friction. The whole point of using Conda is to reduce friction. Should we think about making the template more flexible so using Conda is optional rather than required?

Also, can you remind me why we need oracle-jdk?

snairdesai commented 2 years ago

@gentzkow

  1. Regarding the point on Conda - Sherlock support suggested not to install Conda in parallel with the local system module for Python, as this would create conflicts. Their suggestion was either to use Conda independently (without loading in Python), or simply stick with the local Sherlock packages. The latter could make sense, because everyone using Sherlock is generally using the same version of packages (for Python, 3.6.1). I think Conda makes sense for local (non-Sherlock) use, though, to ensure there are no version conflicts across users.
  2. @jc-cisneros and I are looking into why oracle-jdk is needed - just to clarify, are you asking in general, or specifically for Sherlock?
gentzkow commented 2 years ago

(1) OK, I agree about local use. The question is whether we could set up the template so you have the option not to use Conda. I think that might only require giving instructions for how to install the Python and R dependencies listed in setup/conda_env.yaml by hand.

(2) I meant in general. I just forget what we're using that for in the template.

snairdesai commented 2 years ago

@gentzkow @jc-cisneros @anacarolinapq


In response to (1) relating to Conda, we can make Conda optional for users. There are a couple of paths here:

The first option comes with two key drawbacks. First, we have to change the instructions in the README every time a dependency is changed. This could lead to some errors on our side. Also, the burden on the user has increased. We would have to hope they install each dependency correctly, conditional on the instructions being up to date.

The second option might be more robust. The script can first check the user's local application versions (i.e., for Python and R), and compare these to the dependency versions listed in setup/conda_env.yaml. If the versions do not match, or the user has not installed a given software, we can break setup and prompt the user to do so. If the versions do match, we can run a script which installs all the package dependencies in setup/conda_env.yaml directly through pip and CRAN.

@jc-cisneros and I have built a basic script which attempts to do this. You can find it in the last code chunk here. We have also drafted revised template instructions to provide this functionality for non-conda users in the Colab. We can create an issue and move these to GitHub if helpful. The script currently runs successfully with template, notifying the user that they do not have a conda environment active on their local computer.

There are a few drawbacks/comments here:

  1. For many of the conda_env.yaml files across our projects, the specific version numbers of packages are not listed. This will create issues if users do not use conda. For example, in /gentzkow/template, the linearmodels package does not have a version listed. The script we have created above will install the latest version if no version number is specified (in our case, this was 4.27). However, conda may be reading in an older version, even if this is not listed in conda_env.yaml (in our case, this was 4.24). The template will break if these versions conflict. To specify versions for packages, we need to use package==<version number> in conda_env.yaml. The check_local_setup.py script will take care of the rest. If we go with option 1 (instructions for users), we still need to explicitly specify version numbers in conda_env.yaml.
  2. We may be able to integrate this code within check_setup.py, but as of now it is a separate file the user will need to run in the terminal, from the root directory of template (the user will require Python to do this, so we have added this to required applications at the top of template). The command is python check_local_setup.py.
  3. The same problems discussed in prior threads relating to GitHub-specific packages apply. We will still need the user to utilize /setup/setup_r.r for any packages not on CRAN, and /setup/download_stata_ado.do for STATA installations. However, conda users have to do this as well.
  4. We might be wary of packages which are specific to conda, and thus can only be installed with conda-forge rather than pip install. It's been difficult for us to think of these use cases, though.
  5. Lastly, we currently pull in some features from gslab_make (although we can drop these, as they primarily relate to error message formatting). There was a decision in the past (in Issue #44 in gentzkow/template) to move conda activate below check_setup.py. This still makes sense for check_local_setup.py, but we may want to run git submodule init and git submodule update prior to activating conda, to ensure gslab_make is properly initialized.

In response to (2) relating to oracle-jdk, we have not had much luck determining why oracle is needed to run template. When we run the template with conda, but without installing or initializing oracle jdk, everything works as expected. @anacarolinapq and @szahedian were also unsure why oracle-jdk is needed. Here is the earliest commit I could find referencing oracle-jdk in the template (discussed in Issue #30 in gentzkow/template). There don't seem to be any indications of why it is a requirement in the thread.

Our best guess is that oracle-jdk might make the conda setup process slightly faster, but it can likely be dropped.


In terms of next steps, @jc-cisneros and I are now trying to see if it is possible to integrate our build script into Sherlock, since Conda is more likely to cause issues there.

gentzkow commented 2 years ago

Thanks @snairdesai.

My initial reaction is to prefer the first option (manual install) to the second (script). Here's my reasoning:

  1. We expect the non-Conda use case to be relatively rare. Right now we're thinking it would apply to (i) install on Sherlock and other cluster environments and (ii) replication archives. For (i), we expect users will be pretty sophisticated and be able to handle the manual install and debugging of any version conflicts. For (ii), we hope to use Docker containers to control versions.

  2. A script like this is great when it works but can also create an extra failure point to have to debug.

  3. I definitely agree that we don't want to create instructions that need to be updated every time we change conda_env.yaml. But I'd think we could write the instructions in a generic enough way that that shouldn't usually be necessary. I.e., we can give general instructions for installing Python and R components that will apply even if new packages are added to those lists.

Let me know if that sounds right to you or if you think we should discuss further. You all are closer to the details than I am so my intuition may well be off.

On oracle-jdk, I'd vote we remove it from the template. We can revisit if we discover doing so causes problems.

snairdesai commented 2 years ago

@gentzkow

Understood, we were also thinking through the different user groups locally (ii) vs. in Sherlock (i), and how lab members would require less instruction in (i). If Docker will be implemented in the local case (ii), then your point on debugging also makes sense.

@jc-cisneros and I will update the template draft linked above to reflect this (including removing oracle-jdk). Happy to also draft generic instructions for a project as a test case if that would be useful.

We might consider slightly revising the message printed to the user in the case that check_setup.py fails (i.e., when the user both doesn't use conda, and doesn't properly install the package dependencies outlined in general instructions). We could handle the non-conda user case with something like: "Alternatively if you are not using conda, please ensure you have installed all required dependencies."

Also quick follow up to above, is there a reason we don't list the specific version numbers of packages in conda_env.yaml? Our guess is this might create some issues even if we go with the manual approach (unless Docker can resolve this).

gentzkow commented 2 years ago

Great. Thanks.

I like the idea of revising the error message.

I'd also be in favor of including version numbers for all packages.

snairdesai commented 2 years ago

For myself:

We are waiting on resolutions to gentzkow/template #56 and #59 prior to updating template above.

snairdesai commented 2 years ago

@gentzkow We (@jc-cisneros and I) have finished working through three important deliverables here:

1. Rewriting the instructions for the Sherlock Extension Practice Task. 2. Per this comment: creating an example markdown file (setup/dependencies.md) which provides non-conda users with instructions on how to manually install dependencies, both locally and on Sherlock. 3. Per this comment: revising setup/check_setup.py to (a) include a warning message for non-conda users, and (b) to robustly check if the user has installed all relevant package dependencies.

If 1. looks good to you (text is attached below), I will go ahead and make these edits directly in gslab-econ/ra-manual. We have committed the file additions in 2. and 3. in a new branch in gentzkow/template (issue18_ramanual_dependencysolves) in commit b3ed716, issued a new PR in template (#61) and assigned @szahedian as a reviewer. Following these checks, I will close this issue with a summary comment.


Practice Task Addition

2) Sherlock Extension

gentzkow commented 2 years ago

Thanks @snairdesai. This sounds great!

Just a reminder that this task should be lower priority than any ongoing work for FB2020.

snairdesai commented 2 years ago

@szahedian Tagging this PR for review whenever you get a chance

snairdesai commented 2 years ago

@szahedian Just wanted to tag this PR again for review whenever you get a chance. In the meantime, I'll go ahead and close out this issue with a summary comment. Thanks again!

jc-cisneros commented 2 years ago

Summary

In this issue (#18), we (@jc-cisneros and @snairdesai) completed the following tasks:

  1. Made revisions to the Sherlock practice task instructions in lab_manual.
  2. Per https://github.com/gslab-econ/ra-manual/issues/18#issuecomment-1207484025: created an example markdown file (setup/dependencies.md) which provides non-conda users with instructions on how to manually install dependencies, both locally and on Sherlock.
  3. Per https://github.com/gslab-econ/ra-manual/issues/18#issuecomment-1208390283: revised setup/check_setup.py to (a) include a warning message and installation instructions for non-conda users, and (b) to robustly check if the user has installed all relevant package dependencies. The associated PR for 2. and 3. has been created in #61 in gentzkow/template. The relevant changes from 1. will be made directly in lab_manual.