Review Sherlock Extension on Practice Task

jc-cisneros commented 2 years ago

The goal of this issue is to capture feedback from lab members testing the Sherlock Extension on the Practice Task.

jc-cisneros commented 2 years ago

@gentzkow, @anacarolinapq, @snairdesai let me know what you think. Also @anacarolinapq could you please tag other lab members who might be trying out the Sherlock Extension task?

I think the purpose of the task is to give lab members experience with using Sherlock within their workflow. Thus, I think the current task should be modified to fit into the GitHub workflow. That would entail creating a new issue, creating a new branch, cloning the branch on Sherlock, editing a script in a module, submitting a job on Sherlock, committing the change, pushing it to the branch, and creating the pull request.

My suggestion for the extra task would be to cluster the standard errors at the county level on both regressions and adding a note on the tables mentioning the fixed effects the model uses and that the std. errors are clustered at the county level.

Below you can find my comments for each bullet on Sherlock Extension

—————————————————————

Log on to Sherlock following the instructions here.

This is helpful and clear

—————————————————————

Set up your personal directory under $OAK. You can review the Research Cluster section of the ra-manual if needed.

Some comments on the “Installing and loading software” subsection on the referenced RA Manual section:

Homebrew is not available on Sherlock (sudo and apt are also not available)
From the instructions it is not clear in what directory we are supposed to download and install additional software. For personal use or tests, it could make sense to do in the /home/users/<user id> directory. Maybe for software that will be used by all lab members it would make sense to do it on a shared directory.
It might be also useful to explicitly mention that Sherlock has a Linux OS and to follow the installation instructions from the relevant Linux version of the software the user wants to install. Run hostnamectl to see details of the OS

—————————————————————

Set up Dropbox, GitHub and Rclone on Sherlock. You can follow the instructions here and here.

“Setting up GitHub” was to me the most confusing instruction as there is no guidance on the cited documentation. In essence, what “setting up GitHub” means is to have authorization to perform git commands on your repositories (repositories for ongoing projects are usually private). To do this, it would be useful to point lab members to this page in the GitHub Docs.

—————————————————————

Clone the template repository under your perosnal (TYPO) directory.

If the lab member successfully connected to GitHub (use ssh -T git@github.com to test ssh connection to GitHub), then this simply requires running git clone <repository SSH link> on the $OAK/<userid> directory.

—————————————————————

Set up the repository following the instructions in the ReadMe.

The first step is to create a config_user.yaml file in the root directory. To use the config_user.yaml template, you can run the following command from the setup folder: cp config_user_template.yaml /oak/stanford/groups/gentzkow/<user id>/<repo directory>/config_user.yaml

Usually this step can be skipped if we plan to use the default file created by fetching the gslab_make submodule files. Nevertheless, fetching the gslab_make submodules files did not create a config_user.yaml file for me. We should confirm if this happens to other lab members and try to find out why.

The second step is to install miniconda and oracle-idk. As mentioned above, Homebrew is not available on Sherlock. For practice sake, lab users could install both of these packages in their home/users/<user id> directory in Sherlock.

—————————————————————

Run the data module within your terminal.

Before this bullet, we could include checking if the setup process was done correctly as a step. If you load all the required modules in Sherlock, running check_setup.py will give you a “SUCCESS! Setup complete.” message. After that is done, running the data module is done exactly as in the previous task (i.e. running the make.py script on the data module).

—————————————————————

Submit a job to run the full repository, following the instructions here.

Here I would add a reference to the Research clusters section in the RA Manual. I would even go further to make everyone’s life easier and provide one (or several) bash script templates for use in the lab. The submit.sh bash script I used for the task was:

#!/bin/bash
#
#SBATCH --job-name=run_all
#SBATCH --time=2:00:00
#SBATCH --mem=50GB
#SBATCH --partition=gentzkow
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<sunet id>@stanford.edu
#SBATCH --ntasks=1

ml python/3.6.1
srun python run_all.py

—————————————————————

Additional comments

If we decide to include the extra task (clustering std. errors at county level), it would be useful to add guidance on SSHFS. I provided some info related to SSHFS below.

The task could be done directly on the terminal (i.e. editing the Python script using nano), using SSHFS to edit the script on your preferred IDE, or using the longer process described below:

1) Pull the issue branch to your local machine 2) Edit the python script on an IDE in your local machine 3) Commit and push the changes on the script 4) Pull the modified issue branch to Sherlock 5) Submit a job to run the full repository 6) Commit and push the changes

Accessing files from Sherlock on your local machine (i.e. to edit scripts using Visual Studio, Stata, or RStudio)

Check Sherlock SSHFS Documentation.

Things to keep in mind:
- The sample code in the documentation is a way to connect you with the home directory on Sherlock. I adapted the code in the following way to connect to $OAK:
$ mkdir ~/sherlock_oak (this creates an empty directory in your local Home directory) $ sshfs <sunet id>@login.sherlock.stanford.edu:/oak/stanford/groups/gentzkow ~/sherlock_oak(this mounts the $OAK folder in Sherlock to your empty sherlock_oak folder)
- ALWAYS UNMOUNT AFTER YOU ARE DONE EDITING ($ umount ~/sherlock_oak). If macOS goes to sleep, you can get your IP blacklisted

snairdesai commented 2 years ago

Following up with a few additional comments.

—————————————————————

Set up your personal directory under $OAK. You can review the Research Cluster section of the ra-manual if needed.

There are some additional commands we might want to mention in the RA manual which @jc-cisneros and I found helpful during setup. These include: module spider and module purge. Both are covered in Sherlock documentation.
We also need to consider if we want to keep the personal directories of RAs who have already left the lab, and if not, where to store their work (this can be done through the Dropbox Integration).

—————————————————————

Set up Dropbox, GitHub and Rclone on Sherlock. You can follow the instructions here and here.

Clarifying that "GitHub setup" here is simply creating a new SSH key which links to Sherlock, and then proceeding to run standard git commands in the terminal (git clone, git remote -v, etc.). As @jc-cisneros the instructions could specify this, as we originally thought we were attempting to integrate GitHub as an entire platform. RClone does not support GitHub integration in this way.

—————————————————————

Set up the repository following the instructions in the ReadMe.

All of the added instructions are correct, it is important to note that you may need to restart the terminal at various iterations during the initial setup to allow Sherlock to process changes (particularly when loading in Miniconda).
The steps to download miniconda and oracle-jdk are not as intuitive. Sherlock support suggested installing this directly on Sherlock might be problematic, and conflict with existing modules. Here is the process we followed for each software:
- Visit the websites hosting source code for miniconda and oracle-jdk.
- Copy the installation code scripts (which are the addresses to the installation files) to your local computers. The suffixes should be .sh and .tar, respectively.
- On Sherlock: navigate to your home user repository (you can simply use the cd command in terminal).
- Load the curl module with module spider curl, and ensure it is up to date by specifying ml curl <latest version>.
- Input the command curl -O <source file link> and tar zxf <source file link>, where "<source file link>" refers to the installation file scripts pulled on your local computer (for miniconda, this may be a path rather than a raw file).
  - For miniconda, an alternative approach to utilizing curl is to simply input: wget <source file link>.
- Follow the prompts in the .sh (miniconda) and .tar (oracle-jdk) files to initialize the softwares.

—————————————————————

Submit a job to run the full repository, following the instructions here.

Agree with @jc-cisneros here, a sample batch script would be very helpful.

gentzkow commented 2 years ago

Thanks @jc-cisneros @snairdesai for all the great work here. Having your eyes on how to improve this task is very helpful. I'm happy to defer to you on the details, but the changes you outline above look broadly fine.

One note: Let's keep a clear distinction between instructions that are specific to the practice task and instructions that apply more broadly to setting up and using Sherlock. The latter should live in an appropriate place in the RA manual separate from the practice task. The practice task instructions can then link to them.

snairdesai commented 2 years ago

@jc-cisneros and I made a revised Markdown which can be integrated into the RA manual if helpful. We edited both the Practice Task section and the Research Clusters section. You can find our script here.

Tagging here for edits/feedback: @gentzkow @anacarolinapq.

gentzkow commented 2 years ago

Thanks @snairdesai. This looks good to me on a quick review.

One thing to flag for further discussion: We've run into various points at which the use of Conda for the template causes friction. The whole point of using Conda is to reduce friction. Should we think about making the template more flexible so using Conda is optional rather than required?

Also, can you remind me why we need oracle-jdk?

snairdesai commented 2 years ago

@gentzkow

Regarding the point on Conda - Sherlock support suggested not to install Conda in parallel with the local system module for Python, as this would create conflicts. Their suggestion was either to use Conda independently (without loading in Python), or simply stick with the local Sherlock packages. The latter could make sense, because everyone using Sherlock is generally using the same version of packages (for Python, 3.6.1). I think Conda makes sense for local (non-Sherlock) use, though, to ensure there are no version conflicts across users.
@jc-cisneros and I are looking into why oracle-jdk is needed - just to clarify, are you asking in general, or specifically for Sherlock?

gentzkow commented 2 years ago

(1) OK, I agree about local use. The question is whether we could set up the template so you have the option not to use Conda. I think that might only require giving instructions for how to install the Python and R dependencies listed in setup/conda_env.yaml by hand.

(2) I meant in general. I just forget what we're using that for in the template.

snairdesai commented 2 years ago

@gentzkow @jc-cisneros @anacarolinapq

In response to (1) relating to Conda, we can make Conda optional for users. There are a couple of paths here:

As you noted, we can give instructions for how users can install the dependencies in setup/conda_env.yaml by hand.
We can write a script for non-conda users, which (a) catches version conflicts and (b) conditional on the user having correct application versions, installs all relevant package dependencies in setup/conda_env.yaml.

The first option comes with two key drawbacks. First, we have to change the instructions in the README every time a dependency is changed. This could lead to some errors on our side. Also, the burden on the user has increased. We would have to hope they install each dependency correctly, conditional on the instructions being up to date.

The second option might be more robust. The script can first check the user's local application versions (i.e., for Python and R), and compare these to the dependency versions listed in setup/conda_env.yaml. If the versions do not match, or the user has not installed a given software, we can break setup and prompt the user to do so. If the versions do match, we can run a script which installs all the package dependencies in setup/conda_env.yaml directly through pip and CRAN.

@jc-cisneros and I have built a basic script which attempts to do this. You can find it in the last code chunk here. We have also drafted revised template instructions to provide this functionality for non-conda users in the Colab. We can create an issue and move these to GitHub if helpful. The script currently runs successfully with template, notifying the user that they do not have a conda environment active on their local computer.

There are a few drawbacks/comments here:

For many of the conda_env.yaml files across our projects, the specific version numbers of packages are not listed. This will create issues if users do not use conda. For example, in /gentzkow/template, the linearmodels package does not have a version listed. The script we have created above will install the latest version if no version number is specified (in our case, this was 4.27). However, conda may be reading in an older version, even if this is not listed in conda_env.yaml (in our case, this was 4.24). The template will break if these versions conflict. To specify versions for packages, we need to use package==<version number> in conda_env.yaml. The check_local_setup.py script will take care of the rest. If we go with option 1 (instructions for users), we still need to explicitly specify version numbers in conda_env.yaml.
We may be able to integrate this code within check_setup.py, but as of now it is a separate file the user will need to run in the terminal, from the root directory of template (the user will require Python to do this, so we have added this to required applications at the top of template). The command is python check_local_setup.py.
The same problems discussed in prior threads relating to GitHub-specific packages apply. We will still need the user to utilize /setup/setup_r.r for any packages not on CRAN, and /setup/download_stata_ado.do for STATA installations. However, conda users have to do this as well.
We might be wary of packages which are specific to conda, and thus can only be installed with conda-forge rather than pip install. It's been difficult for us to think of these use cases, though.
Lastly, we currently pull in some features from gslab_make (although we can drop these, as they primarily relate to error message formatting). There was a decision in the past (in Issue #44 in gentzkow/template) to move conda activate below check_setup.py. This still makes sense for check_local_setup.py, but we may want to run git submodule init and git submodule update prior to activating conda, to ensure gslab_make is properly initialized.

In response to (2) relating to oracle-jdk, we have not had much luck determining why oracle is needed to run template. When we run the template with conda, but without installing or initializing oracle jdk, everything works as expected. @anacarolinapq and @szahedian were also unsure why oracle-jdk is needed. Here is the earliest commit I could find referencing oracle-jdk in the template (discussed in Issue #30 in gentzkow/template). There don't seem to be any indications of why it is a requirement in the thread.

Our best guess is that oracle-jdk might make the conda setup process slightly faster, but it can likely be dropped.

In terms of next steps, @jc-cisneros and I are now trying to see if it is possible to integrate our build script into Sherlock, since Conda is more likely to cause issues there.

gentzkow commented 2 years ago

Thanks @snairdesai.

My initial reaction is to prefer the first option (manual install) to the second (script). Here's my reasoning:

We expect the non-Conda use case to be relatively rare. Right now we're thinking it would apply to (i) install on Sherlock and other cluster environments and (ii) replication archives. For (i), we expect users will be pretty sophisticated and be able to handle the manual install and debugging of any version conflicts. For (ii), we hope to use Docker containers to control versions.
A script like this is great when it works but can also create an extra failure point to have to debug.
I definitely agree that we don't want to create instructions that need to be updated every time we change conda_env.yaml. But I'd think we could write the instructions in a generic enough way that that shouldn't usually be necessary. I.e., we can give general instructions for installing Python and R components that will apply even if new packages are added to those lists.

Let me know if that sounds right to you or if you think we should discuss further. You all are closer to the details than I am so my intuition may well be off.

On oracle-jdk, I'd vote we remove it from the template. We can revisit if we discover doing so causes problems.

snairdesai commented 2 years ago

@gentzkow

Understood, we were also thinking through the different user groups locally (ii) vs. in Sherlock (i), and how lab members would require less instruction in (i). If Docker will be implemented in the local case (ii), then your point on debugging also makes sense.

@jc-cisneros and I will update the template draft linked above to reflect this (including removing oracle-jdk). Happy to also draft generic instructions for a project as a test case if that would be useful.

We might consider slightly revising the message printed to the user in the case that check_setup.py fails (i.e., when the user both doesn't use conda, and doesn't properly install the package dependencies outlined in general instructions). We could handle the non-conda user case with something like: "Alternatively if you are not using conda, please ensure you have installed all required dependencies."

Also quick follow up to above, is there a reason we don't list the specific version numbers of packages in conda_env.yaml? Our guess is this might create some issues even if we go with the manual approach (unless Docker can resolve this).

gentzkow commented 2 years ago

Great. Thanks.

I like the idea of revising the error message.

I'd also be in favor of including version numbers for all packages.

snairdesai commented 2 years ago

For myself:

We are waiting on resolutions to gentzkow/template #56 and #59 prior to updating template above.

snairdesai commented 2 years ago

@gentzkow We (@jc-cisneros and I) have finished working through three important deliverables here:

1. Rewriting the instructions for the Sherlock Extension Practice Task. 2. Per this comment: creating an example markdown file (setup/dependencies.md) which provides non-conda users with instructions on how to manually install dependencies, both locally and on Sherlock. 3. Per this comment: revising setup/check_setup.py to (a) include a warning message for non-conda users, and (b) to robustly check if the user has installed all relevant package dependencies.

If 1. looks good to you (text is attached below), I will go ahead and make these edits directly in gslab-econ/ra-manual. We have committed the file additions in 2. and 3. in a new branch in gentzkow/template (issue18_ramanual_dependencysolves) in commit b3ed716, issued a new PR in template (#61) and assigned @szahedian as a reviewer. Following these checks, I will close this issue with a summary comment.

Practice Task Addition

2) Sherlock Extension

Log on to Sherlock following the instructions here. Ensure you have access to the correct partitions ($OAK, $HOME, $GROUP_HOME, $SCRATCH, and $GROUP_SCRATCH) with the sh_part command. If you do not have access to $OAK, ask a lab member for access.
Set up your personal directory under $OAK. You can review the Research Cluster section of the ra-manual if needed.
Set up Dropbox and Rclone on Sherlock. You can follow the instructions here and here.
Initialize GitHub usage on Sherlock by generating a new SSH key and adding it to Sherlock.
Clone the remote template repository to your personal directory in $OAK.
Set up the repository following the instructions in the ReadMe. To install miniconda, see the instructions in the Research Cluster section (this installation process is distinct from the standard template setup).
Note that if you are using miniconda, you should not load Python or R locally in Sherlock, as this will create conflicts with your conda environment.
To check that the template repository is operating as expected, run the data module within your terminal. You should see the standard output appear in your $OAK directory.
- Note: You will not need to set up command line usage on Sherlock - simply load the relevant applications with the module spider and ml <application/version> commands. For more on this, see Setting up the environment in the Research Cluster section.
Create a new issue following the guidelines on the RA Manual, and modify the code in analysis/code/analyze_data.py to cluster standard errors at the county-level.
- Note: You may either follow the instructions in the Research Cluster section to make these file edits directly in Sherlock, or make the edits locally and pull to your Sherlock directory from a remote Git repository.
Add a new regression table capturing the revised outputs to the paper in paper_slides/code/paper.lyx. Caption the table mentioning the two-way fixed effects the model uses, and that the std. errors are clustered at the county level.
Submit a job to run the full repository with these edits, following the instructions here.

gentzkow commented 2 years ago

Thanks @snairdesai. This sounds great!

Just a reminder that this task should be lower priority than any ongoing work for FB2020.

snairdesai commented 2 years ago

@szahedian Tagging this PR for review whenever you get a chance

snairdesai commented 2 years ago

@szahedian Just wanted to tag this PR again for review whenever you get a chance. In the meantime, I'll go ahead and close out this issue with a summary comment. Thanks again!

jc-cisneros commented 2 years ago

Summary

In this issue (#18), we (@jc-cisneros and @snairdesai) completed the following tasks:

Made revisions to the Sherlock practice task instructions in lab_manual.
Per https://github.com/gslab-econ/ra-manual/issues/18#issuecomment-1207484025: created an example markdown file (setup/dependencies.md) which provides non-conda users with instructions on how to manually install dependencies, both locally and on Sherlock.
Per https://github.com/gslab-econ/ra-manual/issues/18#issuecomment-1208390283: revised setup/check_setup.py to (a) include a warning message and installation instructions for non-conda users, and (b) to robustly check if the user has installed all relevant package dependencies. The associated PR for 2. and 3. has been created in #61 in gentzkow/template. The relevant changes from 1. will be made directly in lab_manual.

gentzkow / lab-manual-archive