Closed jc-cisneros closed 2 years ago
@gentzkow, @anacarolinapq, @snairdesai let me know what you think. Also @anacarolinapq could you please tag other lab members who might be trying out the Sherlock Extension task?
I think the purpose of the task is to give lab members experience with using Sherlock within their workflow. Thus, I think the current task should be modified to fit into the GitHub workflow. That would entail creating a new issue, creating a new branch, cloning the branch on Sherlock, editing a script in a module, submitting a job on Sherlock, committing the change, pushing it to the branch, and creating the pull request.
My suggestion for the extra task would be to cluster the standard errors at the county level on both regressions and adding a note on the tables mentioning the fixed effects the model uses and that the std. errors are clustered at the county level.
Below you can find my comments for each bullet on Sherlock Extension
—————————————————————
This is helpful and clear
—————————————————————
Some comments on the “Installing and loading software” subsection on the referenced RA Manual section:
/home/users/<user id>
directory. Maybe for software that will be used by all lab members it would make sense to do it on a shared directory.It might be also useful to explicitly mention that Sherlock has a Linux OS and to follow the installation instructions from the relevant Linux version of the software the user wants to install. Run hostnamectl
to see details of the OS
—————————————————————
“Setting up GitHub” was to me the most confusing instruction as there is no guidance on the cited documentation. In essence, what “setting up GitHub” means is to have authorization to perform git commands on your repositories (repositories for ongoing projects are usually private). To do this, it would be useful to point lab members to this page in the GitHub Docs.
—————————————————————
If the lab member successfully connected to GitHub (use ssh -T git@github.com
to test ssh connection to GitHub), then this simply requires running git clone <repository SSH link>
on the $OAK/<userid>
directory.
—————————————————————
The first step is to create a config_user.yaml
file in the root directory. To use the config_user.yaml
template, you can run the following command from the setup folder:
cp config_user_template.yaml /oak/stanford/groups/gentzkow/<user id>/<repo directory>/config_user.yaml
Usually this step can be skipped if we plan to use the default file created by fetching the gslab_make
submodule files. Nevertheless, fetching the gslab_make
submodules files did not create a config_user.yaml
file for me. We should confirm if this happens to other lab members and try to find out why.
The second step is to install miniconda and oracle-idk. As mentioned above, Homebrew is not available on Sherlock. For practice sake, lab users could install both of these packages in their home/users/<user id>
directory in Sherlock.
—————————————————————
Before this bullet, we could include checking if the setup process was done correctly as a step. If you load all the required modules in Sherlock, running check_setup.py
will give you a “SUCCESS! Setup complete.” message. After that is done, running the data module is done exactly as in the previous task (i.e. running the make.py
script on the data module).
—————————————————————
Here I would add a reference to the Research clusters section in the RA Manual. I would even go further to make everyone’s life easier and provide one (or several) bash script templates for use in the lab. The submit.sh
bash script I used for the task was:
#!/bin/bash
#
#SBATCH --job-name=run_all
#SBATCH --time=2:00:00
#SBATCH --mem=50GB
#SBATCH --partition=gentzkow
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<sunet id>@stanford.edu
#SBATCH --ntasks=1
ml python/3.6.1
srun python run_all.py
—————————————————————
Additional comments
If we decide to include the extra task (clustering std. errors at county level), it would be useful to add guidance on SSHFS. I provided some info related to SSHFS below.
The task could be done directly on the terminal (i.e. editing the Python script using nano), using SSHFS to edit the script on your preferred IDE, or using the longer process described below:
1) Pull the issue branch to your local machine 2) Edit the python script on an IDE in your local machine 3) Commit and push the changes on the script 4) Pull the modified issue branch to Sherlock 5) Submit a job to run the full repository 6) Commit and push the changes
Accessing files from Sherlock on your local machine (i.e. to edit scripts using Visual Studio, Stata, or RStudio)
Check Sherlock SSHFS Documentation.
Things to keep in mind:
$ mkdir ~/sherlock_oak
(this creates an empty directory in your local Home directory)
$ sshfs <sunet id>@login.sherlock.stanford.edu:/oak/stanford/groups/gentzkow ~/sherlock_oak
(this mounts the $OAK folder in Sherlock to your empty sherlock_oak folder)
$ umount ~/sherlock_oak
). If macOS goes to sleep, you can get your IP blacklistedFollowing up with a few additional comments.
—————————————————————
- Set up your personal directory under $OAK. You can review the Research Cluster section of the ra-manual if needed.
module spider
and module purge
. Both are covered in Sherlock documentation. —————————————————————
git clone
, git remote -v
, etc.). As @jc-cisneros the instructions could specify this, as we originally thought we were attempting to integrate GitHub as an entire platform. RClone does not support GitHub integration in this way.—————————————————————
- Set up the repository following the instructions in the ReadMe.
.sh
and .tar
, respectively.cd
command in terminal).curl
module with module spider curl
, and ensure it is up to date by specifying ml curl <latest version>
.curl -O <source file link>
and tar zxf <source file link>
, where "<source file link
>" refers to the installation file scripts pulled on your local computer (for miniconda, this may be a path rather than a raw file).
curl
is to simply input: wget <source file link>
..sh
(miniconda) and .tar
(oracle-jdk) files to initialize the softwares.—————————————————————
- Submit a job to run the full repository, following the instructions here.
Thanks @jc-cisneros @snairdesai for all the great work here. Having your eyes on how to improve this task is very helpful. I'm happy to defer to you on the details, but the changes you outline above look broadly fine.
One note: Let's keep a clear distinction between instructions that are specific to the practice task and instructions that apply more broadly to setting up and using Sherlock. The latter should live in an appropriate place in the RA manual separate from the practice task. The practice task instructions can then link to them.
@jc-cisneros and I made a revised Markdown which can be integrated into the RA manual if helpful. We edited both the Practice Task section and the Research Clusters section. You can find our script here.
Tagging here for edits/feedback: @gentzkow @anacarolinapq.
Thanks @snairdesai. This looks good to me on a quick review.
One thing to flag for further discussion: We've run into various points at which the use of Conda for the template causes friction. The whole point of using Conda is to reduce friction. Should we think about making the template more flexible so using Conda is optional rather than required?
Also, can you remind me why we need oracle-jdk?
@gentzkow
(1) OK, I agree about local use. The question is whether we could set up the template so you have the option not to use Conda. I think that might only require giving instructions for how to install the Python and R dependencies listed in setup/conda_env.yaml
by hand.
(2) I meant in general. I just forget what we're using that for in the template.
@gentzkow @jc-cisneros @anacarolinapq
In response to (1) relating to Conda, we can make Conda optional for users. There are a couple of paths here:
setup/conda_env.yaml
by hand.setup/conda_env.yaml
.The first option comes with two key drawbacks. First, we have to change the instructions in the README every time a dependency is changed. This could lead to some errors on our side. Also, the burden on the user has increased. We would have to hope they install each dependency correctly, conditional on the instructions being up to date.
The second option might be more robust. The script can first check the user's local application versions (i.e., for Python and R), and compare these to the dependency versions listed in setup/conda_env.yaml
. If the versions do not match, or the user has not installed a given software, we can break setup and prompt the user to do so. If the versions do match, we can run a script which installs all the package dependencies in setup/conda_env.yaml
directly through pip and CRAN.
@jc-cisneros and I have built a basic script which attempts to do this. You can find it in the last code chunk here. We have also drafted revised template instructions to provide this functionality for non-conda users in the Colab. We can create an issue and move these to GitHub if helpful. The script currently runs successfully with template, notifying the user that they do not have a conda environment active on their local computer.
There are a few drawbacks/comments here:
conda_env.yaml
files across our projects, the specific version numbers of packages are not listed. This will create issues if users do not use conda. For example, in /gentzkow/template
, the linearmodels
package does not have a version listed. The script we have created above will install the latest version if no version number is specified (in our case, this was 4.27). However, conda may be reading in an older version, even if this is not listed in conda_env.yaml
(in our case, this was 4.24). The template will break if these versions conflict. To specify versions for packages, we need to use package==<version number>
in conda_env.yaml
. The check_local_setup.py
script will take care of the rest. If we go with option 1 (instructions for users), we still need to explicitly specify version numbers in conda_env.yaml
.check_setup.py
, but as of now it is a separate file the user will need to run in the terminal, from the root directory of template (the user will require Python to do this, so we have added this to required applications at the top of template). The command is python check_local_setup.py
. /setup/setup_r.r
for any packages not on CRAN, and /setup/download_stata_ado.do
for STATA installations. However, conda users have to do this as well. conda-forge
rather than pip install
. It's been difficult for us to think of these use cases, though. gslab_make
(although we can drop these, as they primarily relate to error message formatting). There was a decision in the past (in Issue #44 in gentzkow/template
) to move conda activate
below check_setup.py
. This still makes sense for check_local_setup.py
, but we may want to run git submodule init
and git submodule update
prior to activating conda, to ensure gslab_make
is properly initialized. In response to (2) relating to oracle-jdk, we have not had much luck determining why oracle is needed to run template. When we run the template with conda, but without installing or initializing oracle jdk, everything works as expected. @anacarolinapq and @szahedian were also unsure why oracle-jdk is needed. Here is the earliest commit I could find referencing oracle-jdk in the template (discussed in Issue #30 in gentzkow/template
). There don't seem to be any indications of why it is a requirement in the thread.
Our best guess is that oracle-jdk might make the conda setup process slightly faster, but it can likely be dropped.
In terms of next steps, @jc-cisneros and I are now trying to see if it is possible to integrate our build script into Sherlock, since Conda is more likely to cause issues there.
Thanks @snairdesai.
My initial reaction is to prefer the first option (manual install) to the second (script). Here's my reasoning:
We expect the non-Conda use case to be relatively rare. Right now we're thinking it would apply to (i) install on Sherlock and other cluster environments and (ii) replication archives. For (i), we expect users will be pretty sophisticated and be able to handle the manual install and debugging of any version conflicts. For (ii), we hope to use Docker containers to control versions.
A script like this is great when it works but can also create an extra failure point to have to debug.
I definitely agree that we don't want to create instructions that need to be updated every time we change conda_env.yaml
. But I'd think we could write the instructions in a generic enough way that that shouldn't usually be necessary. I.e., we can give general instructions for installing Python and R components that will apply even if new packages are added to those lists.
Let me know if that sounds right to you or if you think we should discuss further. You all are closer to the details than I am so my intuition may well be off.
On oracle-jdk, I'd vote we remove it from the template. We can revisit if we discover doing so causes problems.
@gentzkow
Understood, we were also thinking through the different user groups locally (ii) vs. in Sherlock (i), and how lab members would require less instruction in (i). If Docker will be implemented in the local case (ii), then your point on debugging also makes sense.
@jc-cisneros and I will update the template draft linked above to reflect this (including removing oracle-jdk). Happy to also draft generic instructions for a project as a test case if that would be useful.
We might consider slightly revising the message printed to the user in the case that check_setup.py
fails (i.e., when the user both doesn't use conda, and doesn't properly install the package dependencies outlined in general instructions). We could handle the non-conda user case with something like: "Alternatively if you are not using conda
, please ensure you have installed all required dependencies."
Also quick follow up to above, is there a reason we don't list the specific version numbers of packages in conda_env.yaml
? Our guess is this might create some issues even if we go with the manual approach (unless Docker can resolve this).
Great. Thanks.
I like the idea of revising the error message.
I'd also be in favor of including version numbers for all packages.
@gentzkow We (@jc-cisneros and I) have finished working through three important deliverables here:
1. Rewriting the instructions for the Sherlock Extension Practice Task.
2. Per this comment: creating an example markdown file (setup/dependencies.md
) which provides non-conda users with instructions on how to manually install dependencies, both locally and on Sherlock.
3. Per this comment: revising setup/check_setup.py
to (a) include a warning message for non-conda users, and (b) to robustly check if the user has installed all relevant package dependencies.
If 1. looks good to you (text is attached below), I will go ahead and make these edits directly in gslab-econ/ra-manual
. We have committed the file additions in 2. and 3. in a new branch in gentzkow/template
(issue18_ramanual_dependencysolves
) in commit b3ed716
, issued a new PR in template
(#61) and assigned @szahedian as a reviewer. Following these checks, I will close this issue with a summary comment.
$OAK
, $HOME
, $GROUP_HOME
, $SCRATCH
, and $GROUP_SCRATCH
) with the sh_part
command. If you do not have access to $OAK
, ask a lab member for access.$OAK
. You can review the Research Cluster section of the ra-manual if needed. template
repository to your personal directory in $OAK
.miniconda
, see the instructions in the Research Cluster section (this installation process is distinct from the standard template
setup). miniconda
, you should not load Python
or R
locally in Sherlock, as this will create conflicts with your conda
environment.template
repository is operating as expected, run the data
module within your terminal. You should see the standard output appear in your $OAK
directory.
module spider
and ml <application/version>
commands. For more on this, see Setting up the environment in the Research Cluster section.analysis/code/analyze_data.py
to cluster standard errors at the county-level.
paper_slides/code/paper.lyx
. Caption the table mentioning the two-way fixed effects the model uses, and that the std. errors are clustered at the county level.Thanks @snairdesai. This sounds great!
Just a reminder that this task should be lower priority than any ongoing work for FB2020.
@szahedian Tagging this PR for review whenever you get a chance
@szahedian Just wanted to tag this PR again for review whenever you get a chance. In the meantime, I'll go ahead and close out this issue with a summary comment. Thanks again!
In this issue (#18), we (@jc-cisneros and @snairdesai) completed the following tasks:
lab_manual
.setup/dependencies.md
) which provides non-conda users with instructions on how to manually install dependencies, both locally and on Sherlock.setup/check_setup.py
to (a) include a warning message and installation instructions for non-conda users, and (b) to robustly check if the user has installed all relevant package dependencies.
The associated PR for 2. and 3. has been created in #61 in gentzkow/template
. The relevant changes from 1. will be made directly in lab_manual
.
The goal of this issue is to capture feedback from lab members testing the Sherlock Extension on the Practice Task.