
SCF Setup #52

Closed. ijyliu closed this 3 months ago.

ijyliu commented 5 months ago

Instructions for getting this repo set up on the SCF (Berkeley's Statistical Computing Facility) and gaining access to parallel compute and GPUs.

Request SCF Account

Fill out https://scf.berkeley.edu/account. Approval takes 1-2 business days.

Login

ssh <username>@arwen.berkeley.edu

Repo Setup

First, create an ssh key:

https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent

Add it to GitHub:

https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account?platform=linux
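For reference, a minimal version of those steps (run on the SCF; the key type and email below are just placeholders following the GitHub docs) looks roughly like this:

# Generate a new key pair, accepting the default file location
ssh-keygen -t ed25519 -C "your_email@berkeley.edu"
# Start the ssh agent and add the key to it
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
# Print the public key so you can paste it into GitHub (Settings -> SSH and GPG keys)
cat ~/.ssh/id_ed25519.pub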

Create a directory to put the repo in so that paths work correctly:

mkdir ~/repo
cd ~/repo

Clone the repo via ssh

git clone git@github.com:current12/Stat-222-Project.git

Set up Conda environment

cd ~/repo/Stat-222-Project
mamba env create -f environment_scf.yml

(mamba is a faster drop-in replacement for conda)

Setting up .bashrc for convenient access to the repo and python environment

If you don't know vim or don't want to use it to code on the SCF, I suggest setting up the Remote Development extension in VS Code and editing text files that way.
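For example, a host entry in ~/.ssh/config on your local machine (a hypothetical alias; substitute your own username) lets the Remote - SSH part of that extension pack connect to the server by name:

# ~/.ssh/config on your local laptop/desktop
Host scf
    HostName arwen.berkeley.edu
    User <your username>

You can then connect to the "scf" host from VS Code and open ~/repo/Stat-222-Project.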

vim ~/.bashrc

Scroll to the bottom using the down arrow key, then press i to enter insert mode.

Add the following:

cd ~/repo/Stat-222-Project
mamba activate capstone_scf

Hit Escape, then type :wq and press Enter to save and exit.

Now when you start a new bash session on SCF, you will be initialized in ~/repo/Stat-222-Project with the conda environment activated.

Running things on SCF

Moving data to and from the SCF seems annoying (though maybe we could use the Box API), so I've started keeping SMALL and CLEAN parquet files in this repo. Please minimize the amount of data stored on GitHub: whenever you make new files that contain, for example, NLP features, save only the necessary variables (the features themselves) and the appropriate unique keys (ticker and fixed_quarter_date). Do not store additional copies of transcripts.

I think it's possible to run Jupyter notebooks in batch mode, but for now I've been sticking to writing .py scripts for anything that needs to run on the cluster.
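If you do want to try a notebook in batch mode, nbconvert can execute one from the command line; a sketch (untested on the SCF, notebook name is a placeholder):

# Run all cells non-interactively and save the outputs back into the notebook file
jupyter nbconvert --to notebook --execute --inplace My_Notebook.ipynb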

SCF uses the SLURM scheduler.

Sbatch Scripts

To request compute, you need to write .sh submission scripts like the following:

(CPU) https://github.com/current12/Stat-222-Project/blob/main/Code/Exploratory%20Data%20Analysis/All%20Data/EDA_NER_No_GPU.sh

(GPU) https://github.com/current12/Stat-222-Project/blob/main/Code/Exploratory%20Data%20Analysis/All%20Data/EDA_NER.sh
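For orientation, here is a rough sketch of what such a script contains (the linked scripts are the real reference; the job name, resource numbers, and GPU partition name below are placeholders you would adjust):

#!/bin/bash
#SBATCH --job-name=my_job            # name shown in squeue
#SBATCH --cpus-per-task=4            # CPU cores allocated to the task
#SBATCH --time=01:00:00              # wall-clock time limit
#SBATCH --output=my_job_%j.out       # log file; %j expands to the job ID
# For a GPU job, also request a GPU (uncomment and adjust the partition):
##SBATCH --partition=gpu
##SBATCH --gres=gpu:1

python your_script.py                # the .py script you want to run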

Submit job

sbatch <.sh script name>

View current jobs

squeue --user=<your username>

(Leave out the --user flag and username if you want to view all running jobs from everyone on the cluster.)

Cancel a job

After running squeue, get the JobID

scancel <JobID>

Or you can run this to cancel all your jobs

scancel --user=<your username>

Parallelization through Slurm Arrays

The general idea is to write a .py script with a line at the top that accepts an integer input argument (an "array" index) from the SLURM .sh script. You could then, for example, have pandas load a particular chunk of rows from a data file based on that index. SLURM launches a separate instance of the Python script for each array index (each task getting the cores you request via the cpus-per-task setting) and runs it with that array number as input.

https://gist.github.com/ijyliu/b42ff6fd4c05321f47167b6bc678cd3a
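As a sketch of the submission side (the gist above is the actual reference; script and file names here are placeholders), the .sh script sets an --array range and SLURM passes each task its index through the SLURM_ARRAY_TASK_ID environment variable:

#!/bin/bash
#SBATCH --job-name=array_example     # placeholder job name
#SBATCH --array=0-9                  # launch 10 tasks numbered 0 through 9
#SBATCH --cpus-per-task=1            # cores per array task
#SBATCH --output=array_%A_%a.out     # %A = overall job ID, %a = array index

# Each task runs the same script with its own index as the argument;
# the Python script can read it with int(sys.argv[1]) and load only its chunk of rows.
python process_chunk.py "$SLURM_ARRAY_TASK_ID"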

General Info

https://statistics.berkeley.edu/computing/getting-started and the links on the right hand side