Running the code to save the intermediate data uses too much resources

jaepil-choi commented 3 months ago

Hello, I ran the code on my school's WRDS on SAS Studio (web) but WRDS stopped the process because it was using up too much resources (space 1623GB used)

I wanted to save firm-level data that are generated and pre-processed along the way so I changed the manual settings as below:

* Assign scratch and project folder names;
%let scratch_folder = ~/ReplicationCrisis/scratch; 
%let project_folder = ~/ReplicationCrisis/GlobalFactors;
* Set defaults;
%let delete_temp = 0;  * Should temporary files be deleted?;
%let save_csv = 1;     * Should the main data set be save country-by-country in a .csv format?;

However, as I stated above this seems to take up too much resources.

@xXComanderXx , I noticed in the closed issue that you sent the data when someone encountered WRDS issue after confirming that he had an access to the datasets used in this project.

All I wanted to do was to replicate the factor with raw data in Python for practice, so I was hoping if it would be possible for you to send me firm-level data files also if that's possible.

Here's my main.log to prove that my school does indeed have access to all the datasets needed to run this project.

Or better yet, could you tell me what options I should tweak to not use up 1600GB+ storage to run the code successfully with intermediate data files without WRDS killing my process?

mk0417 commented 3 months ago

I do not think you need to change the default setting delete_temp=1 to 0 if you need firm characteristics data. I have no problem to run the code to generate firm characteristics data with default setting.

What you need to change is the path of scratch_folder. The storage of your home directory in WRDS is limited, but you locate your scratch folder under your home directory. You need to set scratch folder under WRDS scratch folder. For example, suppose your institution is abc_university, then the scratch folder will be /scratch/abc_university. You just need to replace abc_university with your institution.

In addition, in the log file, I notice that you do not have access permssion of the CCM link table:

ERROR: User does not have appropriate authorization level for file CRSP.CCMXPF_LNKHIST.DATA.

Hope this can help.

theisij commented 3 months ago

@mk0417 is spot on (and thanks for helping!).

Also, you can download the firm-level data (i.e. the main output from the code in this repo) directly from WRDS's web interface here: https://wrds-www.wharton.upenn.edu/pages/get-data/contributed-data-forms/global-factor-data/

jaepil-choi commented 3 months ago

@mk0417 Thanks for letting me know. I thought the final output only generates factor return data as authors provided on their website so I presumed that firm-level datasets are deleted as the code runs. I thought setting delete_temp=0 would leave the data undeleted.

Also, thanks for pointing out that there is actually a problem with access level for a certain data table. My school said that they have most of CRSP data so I didn't know there were some datasets not covered.

... which leads to my second thanks to @theisij by providing a much easier and faster way to get firm level data the authors provided on WRDS. Had I known this was available for download I wouldn't have had to go through all the pain during the weekends!

Glad that I asked here for help. Bless you.

bkelly-lab / ReplicationCrisis

Running the code to save the intermediate data uses too much resources #6