HertieDataScience / SyllabusAndLectures

Hertie School of Governance Introduction to Collaborative Social Science Data Analysis
MIT License

Quant challenges and issues for thesis #59

Closed laurencehendry closed 8 years ago

laurencehendry commented 8 years ago

Dear classmates,

My research partner Lukas and I have been trying to manage a large number of datasets needed for quantitative analysis for our thesis. We have begun experimenting with the Microsoft Excel Macros tool, since we did not have much success with R:

https://github.com/laurencehendry/SRISK_Thesis/blob/master/Gather

Challenges: For 413 datasets (all bank data; all with identical variables/columns; all in .CSV format) we are trying to:

  1. insert a column to the left for each individual dataset
  2. for each dataset, fill the newly inserted column with that dataset's filename, repeated down to the last row (creating a new 'dataset name' variable)
  3. append all these datasets into one (very long) dataset

Issues:

  1. any advice for running an automated command in R for steps (1) and (2) above?
  2. for step (3), the "rbind", "cbind" and "bind_rows" packages were returned as 'not available for R 3.2.2' when I tried to download and install them. I even tried re-installing R and RStudio. I don't know what the problem could be, and would like to know if anyone else has had a similar problem with these packages?
  3. I have had experience with the handy "VLOOKUP" function in Excel (given a 'unique identifier' for a row, such as a bank name and date, it returns the value of a sought-after variable from another dataset; it can be helpful for combining specific information from multiple sources). Is there an easy substitute for this command in R?

Many, many thanks in advance for any advice or consideration, Laurence

christophergandrud commented 8 years ago

Hi

Is there any way you could link to some of the underlying data you are trying to merge?

One quick thing: rbind and cbind are functions in base R, so you don't need to install anything. bind_rows is a function in dplyr. So just install that package.
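To illustrate the distinction, a minimal sketch (the two data frames here are made up for demonstration):

```r
# rbind ships with base R: nothing to install
a <- data.frame(bank = 'A', value = 1)
b <- data.frame(bank = 'B', value = 2)
rbind(a, b)  # stacks the two rows into one data frame

# bind_rows lives in the dplyr package, which must be installed once
install.packages('dplyr')
library(dplyr)
bind_rows(a, b)  # like rbind, but more forgiving about mismatched columns
```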

laurencehendry commented 8 years ago

Hey

Thanks (also for replying on a Sunday!) The underlying data can be found here: https://drive.google.com/folderview?id=0B9VhVEv3LS_gemVMQXFQX21vNTg&usp=sharing

I've updated my script and also included a friend's code that I was using as an example.

christophergandrud commented 8 years ago

Could you possibly upload these in one zipped folder?


laurencehendry commented 8 years ago

Zip file: https://drive.google.com/file/d/0B9VhVEv3LS_gWkdjTEhMc2gxbVk/view?usp=sharing

christophergandrud commented 8 years ago

Try this code for binding all of your data together. (I'm not sure what the variable names are, so you should put those in after running this.)

library(rio)
library(dplyr)
library(DataCombine)

# Set directory to where the files are stored
setwd('~/Desktop/European_Firms/')

# Find a list of all files in the directory
all_files <- list.files()

# Create a NULL object to combine the individual files into
combined <- NULL

for (i in all_files) {
    # Load and combine only if the file is a csv
    if (tools::file_ext(i) == 'csv') {
        # Read file
        temp <- import(i)

        # Create file ID by stripping the .csv extension from the file name,
        # then move it to the front ('\\.csv$' anchors the match to the extension)
        file_id <- sub(pattern = '\\.csv$', replacement = '', i)
        message(file_id)
        temp$file_id <- file_id
        temp <- MoveFront(temp, 'file_id')

        # Bind to the main data frame
        combined <- bind_rows(combined, temp)

    } else {
        message('-- Not CSV --')
    }
}
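Once the loop has run, a few quick checks can confirm it worked (a sketch; the output file name `combined_banks.csv` is just an example):

```r
dim(combined)            # total rows and columns across all 413 files
head(combined$file_id)   # confirm the file-name variable was created
export(combined, 'combined_banks.csv')  # rio's export writes the result back out
```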

Can you give an example of what you want to do for the third point?

laurencehendry commented 8 years ago

Many, many thanks!

Re the third point: included in the .zip file is an Excel sheet ("Unique Identifiers") that contains variables on bank location and trading status. With VLOOKUP in Excel (unavailable to us, since the dataset will have 1m+ values) I would have run a function to look up the bank name in the Unique Identifiers sheet and return the country name and bank status in adjacent columns.

We also thought the function would make adding future variables simpler/more efficient (e.g. by adding bank characteristics to the Unique Identifiers sheet, then running the VLOOKUP function again on our master dataset).


christophergandrud commented 8 years ago

Why not just merge this information into the main data frame based on the file_id?
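A sketch of what that merge could look like, assuming the Unique Identifiers sheet is exported to a CSV with a `file_id` column matching the bank file names (the file name `unique_identifiers.csv` and the column names are assumptions):

```r
unique_ids <- import('unique_identifiers.csv')

# Base R: keep every row of combined, add the country/status columns
combined <- merge(combined, unique_ids, by = 'file_id', all.x = TRUE)

# Or, equivalently, with dplyr:
# combined <- left_join(combined, unique_ids, by = 'file_id')
```

A left join (`all.x = TRUE`) is the closest analogue to VLOOKUP: rows without a match in the lookup table are kept and filled with NA rather than dropped.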

laurencehendry commented 8 years ago

Gah, still thinking like an Excel user. I will use the merge function. Thanks again

ChristopherCosler commented 8 years ago

I had to do a very similar thing for my Master's thesis and solved it the following way. It works as long as you need all the CSVs in a folder.

temp <- list.files(pattern = "*.csv")
reviews <- do.call("rbind", lapply(temp, function(x) read.csv(x, stringsAsFactors = FALSE)))
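An untested variant of the same approach that also records each file's name, mirroring steps (1) and (2) from the original question:

```r
temp <- list.files(pattern = "*.csv")
reviews <- do.call("rbind", lapply(temp, function(x) {
    df <- read.csv(x, stringsAsFactors = FALSE)
    cbind(file_id = sub("\\.csv$", "", x), df)  # prepend the file name as a column
}))
```

Note that base `rbind` requires every file to have identical columns, which holds here since all 413 bank files share the same variables; `dplyr::bind_rows` would tolerate differences.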