LilMichelangelo2 / Amanta


test #7

Open LilMichelangelo2 opened 2 months ago

LilMichelangelo2 commented 2 months ago

@autopilot implement a panel regression approach for the following regression. Keep the fixed effects and the clustering:


realised_regression <- feols(Transformed_Value_Added ~ L1_team 
                             + IMR + log(L1_Fund_Age) + L1_Expense_Ratio 
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
                             + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                             data = realised_regression_data, cluster = c("FundId", "Date"))
codeautopilot[bot] commented 2 months ago

Potential solution

The task involves implementing a panel regression with fixed effects and clustered standard errors using the feols function from the fixest package in R. The user has provided the regression formula and the variables to be included in the model. The solution will involve writing R code that correctly specifies the model, including the fixed effects and clustering as requested.

How to implement

The implementation will consist of the following steps in the panel_regression_analysis.R file:

  1. Load the fixest library.
  2. Ensure the dataset realised_regression_data is loaded into the R environment.
  3. Define the regression model using the feols function, including the fixed effects (Fund_Family + TIMExOBJECTIVE) and clustering (cluster = c("FundId", "Date")).
  4. Run the regression and store the results in realised_regression.
  5. Output the regression summary.

Here is the R code to implement the solution:

# Load the necessary library
library(fixest)

# Assuming the dataset 'realised_regression_data' is already loaded or available in the environment.

# Define and run the panel regression model with fixed effects and clustering
realised_regression <- feols(Transformed_Value_Added ~ L1_team 
                             + IMR + log(L1_Fund_Age) + L1_Expense_Ratio 
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
                             + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                             data = realised_regression_data, cluster = c("FundId", "Date"))

# Output the results
summary(realised_regression)

This code should be placed in the panel_regression_analysis.R file, and it will execute the panel regression as per the user's request. The summary function call at the end will display the regression results, including the coefficients, standard errors, and statistical significance.
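
If a compact, publication-style table is preferred over the raw summary output, fixest's etable() can format the same fitted object; a minimal sketch using the object defined above:

# Optional: formatted table of the fitted model (same realised_regression object)
etable(realised_regression)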


Files used for this task:

Changes on panel_regression_analysis.R

To solve the task described in the issue, we need to write R code in the `panel_regression_analysis.R` file to perform the panel regression analysis using the `feols` function from the `fixest` package. Here's a step-by-step plan to implement the regression analysis:

  1. Load the necessary libraries.
  2. Load the dataset `realised_regression_data`.
  3. Define the regression model with fixed effects and clustering as specified by the user.
  4. Run the regression using the `feols` function.
  5. Output the results.

Here's the concrete R code to perform these steps:

```r
# Load the necessary library
library(fixest)

# Assuming the dataset 'realised_regression_data' is already loaded or available in the environment.
# If not, you would need to load the data from a file, for example:
# realised_regression_data <- read.csv("path_to_data_file.csv")

# Define and run the panel regression model with fixed effects and clustering
realised_regression <- feols(Transformed_Value_Added ~ L1_team + IMR + log(L1_Fund_Age) + L1_Expense_Ratio
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
                             + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                             data = realised_regression_data, cluster = c("FundId", "Date"))

# Output the results
summary(realised_regression)
```

This code snippet should be placed in the `panel_regression_analysis.R` file. It will perform the panel regression analysis as requested by the user, including fixed effects for `Fund_Family` and `TIMExOBJECTIVE`, and clustering the standard errors by `FundId` and `Date`. The `summary` function is used to display the results of the regression.
LilMichelangelo2 commented 2 months ago

@autopilot is this a panel regression?

realised_regression <- feols(Transformed_Value_Added ~ L1_team 
                             + IMR + log(L1_Fund_Age) + L1_Expense_Ratio 
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
                             + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                             data = realised_regression_data, cluster = c("FundId", "Date"))
codeautopilot[bot] commented 2 months ago

@autopilot is this a panel regression?

realised_regression <- feols(Transformed_Value_Added ~ L1_team 
                             + IMR + log(L1_Fund_Age) + L1_Expense_Ratio 
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
                             + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                             data = realised_regression_data, cluster = c("FundId", "Date"))

Yes, the code you've provided is for a panel regression. The feols function from the fixest package in R is used to estimate fixed effects models, which are a common approach in panel data analysis. In your model, Fund_Family + TIMExOBJECTIVE are included as fixed effects, and the standard errors are clustered by FundId and Date, which is typical in panel data to account for within-group correlation.
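
For reference, fixest can also build an interacted fixed effect on the fly with the ^ operator. This is only an equivalent reformulation under the assumption that TIMExOBJECTIVE was constructed from separate Time and Objective columns (hypothetical names here); if only the combined column exists, keep TIMExOBJECTIVE as written.

# Sketch: interacting two fixed-effect dimensions directly in the formula
# (assumes Time and Objective exist as separate columns in the data)
realised_regression <- feols(Transformed_Value_Added ~ L1_team + IMR + log(L1_Fund_Age)
                             + L1_Expense_Ratio + L1_Turnover_Ratio + L1_Raw_Return_Volatility
                             + L1_Net_Fund_Flow + L1_Cash_Holdings + L1_Equity_Holdings
                             + L1_Tenure + L1_Female + log(L1_Family_Size)
                             | Fund_Family + Time^Objective,
                             data = realised_regression_data, cluster = c("FundId", "Date"))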

LilMichelangelo2 commented 2 months ago

@autopilot correct and improve on the following code: Research how the effect of the L1_team dummy differs for the lowest tercile L1_Fund_Size versus the highest tercile by L1_Fund_Size

realised_regression <- feols(Transformed_Value_Added ~ L1_team
                             + log(L1_Fund_Age) + L1_Expense_Ratio 
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
                             + L1_Cash_Holdings + L1_Equity_Holdings + log(L1_Tenure) + L1_Female
                             + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                             data = realised_regression_data, cluster = c("FundId", "Time"))
codeautopilot[bot] commented 2 months ago

@autopilot correct and improve on the following code: Research how the effect of the L1_team dummy differs for the lowest tercile L1_Fund_Size versus the highest tercile by L1_Fund_Size

realised_regression <- feols(Transformed_Value_Added ~ L1_team
                             + log(L1_Fund_Age) + L1_Expense_Ratio 
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
                             + L1_Cash_Holdings + L1_Equity_Holdings + log(L1_Tenure) + L1_Female
                             + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                             data = realised_regression_data, cluster = c("FundId", "Time"))

To research how the effect of the L1_team dummy differs for the lowest tercile of L1_Fund_Size versus the highest tercile, you will need to create interaction terms between L1_team and dummy variables that indicate whether L1_Fund_Size is in the lowest or highest tercile. Here's the corrected and improved code:

# Load the necessary library
library(fixest)

# Assuming the dataset 'realised_regression_data' is already loaded or available in the environment.

# Create tercile indicators for L1_Fund_Size
realised_regression_data$tercile_low <- with(realised_regression_data, as.numeric(L1_Fund_Size <= quantile(L1_Fund_Size, probs = 1/3, na.rm = TRUE)))
realised_regression_data$tercile_high <- with(realised_regression_data, as.numeric(L1_Fund_Size > quantile(L1_Fund_Size, probs = 2/3, na.rm = TRUE)))

# Define and run the panel regression model with interaction terms and clustering
realised_regression <- feols(Transformed_Value_Added ~ L1_team * tercile_low + L1_team * tercile_high
                             + log(L1_Fund_Age) + L1_Expense_Ratio 
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
                             + L1_Cash_Holdings + L1_Equity_Holdings + log(L1_Tenure) + L1_Female
                             + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                             data = realised_regression_data, cluster = c("FundId", "Time"))

# Output the results
summary(realised_regression)

This code will estimate the effect of L1_team separately for the lowest and highest terciles of L1_Fund_Size. Because both tercile indicators are interacted with L1_team, the base L1_team coefficient captures the middle tercile, while the interaction terms L1_team:tercile_low and L1_team:tercile_high measure how the effect differs in the bottom and top terciles. The quantile function (with na.rm = TRUE) determines the cutoff points for the terciles, and the summary function displays the regression results, including the interaction terms.
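
An alternative that can be easier to read is to build a single three-level tercile factor and interact it with L1_team; the interaction coefficients then measure how the L1_team effect in the low and high terciles differs from the middle (reference) tercile. A sketch under the same assumptions about realised_regression_data:

# Build a tercile factor for L1_Fund_Size (breaks computed ignoring NAs)
breaks <- quantile(realised_regression_data$L1_Fund_Size,
                   probs = c(0, 1/3, 2/3, 1), na.rm = TRUE)
realised_regression_data$Size_Tercile <- cut(realised_regression_data$L1_Fund_Size,
                                             breaks = breaks, include.lowest = TRUE,
                                             labels = c("low", "mid", "high"))

# L1_team interacted with the tercile factor; controls and fixed effects as before
realised_regression_terciles <- feols(Transformed_Value_Added ~ L1_team * Size_Tercile
                                      + log(L1_Fund_Age) + L1_Expense_Ratio
                                      + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow
                                      + L1_Cash_Holdings + L1_Equity_Holdings + log(L1_Tenure) + L1_Female
                                      + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                                      data = realised_regression_data, cluster = c("FundId", "Time"))
summary(realised_regression_terciles)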

LilMichelangelo2 commented 2 months ago

autopilot extend the code in R: add the column L1_team to every observation in scalability_full. L1_team takes the values of 0 or 1, depending on which value has the most observations per FundId per list element in innovator_list. Then, assign to each FundId the correct value. The Date per FundId is the list element name in innovator_list.

scalability_list <- lapply(innovator_list, function(df) {
  df %>%
    mutate(Neg_L1_Fund_Size_log = -L1_Fund_Size_log) %>%
    group_by(FundId) %>%
    do({
      tryCatch(
        tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = .)),
        error = function(e) {
          tibble(
            term = NA_character_,
            estimate = NA_real_,
            std.error = NA_real_,
            statistic = NA_real_,
            p.value = NA_real_
          )
        }
      )
    }) %>%
    ungroup()
})

scalability_full <- bind_rows(scalability_list, .id = "Date")
scalability_full$Date <- as.Date(scalability_full$Date)
codeautopilot[bot] commented 2 months ago

autopilot extend the code in R: add the column L1_team to every observation in scalability_full. L1_team takes the values of 0 or 1, depending on which value has the most observations per FundId per list element in innovator_list. Then, assign to each FundId the correct value. The Date per FundId is the list element name in innovator_list.

scalability_list <- lapply(innovator_list, function(df) {
  df %>%
    mutate(Neg_L1_Fund_Size_log = -L1_Fund_Size_log) %>%
    group_by(FundId) %>%
    do({
      tryCatch(
        tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = .)),
        error = function(e) {
          tibble(
            term = NA_character_,
            estimate = NA_real_,
            std.error = NA_real_,
            statistic = NA_real_,
            p.value = NA_real_
          )
        }
      )
    }) %>%
    ungroup()
})

scalability_full <- bind_rows(scalability_list, .id = "Date")
scalability_full$Date <- as.Date(scalability_full$Date)

To extend the code in R and add the L1_team column to every observation in scalability_full, you can follow these steps:

  1. Iterate over innovator_list and for each FundId within each list element (which corresponds to a Date), determine the most common value of L1_team.
  2. Create a mapping of FundId and Date to the most common L1_team value.
  3. Join this mapping with scalability_full (after it has been rebuilt with bind_rows) to assign the correct L1_team value to each observation.

Here's the extended R code:

library(dplyr)
library(purrr)

# Step 1: Create a mapping of FundId and Date to the most common L1_team value
team_mapping_list <- lapply(names(innovator_list), function(date) {
  df <- innovator_list[[date]]
  df %>%
    group_by(FundId) %>%
    summarise(L1_team = as.numeric(names(which.max(table(L1_team))))) %>%
    mutate(Date = as.Date(date))
})

# Combine the list into a single data frame
team_mapping <- bind_rows(team_mapping_list)

# The rest of your existing code remains the same and runs first
scalability_list <- lapply(innovator_list, function(df) {
  df %>%
    mutate(Neg_L1_Fund_Size_log = -L1_Fund_Size_log) %>%
    group_by(FundId) %>%
    do({
      tryCatch(
        tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = .)),
        error = function(e) {
          tibble(
            term = NA_character_,
            estimate = NA_real_,
            std.error = NA_real_,
            statistic = NA_real_,
            p.value = NA_real_
          )
        }
      )
    }) %>%
    ungroup()
})

scalability_full <- bind_rows(scalability_list, .id = "Date")
scalability_full$Date <- as.Date(scalability_full$Date)

# Step 2: Join the mapping with scalability_full to assign L1_team values
# (the join must come after bind_rows, otherwise the new column would be overwritten)
scalability_full <- scalability_full %>%
  left_join(team_mapping, by = c("FundId", "Date"))

This code will add the L1_team column to scalability_full with the correct values based on the most common L1_team value per FundId per Date in innovator_list. The expression as.numeric(names(which.max(table(L1_team)))) picks the L1_team value (0 or 1) that occurs most often for a FundId; taking the name of the most frequent table entry is robust even when a fund only ever shows one of the two values. Note that the join is performed only after scalability_full has been rebuilt with bind_rows, so the added column is not overwritten.
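
A quick toy check of the most-common-value logic (the values here are made up for illustration only):

# Hypothetical example: three 1s and one 0 -> the mapped L1_team value is 1
x <- c(1, 1, 0, 1)
as.numeric(names(which.max(table(x))))
#> [1] 1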

LilMichelangelo2 commented 2 months ago

@autopilot i have the data frame md_final where FundId and Date uniquely identify an observation. The column "team" takes the value of 1 or 0. I want you to split FundId into two based on the value of md_final$team. If a FundId has observations with both md_final$team == 0 and md_final$team == 1, then create a new FundId with the md_final$team == 0 observations and put a "_Switch" after FundId. Remove these observations from the original FundId observations

codeautopilot[bot] commented 2 months ago

@autopilot i have the data frame md_final where FundId and Date uniquely identify an observation. The column "team" takes the value of 1 or 0. I want you to split FundId into two based on the value of md_final$team. If a FundId has observations with both md_final$team == 0 and md_final$team == 1, then create a new FundId with the md_final$team == 0 observations and put a "_Switch" after FundId. Remove these observations from the original FundId observations

To achieve the splitting of FundId based on the value of md_final$team and create a new FundId with a "_Switch" suffix for those with both team == 0 and team == 1, you can use the following R code:

library(dplyr)

# Identify FundIds with both team == 0 and team == 1
funds_with_switch <- md_final %>%
  group_by(FundId) %>%
  filter(all(c(0, 1) %in% team)) %>%
  pull(FundId) %>%
  unique()

# Create a new dataframe with "_Switch" appended to FundIds that have both team values
md_final_switch <- md_final %>%
  filter(FundId %in% funds_with_switch & team == 0) %>%
  mutate(FundId = paste0(FundId, "_Switch"))

# Remove the team == 0 observations from the original FundIds that have both team values
md_final <- md_final %>%
  filter(!(FundId %in% funds_with_switch & team == 0))

# Combine the original and the "_Switch" dataframes
md_final <- bind_rows(md_final, md_final_switch)

# The resulting md_final now has the split FundIds as requested

This code first identifies FundIds that have observations with both team == 0 and team == 1. Then it creates a new dataframe with these FundIds and a "_Switch" suffix where team == 0. After that, it removes the team == 0 observations from the original FundIds that have both team values. Finally, it combines the original dataframe with the new "_Switch" dataframe to get the final result.
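
An optional sanity check after the split (a sketch, reusing the md_final built above): count the observations moved to a "_Switch" FundId and the number of distinct funds once the suffix is ignored.

# How many observations were moved to a "_Switch" FundId
sum(grepl("_Switch$", md_final$FundId))

# Distinct funds when the "_Switch" suffix is ignored
length(unique(sub("_Switch$", "", md_final$FundId)))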

LilMichelangelo2 commented 2 months ago

@autopilot some FundId's have "_Switch" as postfix. ignore this postfix when checking for uniqueness with n_funds, i.e., count FSUSA00007_Switch and FSUSA00007 as only one FundId.

compute_additional_stats <- function(skill, value_creation_filtered) {
  percentage_negative <- sprintf("%.2f%%", 100 * mean(skill$Average_Skill < 0, na.rm = TRUE))
  n_funds <- format(length(unique(skill$FundId)), big.mark = ",", scientific = FALSE)
  obs <- format(nrow(value_creation_filtered), big.mark = ",", scientific = FALSE)

  list("Percent with less than zero" = percentage_negative, "Total number of funds" = n_funds, 
       "No. of obs." = obs)
}
codeautopilot[bot] commented 2 months ago

@autopilot some FundId's have "_Switch" as postfix. ignore this postfix when checking for uniqueness with n_funds, i.e., count FSUSA00007_Switch and FSUSA00007 as only one FundId.

compute_additional_stats <- function(skill, value_creation_filtered) {
  percentage_negative <- sprintf("%.2f%%", 100 * mean(skill$Average_Skill < 0, na.rm = TRUE))
  n_funds <- format(length(unique(skill$FundId)), big.mark = ",", scientific = FALSE)
  obs <- format(nrow(value_creation_filtered), big.mark = ",", scientific = FALSE)

  list("Percent with less than zero" = percentage_negative, "Total number of funds" = n_funds, 
       "No. of obs." = obs)
}

To modify the compute_additional_stats function so that it ignores the "_Switch" postfix when counting unique FundIds, you can use the gsub function to remove the "_Switch" part from the FundId before counting unique values. Here's the updated function:

compute_additional_stats <- function(skill, value_creation_filtered) {
  percentage_negative <- sprintf("%.2f%%", 100 * mean(skill$Average_Skill < 0, na.rm = TRUE))

  # Remove "_Switch" from FundId and then count unique values
  n_funds <- format(length(unique(gsub("_Switch", "", skill$FundId))), big.mark = ",", scientific = FALSE)

  obs <- format(nrow(value_creation_filtered), big.mark = ",", scientific = FALSE)

  list("Percent with less than zero" = percentage_negative, "Total number of funds" = n_funds, 
       "No. of obs." = obs)
}

This function will now correctly count FSUSA00007_Switch and FSUSA00007 as a single FundId.
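
For intuition, the substitution maps both variants of an identifier to the same key before the uniqueness count:

gsub("_Switch", "", c("FSUSA00007", "FSUSA00007_Switch"))
#> [1] "FSUSA00007" "FSUSA00007"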

LilMichelangelo2 commented 2 months ago

@autopilot some FundId's have "_Switch" as postfix. ignore this postfix when matching FundId's, i.e., count FSUSA00007_Switch and FSUSA00007 as only one FundId.

md_data <- md_data %>%
  left_join(md_obj_unique, by = c("FundId", "Date"))
codeautopilot[bot] commented 2 months ago

@autopilot some FundId's have "_Switch" as postfix. ignore this postfix when matching FundId's, i.e., count FSUSA00007_Switch and FSUSA00007 as only one FundId.

md_data <- md_data %>%
  left_join(md_obj_unique, by = c("FundId", "Date"))

To modify the join operation so that it ignores the "_Switch" postfix when matching FundIds, you can create a temporary key in both data frames that strips off the "_Switch" part before performing the join. Here's the updated code:

library(dplyr)

# Create a temporary key in both data frames that excludes the "_Switch" postfix
md_data <- md_data %>%
  mutate(FundId_key = gsub("_Switch", "", FundId))

md_obj_unique <- md_obj_unique %>%
  mutate(FundId_key = gsub("_Switch", "", FundId)) %>%
  select(-FundId)  # drop the original column so the join does not create FundId.x / FundId.y

# Perform the left join using the temporary key and Date
md_data <- md_data %>%
  left_join(md_obj_unique, by = c("FundId_key" = "FundId_key", "Date"))

# Remove the temporary key column if it's no longer needed
md_data <- md_data %>%
  select(-FundId_key)

This code will treat FSUSA00007_Switch and FSUSA00007 as the same FundId when performing the join operation. Dropping the original FundId column from md_obj_unique before the join prevents dplyr from creating duplicated FundId.x / FundId.y columns, so md_data keeps its original (possibly suffixed) FundId. After the join, the temporary key column is removed.

LilMichelangelo2 commented 2 months ago

@autopilot some FundId's in md_data have "_Switch" as postfix. ignore this postfix when matching FundId's from md_obj_unique, i.e., count FSUSA00007_Switch and FSUSA00007 as only one FundId.

md_data <- md_data %>%
  left_join(md_obj_unique, by = c("FundId", "Date"))
codeautopilot[bot] commented 2 months ago

@autopilot some FundId's in md_data have "_Switch" as postfix. ignore this postfix when matching FundId's from md_obj_unique, i.e., count FSUSA00007_Switch and FSUSA00007 as only one FundId.

md_data <- md_data %>%
  left_join(md_obj_unique, by = c("FundId", "Date"))

To modify the join operation so that it ignores the "_Switch" postfix when matching FundIds, you can create a temporary key in both data frames that strips off the "_Switch" part before performing the join. Here's the updated code:

library(dplyr)

# Create a temporary key in md_data that excludes the "_Switch" postfix
md_data <- md_data %>%
  mutate(FundId_key = gsub("_Switch", "", FundId))

# Perform the left join using the temporary key and Date
md_data <- md_data %>%
  left_join(md_obj_unique, by = c("FundId_key" = "FundId", "Date"))

# Remove the temporary key column if it's no longer needed
md_data <- md_data %>%
  select(-FundId_key)

This code will treat FSUSA00007_Switch and FSUSA00007 as the same FundId when performing the join operation. After the join, the temporary key column is removed.

LilMichelangelo2 commented 2 months ago

@autopilot instead of creating a list with 60-month observation windows, use the entire panel dataset. proceed with calculating scalability estimates per Date / FundId by regressing net_alpha ~ Neg_L1_Fund_Size_log + size_innovation on the prior 60 months of data per FundId. There must be at least 36 observations in the last 60 months per Date.

log_size <- adjusted_net_alpha

log_size <- log_size %>%
  arrange(Date, FundId)

log_size$Fund_Size_log <- log(log_size$Fund_Size)
log_size$L1_Fund_Size_log <- log(log_size$L1_Fund_Size)
log_size$Fund_Size <- NULL
log_size$L1_Fund_Size <- NULL 
log_size <- log_size %>% arrange(Date)

# Format the 60-month estimation window
min_valid_date <- min(log_size$Date) %m+% months(59)

valid_dates <- unique(log_size$Date[log_size$Date >= min_valid_date])

log_size_list <- lapply(valid_dates, function(current_date) {
  start_date <- current_date %m-% months(59)
  subset_data <- log_size %>%
    filter(Date >= start_date & Date <= current_date)
  return(subset_data)
})

# Remove funds with fewer than 36 observations in the window
log_size_list <- lapply(log_size_list, function(df) {
  df %>%
    group_by(FundId) %>%
    mutate(Count = n()) %>%
    filter(Count >= 36) %>%  
    dplyr::select(-Count) %>%
    ungroup()  
})

# Format the data
names(log_size_list) <- as.character(ceiling_date(valid_dates %m+% months(1), "month") - days(1))
log_size_list[[385]] <- NULL

# Compute the expected scalability parameter following Van Binsbergen et al. (2023), pp. 30f
scalability_list <- lapply(log_size_list, function(df) {
  df %>%
    mutate(Neg_L1_Fund_Size_log = -L1_Fund_Size_log) %>%
    group_by(FundId) %>%
    do({
      tryCatch(
        tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = .)),
        error = function(e) {
          tibble(
            term = NA_character_,
            estimate = NA_real_,
            std.error = NA_real_,
            statistic = NA_real_,
            p.value = NA_real_
          )
        }
      )
    }) %>%
    ungroup()
})

scalability_full <- bind_rows(scalability_list, .id = "Date")
scalability_full$Date <- as.Date(scalability_full$Date)
codeautopilot[bot] commented 2 months ago

@autopilot instead of creating a list with 60-month observation windows, use the entire panel dataset. proceed with calculating scalability estimates per Date / FundId by regressing net_alpha ~ Neg_L1_Fund_Size_log + size_innovation on the prior 60 months of data per FundId. There must be at least 36 observations in the last 60 months per Date.

log_size <- adjusted_net_alpha

log_size <- log_size %>%
  arrange(Date, FundId)

log_size$Fund_Size_log <- log(log_size$Fund_Size)
log_size$L1_Fund_Size_log <- log(log_size$L1_Fund_Size)
log_size$Fund_Size <- NULL
log_size$L1_Fund_Size <- NULL 
log_size <- log_size %>% arrange(Date)

# Format the 60-month estimation window
min_valid_date <- min(log_size$Date) %m+% months(59)

valid_dates <- unique(log_size$Date[log_size$Date >= min_valid_date])

log_size_list <- lapply(valid_dates, function(current_date) {
  start_date <- current_date %m-% months(59)
  subset_data <- log_size %>%
    filter(Date >= start_date & Date <= current_date)
  return(subset_data)
})

# Remove funds with fewer than 36 observations in the window
log_size_list <- lapply(log_size_list, function(df) {
  df %>%
    group_by(FundId) %>%
    mutate(Count = n()) %>%
    filter(Count >= 36) %>%  
    dplyr::select(-Count) %>%
    ungroup()  
})

# Format the data
names(log_size_list) <- as.character(ceiling_date(valid_dates %m+% months(1), "month") - days(1))
log_size_list[[385]] <- NULL

# Compute the expected scalability parameter following Van Binsbergen et al. (2023), pp. 30f
scalability_list <- lapply(log_size_list, function(df) {
  df %>%
    mutate(Neg_L1_Fund_Size_log = -L1_Fund_Size_log) %>%
    group_by(FundId) %>%
    do({
      tryCatch(
        tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = .)),
        error = function(e) {
          tibble(
            term = NA_character_,
            estimate = NA_real_,
            std.error = NA_real_,
            statistic = NA_real_,
            p.value = NA_real_
          )
        }
      )
    }) %>%
    ungroup()
})

scalability_full <- bind_rows(scalability_list, .id = "Date")
scalability_full$Date <- as.Date(scalability_full$Date)

To calculate scalability estimates per Date/FundId using the entire panel dataset and regressing net_alpha ~ Neg_L1_Fund_Size_log + size_innovation on the prior 60 months of data per FundId, with at least 36 observations in the last 60 months per Date, you can use the following R code:

library(dplyr)
library(lubridate)
library(purrr)   # provides map_df()
library(broom)

# Assuming log_size is already defined and contains the necessary columns
log_size <- log_size %>%
  arrange(Date, FundId) %>%
  mutate(
    Fund_Size_log = log(Fund_Size),
    L1_Fund_Size_log = log(L1_Fund_Size),
    Neg_L1_Fund_Size_log = -L1_Fund_Size_log
  ) %>%
  select(-Fund_Size, -L1_Fund_Size)

# Define a function to perform the rolling regression
perform_rolling_regression <- function(df, current_date) {
  start_date <- current_date %m-% months(59)
  subset_data <- df %>%
    filter(Date >= start_date & Date <= current_date) %>%
    group_by(FundId) %>%
    filter(n() >= 36) # Ensure at least 36 observations in the last 60 months

  # Perform the regression for each FundId
  regression_results <- subset_data %>%
    do({
      tryCatch(
        tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = .)),
        error = function(e) {
          tibble(
            term = NA_character_,
            estimate = NA_real_,
            std.error = NA_real_,
            statistic = NA_real_,
            p.value = NA_real_
          )
        }
      )
    }) %>%
    ungroup() %>%
    mutate(Date = current_date) # Add the current date to the results

  return(regression_results)
}

# Calculate scalability estimates for each Date
scalability_estimates <- map_df(unique(log_size$Date), ~perform_rolling_regression(log_size, .x))

# The resulting scalability_estimates dataframe contains the regression results per Date/FundId

This code defines a function perform_rolling_regression that takes a dataframe and a date, filters the data for the 60 months leading up to that date, ensures there are at least 36 observations for each FundId, and then performs the regression. The map_df function from the purrr package is used to apply this function to each unique date in the dataset, and the results are combined into a single dataframe scalability_estimates.
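
A possible post-processing step (a sketch, assuming scalability_estimates as built above and that the FundId grouping column is carried through by do()): keep only the coefficient on Neg_L1_Fund_Size_log, which is the scalability estimate of interest.

# Extract the scalability coefficient per FundId and Date
# (dplyr is already loaded above; repeated here so the snippet stands alone)
library(dplyr)
scalability_only <- scalability_estimates %>%
  filter(term == "Neg_L1_Fund_Size_log") %>%
  select(FundId, Date, estimate, std.error)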

LilMichelangelo2 commented 2 months ago

@autopilot is this the right approach to ensure that size_innovation has a zero mean for FundId over all observations in each list element?

names(pSize_list) <- names(log_size_list)
innovator_list <- lapply(pSize_list, function(df) {
  df <- df %>% 
    mutate(initial_innovation = Fund_Size_log - (PMUE * L1_Fund_Size_log))
  return(df)
})
innovator_list <- lapply(innovator_list, function(df) {
  df <- df %>%
    group_by(FundId) %>%
    mutate(mean_innovation = mean(initial_innovation, na.rm = TRUE)) %>%
    ungroup()
  return(df)
})
innovator_list <- lapply(innovator_list, function(df) {
  df <- df %>%
    mutate(size_innovation = initial_innovation - mean_innovation)
  return(df)
})
codeautopilot[bot] commented 2 months ago

@autopilot is this the right approach to ensure that size_innovation has a zero mean for FundId over all observations in each list element?

names(pSize_list) <- names(log_size_list)
innovator_list <- lapply(pSize_list, function(df) {
  df <- df %>% 
    mutate(initial_innovation = Fund_Size_log - (PMUE * L1_Fund_Size_log))
  return(df)
})
innovator_list <- lapply(innovator_list, function(df) {
  df <- df %>%
    group_by(FundId) %>%
    mutate(mean_innovation = mean(initial_innovation, na.rm = TRUE)) %>%
    ungroup()
  return(df)
})
innovator_list <- lapply(innovator_list, function(df) {
  df <- df %>%
    mutate(size_innovation = initial_innovation - mean_innovation)
  return(df)
})

The approach you've taken in the R code is correct for ensuring that size_innovation has a zero mean for each FundId over all observations within each list element. You are calculating an initial innovation metric, then computing the mean of this metric for each FundId, and finally centering the initial innovation by subtracting the mean to get size_innovation. This process will result in size_innovation having a mean of zero for each FundId within each list element.

Here's a breakdown of the steps you've taken:

  1. Calculate initial_innovation for each observation as the difference between Fund_Size_log and PMUE * L1_Fund_Size_log.
  2. Group by FundId and calculate the mean of initial_innovation for each FundId.
  3. Subtract the mean_innovation from initial_innovation to get size_innovation, which will have a mean of zero for each FundId.

The code is well-structured and uses the dplyr package effectively to perform the necessary transformations. If pSize_list and log_size_list have the same structure and PMUE is defined correctly, your approach should work as intended.
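
An optional numerical check (a sketch, assuming innovator_list as built above and dplyr loaded): the per-FundId mean of size_innovation should be zero up to floating-point error in every list element.

library(dplyr)

# Largest absolute per-FundId mean of size_innovation in each list element
sapply(innovator_list, function(df) {
  df %>%
    group_by(FundId) %>%
    summarise(m = mean(size_innovation, na.rm = TRUE), .groups = "drop") %>%
    summarise(max_abs = max(abs(m), na.rm = TRUE)) %>%
    pull(max_abs)
})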

LilMichelangelo2 commented 2 months ago

@autopilot extend and adjust the code 2 in R to integrate the code 1, i.e., the size_innovation calculation

Code 2:

log_size <- log_size %>%
  arrange(Date, FundId) %>%
  mutate(
    Fund_Size_log = log(Fund_Size),
    L1_Fund_Size_log = log(L1_Fund_Size),
    Neg_L1_Fund_Size_log = -L1_Fund_Size_log
  ) %>%
  select(-Fund_Size, -L1_Fund_Size)

# Define a function to perform the rolling regression
perform_rolling_regression <- function(df, current_date) {
  start_date <- current_date %m-% months(59)
  subset_data <- df %>%
    filter(Date >= start_date & Date <= current_date) %>%
    group_by(FundId) %>%
    filter(n() >= 36) # Ensure at least 36 observations in the last 60 months

  # Perform the regression for each FundId
  regression_results <- subset_data %>%
    do({
      tryCatch(
        tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = .)),
        error = function(e) {
          tibble(
            term = NA_character_,
            estimate = NA_real_,
            std.error = NA_real_,
            statistic = NA_real_,
            p.value = NA_real_
          )
        }
      )
    }) %>%
    ungroup() %>%
    mutate(Date = current_date) # Add the current date to the results

  return(regression_results)
}

# Calculate scalability estimates for each Date
scalability_estimates <- map_df(unique(log_size$Date), ~perform_rolling_regression(log_size, .x))

Code 1:

innovator_list <- lapply(pSize_list, function(df) {
  df <- df %>% 
    mutate(initial_innovation = Fund_Size_log - (PMUE * L1_Fund_Size_log))
  return(df)
})
innovator_list <- lapply(innovator_list, function(df) {
  df <- df %>%
    group_by(FundId) %>%
    mutate(mean_innovation = mean(initial_innovation, na.rm = TRUE)) %>%
    ungroup()
  return(df)
})
innovator_list <- lapply(innovator_list, function(df) {
  df <- df %>%
    mutate(size_innovation = initial_innovation - mean_innovation)
  return(df)
})
codeautopilot[bot] commented 2 months ago

@autopilot extend and adjust the code 2 in R to integrate the code 1, i.e., the size_innovation calculation

Code 2:

log_size <- log_size %>%
  arrange(Date, FundId) %>%
  mutate(
    Fund_Size_log = log(Fund_Size),
    L1_Fund_Size_log = log(L1_Fund_Size),
    Neg_L1_Fund_Size_log = -L1_Fund_Size_log
  ) %>%
  select(-Fund_Size, -L1_Fund_Size)

# Define a function to perform the rolling regression
perform_rolling_regression <- function(df, current_date) {
  start_date <- current_date %m-% months(59)
  subset_data <- df %>%
    filter(Date >= start_date & Date <= current_date) %>%
    group_by(FundId) %>%
    filter(n() >= 36) # Ensure at least 36 observations in the last 60 months

  # Perform the regression for each FundId
  regression_results <- subset_data %>%
    do({
      tryCatch(
        tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = .)),
        error = function(e) {
          tibble(
            term = NA_character_,
            estimate = NA_real_,
            std.error = NA_real_,
            statistic = NA_real_,
            p.value = NA_real_
          )
        }
      )
    }) %>%
    ungroup() %>%
    mutate(Date = current_date) # Add the current date to the results

  return(regression_results)
}

# Calculate scalability estimates for each Date
scalability_estimates <- map_df(unique(log_size$Date), ~perform_rolling_regression(log_size, .x))

Code 1:

innovator_list <- lapply(pSize_list, function(df) {
  df <- df %>% 
    mutate(initial_innovation = Fund_Size_log - (PMUE * L1_Fund_Size_log))
  return(df)
})
innovator_list <- lapply(innovator_list, function(df) {
  df <- df %>%
    group_by(FundId) %>%
    mutate(mean_innovation = mean(initial_innovation, na.rm = TRUE)) %>%
    ungroup()
  return(df)
})
innovator_list <- lapply(innovator_list, function(df) {
  df <- df %>%
    mutate(size_innovation = initial_innovation - mean_innovation)
  return(df)
})

To integrate the size_innovation calculation from Code 1 into Code 2, you need to adjust the perform_rolling_regression function to include the steps for calculating initial_innovation, mean_innovation, and size_innovation. Here's the extended and adjusted Code 2:

library(dplyr)
library(lubridate)
library(purrr)
library(broom)

log_size <- log_size %>%
  arrange(Date, FundId) %>%
  mutate(
    Fund_Size_log = log(Fund_Size),
    L1_Fund_Size_log = log(L1_Fund_Size),
    Neg_L1_Fund_Size_log = -L1_Fund_Size_log
  ) %>%
  select(-Fund_Size, -L1_Fund_Size)

# Define a function to perform the rolling regression with size_innovation calculation
perform_rolling_regression <- function(df, current_date, PMUE) {
  start_date <- current_date %m-% months(59)
  subset_data <- df %>%
    filter(Date >= start_date & Date <= current_date) %>%
    group_by(FundId) %>%
    filter(n() >= 36) # Ensure at least 36 observations in the last 60 months

  # Calculate size_innovation within the rolling window
  subset_data <- subset_data %>%
    mutate(initial_innovation = Fund_Size_log - (PMUE * L1_Fund_Size_log)) %>%
    group_by(FundId) %>%
    mutate(mean_innovation = mean(initial_innovation, na.rm = TRUE)) %>%
    mutate(size_innovation = initial_innovation - mean_innovation) %>%
    ungroup()

  # Perform the regression for each FundId
  regression_results <- subset_data %>%
    do({
      tryCatch(
        tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = .)),
        error = function(e) {
          tibble(
            term = NA_character_,
            estimate = NA_real_,
            std.error = NA_real_,
            statistic = NA_real_,
            p.value = NA_real_
          )
        }
      )
    }) %>%
    ungroup() %>%
    mutate(Date = current_date) # Add the current date to the results

  return(regression_results)
}

# Assuming PMUE is defined and available
# Calculate scalability estimates for each Date
scalability_estimates <- map_df(unique(log_size$Date), ~perform_rolling_regression(log_size, .x, PMUE))

In this adjusted code, the perform_rolling_regression function now includes the steps to calculate size_innovation for each FundId within the rolling window of 60 months. The PMUE parameter is passed to the function, which is used in the calculation of initial_innovation. After calculating size_innovation, the function proceeds with the regression analysis as before. The map_df function is used to apply this updated rolling regression function to each unique date in the dataset.

LilMichelangelo2 commented 2 months ago

@autopilot scalability is to be calculated for each FundId Date combination, not just Date.

library(dplyr)
library(lubridate)
library(purrr)
library(broom)

log_size <- log_size %>%
  arrange(Date, FundId) %>%
  mutate(
    Fund_Size_log = log(Fund_Size),
    L1_Fund_Size_log = log(L1_Fund_Size),
    Neg_L1_Fund_Size_log = -L1_Fund_Size_log
  ) %>%
  select(-Fund_Size, -L1_Fund_Size)

# Define a function to perform the rolling regression with size_innovation calculation
perform_rolling_regression <- function(df, current_date, PMUE) {
  start_date <- current_date %m-% months(59)
  subset_data <- df %>%
    filter(Date >= start_date & Date <= current_date) %>%
    group_by(FundId) %>%
    filter(n() >= 36) # Ensure at least 36 observations in the last 60 months

  # Calculate size_innovation within the rolling window
  subset_data <- subset_data %>%
    mutate(initial_innovation = Fund_Size_log - (PMUE * L1_Fund_Size_log)) %>%
    group_by(FundId) %>%
    mutate(mean_innovation = mean(initial_innovation, na.rm = TRUE)) %>%
    mutate(size_innovation = initial_innovation - mean_innovation) %>%
    ungroup()

  # Perform the regression for each FundId
  regression_results <- subset_data %>%
    do({
      tryCatch(
        tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = .)),
        error = function(e) {
          tibble(
            term = NA_character_,
            estimate = NA_real_,
            std.error = NA_real_,
            statistic = NA_real_,
            p.value = NA_real_
          )
        }
      )
    }) %>%
    ungroup() %>%
    mutate(Date = current_date) # Add the current date to the results

  return(regression_results)
}

# Assuming PMUE is defined and available
# Calculate scalability estimates for each Date
scalability_estimates <- map_df(unique(log_size$Date), ~perform_rolling_regression(log_size, .x, PMUE))
codeautopilot[bot] commented 2 months ago

@autopilot scalability is to be calculated for each FundId Date combination, not just Date.

library(dplyr)
library(lubridate)
library(purrr)
library(broom)

log_size <- log_size %>%
  arrange(Date, FundId) %>%
  mutate(
    Fund_Size_log = log(Fund_Size),
    L1_Fund_Size_log = log(L1_Fund_Size),
    Neg_L1_Fund_Size_log = -L1_Fund_Size_log
  ) %>%
  select(-Fund_Size, -L1_Fund_Size)

# Define a function to perform the rolling regression with size_innovation calculation
perform_rolling_regression <- function(df, current_date, PMUE) {
  start_date <- current_date %m-% months(59)
  subset_data <- df %>%
    filter(Date >= start_date & Date <= current_date) %>%
    group_by(FundId) %>%
    filter(n() >= 36) # Ensure at least 36 observations in the last 60 months

  # Calculate size_innovation within the rolling window
  subset_data <- subset_data %>%
    mutate(initial_innovation = Fund_Size_log - (PMUE * L1_Fund_Size_log)) %>%
    group_by(FundId) %>%
    mutate(mean_innovation = mean(initial_innovation, na.rm = TRUE)) %>%
    mutate(size_innovation = initial_innovation - mean_innovation) %>%
    ungroup()

  # Perform the regression for each FundId
  regression_results <- subset_data %>%
    do({
      tryCatch(
        tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = .)),
        error = function(e) {
          tibble(
            term = NA_character_,
            estimate = NA_real_,
            std.error = NA_real_,
            statistic = NA_real_,
            p.value = NA_real_
          )
        }
      )
    }) %>%
    ungroup() %>%
    mutate(Date = current_date) # Add the current date to the results

  return(regression_results)
}

# Assuming PMUE is defined and available
# Calculate scalability estimates for each Date
scalability_estimates <- map_df(unique(log_size$Date), ~perform_rolling_regression(log_size, .x, PMUE))

To calculate scalability for each FundId and Date combination, you need to adjust the code to iterate over each FundId and Date pair rather than just Date. Here's the updated code:

library(dplyr)
library(lubridate)
library(purrr)
library(broom)

log_size <- log_size %>%
  arrange(Date, FundId) %>%
  mutate(
    Fund_Size_log = log(Fund_Size),
    L1_Fund_Size_log = log(L1_Fund_Size),
    Neg_L1_Fund_Size_log = -L1_Fund_Size_log
  ) %>%
  select(-Fund_Size, -L1_Fund_Size)

# Define a function to perform the rolling regression with size_innovation calculation
perform_rolling_regression <- function(df, fund_id, current_date, PMUE) {
  start_date <- current_date %m-% months(59)
  subset_data <- df %>%
    filter(FundId == fund_id & Date >= start_date & Date <= current_date) %>%
    filter(n() >= 36) # Ensure at least 36 observations in the last 60 months

  # Calculate size_innovation within the rolling window
  subset_data <- subset_data %>%
    mutate(initial_innovation = Fund_Size_log - (PMUE * L1_Fund_Size_log)) %>%
    mutate(mean_innovation = mean(initial_innovation, na.rm = TRUE)) %>%
    mutate(size_innovation = initial_innovation - mean_innovation)

  # Perform the regression
  regression_results <- tryCatch(
    tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = subset_data)),
    error = function(e) {
      tibble(
        term = NA_character_,
        estimate = NA_real_,
        std.error = NA_real_,
        statistic = NA_real_,
        p.value = NA_real_
      )
    }
  )

  return(regression_results %>% mutate(FundId = fund_id, Date = current_date))
}

# Assuming PMUE is defined and available
# Calculate scalability estimates for each FundId and Date combination
fund_dates <- unique(log_size %>% select(FundId, Date))
scalability_estimates <- map_df(1:nrow(fund_dates), function(i) {
  with(fund_dates[i,], perform_rolling_regression(log_size, FundId, Date, PMUE))
})

# The resulting scalability_estimates dataframe contains the regression results per FundId and Date

In this updated code, we first create a data frame fund_dates that contains all unique combinations of FundId and Date. Then, we use map_df to apply the perform_rolling_regression function to each row of fund_dates. The function now takes an additional parameter fund_id to filter the data for the specific FundId and Date combination. The regression is performed for each combination, and the results are combined into the scalability_estimates data frame.
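
As a small readability tweak (a sketch under the same assumptions), the FundId/Date grid can also be built with distinct(), the more idiomatic dplyr equivalent of unique() on selected columns:

# Equivalent construction of the FundId/Date combinations
fund_dates <- log_size %>% distinct(FundId, Date)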

LilMichelangelo2 commented 2 months ago

@autopilot make this code more efficient:

library(dplyr)
library(lubridate)
library(purrr)
library(broom)

log_size <- log_size %>%
  arrange(Date, FundId) %>%
  mutate(
    Fund_Size_log = log(Fund_Size),
    L1_Fund_Size_log = log(L1_Fund_Size),
    Neg_L1_Fund_Size_log = -L1_Fund_Size_log
  ) %>%
  select(-Fund_Size, -L1_Fund_Size)

# Define a function to perform the rolling regression with size_innovation calculation
perform_rolling_regression <- function(df, fund_id, current_date, PMUE) {
  start_date <- current_date %m-% months(59)
  subset_data <- df %>%
    filter(FundId == fund_id & Date >= start_date & Date <= current_date) %>%
    filter(n() >= 36) # Ensure at least 36 observations in the last 60 months

  # Calculate size_innovation within the rolling window
  subset_data <- subset_data %>%
    mutate(initial_innovation = Fund_Size_log - (PMUE * L1_Fund_Size_log)) %>%
    mutate(mean_innovation = mean(initial_innovation, na.rm = TRUE)) %>%
    mutate(size_innovation = initial_innovation - mean_innovation)

  # Perform the regression
  regression_results <- tryCatch(
    tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = subset_data)),
    error = function(e) {
      tibble(
        term = NA_character_,
        estimate = NA_real_,
        std.error = NA_real_,
        statistic = NA_real_,
        p.value = NA_real_
      )
    }
  )

  return(regression_results %>% mutate(FundId = fund_id, Date = current_date))
}

# Assuming PMUE is defined and available
# Calculate scalability estimates for each FundId and Date combination
fund_dates <- unique(log_size %>% select(FundId, Date))
scalability_estimates <- map_df(1:nrow(fund_dates), function(i) {
  with(fund_dates[i,], perform_rolling_regression(log_size, FundId, Date, PMUE))
})
codeautopilot[bot] commented 2 months ago

@autopilot make this code more efficient:

library(dplyr)
library(lubridate)
library(purrr)
library(broom)

log_size <- log_size %>%
  arrange(Date, FundId) %>%
  mutate(
    Fund_Size_log = log(Fund_Size),
    L1_Fund_Size_log = log(L1_Fund_Size),
    Neg_L1_Fund_Size_log = -L1_Fund_Size_log
  ) %>%
  select(-Fund_Size, -L1_Fund_Size)

# Define a function to perform the rolling regression with size_innovation calculation
perform_rolling_regression <- function(df, fund_id, current_date, PMUE) {
  start_date <- current_date %m-% months(59)
  subset_data <- df %>%
    filter(FundId == fund_id & Date >= start_date & Date <= current_date) %>%
    filter(n() >= 36) # Ensure at least 36 observations in the last 60 months

  # Calculate size_innovation within the rolling window
  subset_data <- subset_data %>%
    mutate(initial_innovation = Fund_Size_log - (PMUE * L1_Fund_Size_log)) %>%
    mutate(mean_innovation = mean(initial_innovation, na.rm = TRUE)) %>%
    mutate(size_innovation = initial_innovation - mean_innovation)

  # Perform the regression
  regression_results <- tryCatch(
    tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = subset_data)),
    error = function(e) {
      tibble(
        term = NA_character_,
        estimate = NA_real_,
        std.error = NA_real_,
        statistic = NA_real_,
        p.value = NA_real_
      )
    }
  )

  return(regression_results %>% mutate(FundId = fund_id, Date = current_date))
}

# Assuming PMUE is defined and available
# Calculate scalability estimates for each FundId and Date combination
fund_dates <- unique(log_size %>% select(FundId, Date))
scalability_estimates <- map_df(1:nrow(fund_dates), function(i) {
  with(fund_dates[i,], perform_rolling_regression(log_size, FundId, Date, PMUE))
})

To make the code more efficient, consider the following optimizations:

  1. Pre-calculate the logarithms and negative logarithms outside of the rolling regression function to avoid redundant calculations.
  2. Use data.table for faster data manipulation, especially for large datasets.
  3. Instead of using map_df with row indices, directly iterate over the unique FundId and Date combinations.
  4. Build the unique FundId/Date grid once and use fast (optionally keyed) data.table subsetting to pull the 60-month window for each combination.

Here's the optimized code:

library(data.table)
library(lubridate)
library(broom)       # provides tidy()

# Convert log_size to a data.table for efficient operations
setDT(log_size)
log_size[, `:=`(
  Fund_Size_log = log(Fund_Size),
  L1_Fund_Size_log = log(L1_Fund_Size),
  Neg_L1_Fund_Size_log = -L1_Fund_Size_log
)]
log_size[, c("Fund_Size", "L1_Fund_Size") := NULL]

# Define a function to perform the rolling regression with size_innovation calculation
perform_rolling_regression <- function(subset_data, PMUE) {
  # Calculate size_innovation within the rolling window
  # Create the new columns sequentially: within a single `:=` call the later
  # expressions cannot see columns created earlier in that same call
  subset_data[, initial_innovation := Fund_Size_log - (PMUE * L1_Fund_Size_log)]
  subset_data[, mean_innovation := mean(initial_innovation, na.rm = TRUE)]
  subset_data[, size_innovation := initial_innovation - mean_innovation]

  # Perform the regression
  regression_results <- tryCatch(
    # Return a data.table so that `:=` can be used on the result further below
    as.data.table(tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = subset_data))),
    error = function(e) {
      data.table(
        term = NA_character_,
        estimate = NA_real_,
        std.error = NA_real_,
        statistic = NA_real_,
        p.value = NA_real_
      )
    }
  )

  return(regression_results)
}

# Assuming PMUE is defined and available
# Calculate scalability estimates for each FundId and Date combination
# Build the unique FundId/Date grid and iterate over its rows
# (lapply over a data.table iterates over columns, so index the rows explicitly)
fund_dates <- unique(log_size[, .(FundId, Date)])
scalability_estimates <- rbindlist(lapply(seq_len(nrow(fund_dates)), function(i) {
  fund_id <- fund_dates$FundId[i]
  current_date <- fund_dates$Date[i]
  start_date <- current_date %m-% months(59)

  # Fast data.table subset to get the 60-month window for the current FundId and Date
  subset_data <- log_size[FundId == fund_id & Date >= start_date & Date <= current_date]

  if (nrow(subset_data) >= 36) {
    regression_results <- perform_rolling_regression(subset_data, PMUE)
    regression_results[, `:=`(FundId = fund_id, Date = current_date)]
    return(regression_results)
  } else {
    return(NULL)
  }
}), fill = TRUE)

# Convert scalability_estimates back to a data.frame if needed
scalability_estimates <- as.data.frame(scalability_estimates)

This optimized code uses data.table for faster data manipulation and performs the rolling regression only if there are at least 36 observations in the 60-month window. The rbindlist function is used to combine the results, and the fill = TRUE argument ensures that all columns are present even if some regressions return NULL (due to insufficient observations).
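
One further optional tweak (an assumption on top of the code above, not something it requires): keying the table lets data.table use binary search for the repeated per-fund subsets, which can speed up the loop on large panels.

# Set a key before the lapply loop so the FundId/Date subsets can use binary search
setkey(log_size, FundId, Date)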

LilMichelangelo2 commented 2 months ago

@autopilot the outcome Gross_Alpha for the following code is correct for equal-weighting (the mean of the mean of every FundId) but wrong for value-weighting (number of observations per FundId * mean Gross_Alpha of FundId / Sum of observations across all FundIds). Comment on why that is:

library(tidyr)
library(broom)
library(dplyr)
library(stringr)
library(purrr)
library(lubridate)
library(zoo)
library(lmtest)

md_final <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/MS/md_final.csv")
md_final$Date <- as.Date(md_final$Date)
md_final <- md_final[,2:ncol(md_final)]

md_funds <- md_final
md_funds$Date <- as.Date(md_funds$Date)
md_funds$Date <- ceiling_date(md_funds$Date, "month") - days(1)
md_funds <- md_funds[,c(1,2,18)]

#WRDS Fama-French 5 Factors Plus Momentum - Monthly Frequency 
rf <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/External Data/rf.csv")
rf$date <- as.Date(rf$date)
rf$date <- ceiling_date(rf$date, "month") - days(1)
rf$dateff <- NULL
rf <- rf %>%
  distinct(date, .keep_all = TRUE)

sorted_funds <- md_funds %>%
  arrange(Date, FundId)
sorted_funds <- sorted_funds %>%
  left_join(rf, by = c("Date" = "date"))
sorted_funds <- sorted_funds %>%
  mutate(Gross_Excess_Return = Gross_Return - rf)

sorted_funds$Gross_Return <- NULL
sorted_funds$rf <- NULL
#####
sorted_funds <- sorted_funds %>%
  filter(year(Date) <= 2014)
#####

funds <- sorted_funds %>%
  pivot_wider(names_from = FundId, values_from = Gross_Excess_Return)
funds$Date <- as.character(funds$Date)
fund_return <- as.matrix(funds[,2:ncol(funds)])

#WRDS CRSP Mutual Fund Database 
index <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/External Data/Vanguard.csv")
index$Index_Return <- index$mret
index$mret <- NULL
index$caldt <- as.Date(index$caldt)
index$caldt <- ceiling_date(index$caldt, "month") - days(1)
index <- index[,c(1,2,4)]

sorted_index <- index %>%
  arrange(caldt, ticker)
index_mapping <- setNames(c("index_1", "index_2", "index_3", "index_4", "index_5", "index_6", 
                            "index_7", "index_8", "index_9", "index_10", "index_11"), 
                          c("VFINX", "VEXMX", "NAESX", "VEURX", "VPACX", "VVIAX", 
                            "VBINX", "VEIEX", "VIMSX", "VISGX", "VISVX"))
sorted_index <- sorted_index %>%
  mutate(index_no = index_mapping[as.character(ticker)]) 
sorted_index <- sorted_index %>%
  left_join(rf, by = c("caldt" = "date"))
sorted_index$Index_Return <- as.numeric(sorted_index$Index_Return)
sorted_index <- sorted_index %>%
  mutate(Excess_Index = Index_Return - rf)

sorted_index <- sorted_index[,c(4,2,6)]
#####
sorted_index <- sorted_index %>%
  filter(year(caldt) <= 2014)
#####
sorted_index <- na.omit(sorted_index)
benchmark <- sorted_index %>%
  pivot_wider(names_from = index_no, values_from = Excess_Index)
benchmark[, 2:12] <- lapply(benchmark[, 2:12], as.numeric)

fund_return_list <- lapply(seq_len(ncol(fund_return)), function(i) {
  return_vector <- fund_return[, i]
  names(return_vector) <- rownames(fund_return)
  return(return_vector)
})
names(fund_return_list) <- colnames(fund_return)

# Create a matrix "X" with 12 columns (intercept, index_1 to index_11)
X <- as.matrix(cbind(1, benchmark[2:12]))

###### X0_hat ###### 

# Orthogonalize without the intercept
X0_hat <- X[, 2:12] 

# Orthogonalization procedure
orthogonalize_variable <- function(variable, previous_variables_matrix) {

  # Index of the first non-NA value
  first_obs_variable <- which(!is.na(variable))[1]

  # Time-series vector (first to last observation of the current variable n)
  valid_length_variable <- first_obs_variable:length(variable)

  # Time-series vector (first to last observations of the previous, already orthogonalized variables n-1)
  valid_length_previous_vars <- first_obs_variable:nrow(previous_variables_matrix)

  # Subsample to the observation period of n
  if (length(valid_length_previous_vars) > length(valid_length_variable)) {
    valid_length_previous_vars <- valid_length_previous_vars[1:length(valid_length_variable)]
  }

  # Linear projection of n onto n-1
  fit <- lm(variable[valid_length_variable] ~ previous_variables_matrix[valid_length_previous_vars, , drop = FALSE])

  # The sum of the residuals and the constant (intercept) gives the values of the nth orthogonalized fund (BvB, p. 19)
  intercept <- coef(fit)[1] 
  residuals_with_mean <- residuals(fit) + intercept

  # Replace the old values with the new ones
  variable[valid_length_variable] <- residuals_with_mean
  return(variable)
}

# Run the sequential orthogonalization
for (j in 2:ncol(X0_hat)) {
  previous_variables_matrix <- X0_hat[, 1:(j - 1), drop = FALSE]
  X0_hat[, j] <- orthogonalize_variable(X0_hat[, j], previous_variables_matrix)
}

# Replace NAs with 0
X0_hat[is.na(X0_hat)] <- 0

# Add the intercept
X0_hat <- cbind(1, X0_hat)

###### X0 ###### 

# Procedure to replace the zeroed-out NAs with the means of the orthogonalized fund returns
impute_with_mean <- function(X) {
  for (j in 1:ncol(X)) {
    # Calculate mean excluding zeros and NAs
    column_mean <- mean(X[X[, j] != 0, j], na.rm = TRUE)
    # Replace zeros with the column mean
    X[X[, j] == 0, j] <- column_mean
  }
  return(X)
}

# Apply the mean imputation
X0 <- impute_with_mean(X0_hat)

###### Regression ###### 
# Convert the X0 matrix to a data frame
X0_df = as.data.frame(X0)
X0_df$Date <- funds$Date
X0_df$Date <- as.Date(X0_df$Date) 

# Merge fund and benchmark data
X0_merged <- merge(sorted_funds, X0_df, by = "Date", all.x = TRUE)

# Compute the coefficients via OLS regression
results <- X0_merged %>%
  group_by(FundId) %>%
  do(tidy(lm(Gross_Excess_Return ~ V1 + index_1 + index_2 + index_3 + index_4 + index_5 + 
               index_6 + index_7 + index_8 + index_9 + index_10 + index_11 - 1 , data = .))) %>%
  ungroup()

# Build the coefficient data frame
fund_coefficients <- results %>%
  pivot_wider(names_from = term, values_from = estimate, id_cols = FundId) %>%
  rename_with(~ gsub("^\\(Intercept\\)$", "V1", .x), .cols = everything()) %>%
  dplyr::select(FundId, V1, starts_with("index_"))

# # Procedure to compute the coefficients
# fit_model <- function(returns) {
#   data_for_model <- data.frame(fund_return = returns, X0)
#   model <- lm(fund_return ~ . - 1, data = data_for_model) 
#   return(coefficients(model)) 
# }
# 
# # Compute the coefficients
# models_list <- lapply(fund_return_list, fit_model)
# 
# # Build the coefficient data frame
# fund_coefficients <- data.frame(do.call(rbind, models_list))
# fund_coefficients$FundId <- names(models_list)
# colnames(fund_coefficients)[1] <- "intercept"
# new_names <- paste("index", 1:11, sep="_")
# colnames(fund_coefficients)[2:12] <- new_names

###### Benchmark-Mapping ###### 
# Convert the matrix X0_hat to a data frame
X0_hat_data <- data.frame(X0_hat)
X0_hat_data <- X0_hat_data[,2:12]
X0_hat_data$Date <- funds$Date
X0_hat_data$Date <- as.Date(X0_hat_data$Date)

# Map orthogonalized benchmark returns and regression coefficients to each observation (FundId - Date)
fund_benchmark_returns <- md_funds[,1:2]
fund_benchmark_returns <- fund_benchmark_returns %>%
  left_join(fund_coefficients, by = "FundId")
fund_benchmark_returns$intercept <- NULL  # no-op here: the coefficient column is named "V1" and is dropped by the column selection below
fund_benchmark_returns <- merge(X0_hat_data, fund_benchmark_returns, by = "Date", all = TRUE)
fund_benchmark_returns <- fund_benchmark_returns[,c(13,1,2:12,15:25)]

# Compute the fund-specific benchmark returns (coefficients * orthogonalized benchmark returns) per observation
index_x_cols <- grep("index_[0-9]+\\.x", names(fund_benchmark_returns), value = TRUE)
index_y_cols <- gsub("\\.x$", ".y", index_x_cols)  
new_index_cols <- gsub("\\.x$", "", gsub("^index_", "", index_x_cols))  
for (i in seq_along(index_x_cols)) {
  x_col <- index_x_cols[i]
  y_col <- index_y_cols[i]
  new_col <- new_index_cols[i]
  if (y_col %in% names(fund_benchmark_returns)) {
    fund_benchmark_returns[[new_col]] <- fund_benchmark_returns[[x_col]] * fund_benchmark_returns[[y_col]]
  }
}
fund_benchmark_returns <- fund_benchmark_returns[,c(2,1, 25:35)]
fund_benchmark_returns <- fund_benchmark_returns %>%
  rename_with(~ paste("index", 1:11, sep = "_"), .cols = 3:13)
fund_benchmark_returns$benchmark_return <- rowSums(fund_benchmark_returns[, 3:13], na.rm = TRUE)

###### Gross Alpha ######

# Match benchmark returns to gross excess returns
adjusted_gross_alpha <- left_join(fund_benchmark_returns, sorted_funds[, c("FundId", "Date", "Gross_Excess_Return")], 
                                  by = c("FundId", "Date"))
adjusted_gross_alpha <- adjusted_gross_alpha[,c(1,2,15,14)]

# Compute the gross alphas (gross excess return minus the sum of the benchmark returns)
adjusted_gross_alpha <- adjusted_gross_alpha %>%
  filter(!is.na(Gross_Excess_Return))
adjusted_gross_alpha$Gross_Alpha <- adjusted_gross_alpha$Gross_Excess_Return - adjusted_gross_alpha$benchmark_return
adjusted_gross_alpha <- adjusted_gross_alpha[,c(1:2,5,3,4)]
codeautopilot[bot] commented 2 months ago

@autopilot the Gross_Alpha outcome of the following code is correct for equal-weighting (the grand mean of the per-FundId means) but wrong for value-weighting (for each FundId, number of observations * mean Gross_Alpha, summed over funds and divided by the total number of observations across all FundIds). Comment on why that is:

library(tidyr)
library(broom)
library(dplyr)
library(stringr)
library(purrr)
library(lubridate)
library(zoo)
library(lmtest)

md_final <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/MS/md_final.csv")
md_final$Date <- as.Date(md_final$Date)
md_final <- md_final[,2:ncol(md_final)]

md_funds <- md_final
md_funds$Date <- as.Date(md_funds$Date)
md_funds$Date <- ceiling_date(md_funds$Date, "month") - days(1)
md_funds <- md_funds[,c(1,2,18)]

#WRDS Fama-French 5 Factors Plus Momentum - Monthly Frequency 
rf <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/External Data/rf.csv")
rf$date <- as.Date(rf$date)
rf$date <- ceiling_date(rf$date, "month") - days(1)
rf$dateff <- NULL
rf <- rf %>%
  distinct(date, .keep_all = TRUE)

sorted_funds <- md_funds %>%
  arrange(Date, FundId)
sorted_funds <- sorted_funds %>%
  left_join(rf, by = c("Date" = "date"))
sorted_funds <- sorted_funds %>%
  mutate(Gross_Excess_Return = Gross_Return - rf)

sorted_funds$Gross_Return <- NULL
sorted_funds$rf <- NULL
#####
sorted_funds <- sorted_funds %>%
  filter(year(Date) <= 2014)
#####

funds <- sorted_funds %>%
  pivot_wider(names_from = FundId, values_from = Gross_Excess_Return)
funds$Date <- as.character(funds$Date)
fund_return <- as.matrix(funds[,2:ncol(funds)])

#WRDS CRSP Mutual Fund Database 
index <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/External Data/Vanguard.csv")
index$Index_Return <- index$mret
index$mret <- NULL
index$caldt <- as.Date(index$caldt)
index$caldt <- ceiling_date(index$caldt, "month") - days(1)
index <- index[,c(1,2,4)]

sorted_index <- index %>%
  arrange(caldt, ticker)
index_mapping <- setNames(c("index_1", "index_2", "index_3", "index_4", "index_5", "index_6", 
                            "index_7", "index_8", "index_9", "index_10", "index_11"), 
                          c("VFINX", "VEXMX", "NAESX", "VEURX", "VPACX", "VVIAX", 
                            "VBINX", "VEIEX", "VIMSX", "VISGX", "VISVX"))
sorted_index <- sorted_index %>%
  mutate(index_no = index_mapping[as.character(ticker)]) 
sorted_index <- sorted_index %>%
  left_join(rf, by = c("caldt" = "date"))
sorted_index$Index_Return <- as.numeric(sorted_index$Index_Return)
sorted_index <- sorted_index %>%
  mutate(Excess_Index = Index_Return - rf)

sorted_index <- sorted_index[,c(4,2,6)]
#####
sorted_index <- sorted_index %>%
  filter(year(caldt) <= 2014)
#####
sorted_index <- na.omit(sorted_index)
benchmark <- sorted_index %>%
  pivot_wider(names_from = index_no, values_from = Excess_Index)
benchmark[, 2:12] <- lapply(benchmark[, 2:12], as.numeric)

fund_return_list <- lapply(seq_len(ncol(fund_return)), function(i) {
  return_vector <- fund_return[, i]
  names(return_vector) <- rownames(fund_return)
  return(return_vector)
})
names(fund_return_list) <- colnames(fund_return)

# Create a matrix "X" with 12 columns (intercept, index_1 to index_11)
X <- as.matrix(cbind(1, benchmark[2:12]))

###### X0_hat ###### 

# Orthogonalize without the intercept
X0_hat <- X[, 2:12] 

# Orthogonalization procedure
orthogonalize_variable <- function(variable, previous_variables_matrix) {

  # Index of the first non-NA value
  first_obs_variable <- which(!is.na(variable))[1]

  # Time index vector (first to last observation of the current variable n)
  valid_length_variable <- first_obs_variable:length(variable)

  # Time index vector (first to last observation of the previously orthogonalized variables 1..n-1)
  valid_length_previous_vars <- first_obs_variable:nrow(previous_variables_matrix)

  # Restrict the subsample to the observation window of n
  if (length(valid_length_previous_vars) > length(valid_length_variable)) {
    valid_length_previous_vars <- valid_length_previous_vars[1:length(valid_length_variable)]
  }

  # Linear projection of n on the variables 1..n-1
  fit <- lm(variable[valid_length_variable] ~ previous_variables_matrix[valid_length_previous_vars, , drop = FALSE])

  # Residuals plus the constant (intercept) give the values of the nth orthogonalized fund (BvB, p. 19)
  intercept <- coef(fit)[1] 
  residuals_with_mean <- residuals(fit) + intercept

  # Replace the old values with the orthogonalized ones
  variable[valid_length_variable] <- residuals_with_mean
  return(variable)
}

# Run the sequential orthogonalization
for (j in 2:ncol(X0_hat)) {
  previous_variables_matrix <- X0_hat[, 1:(j - 1), drop = FALSE]
  X0_hat[, j] <- orthogonalize_variable(X0_hat[, j], previous_variables_matrix)
}

# Replace NAs with 0
X0_hat[is.na(X0_hat)] <- 0

# Add the intercept column
X0_hat <- cbind(1, X0_hat)

###### X0 ###### 

# Procedure for replacing NAs with the means of the orthogonalized fund returns
impute_with_mean <- function(X) {
  for (j in 1:ncol(X)) {
    # Calculate mean excluding zeros and NAs
    column_mean <- mean(X[X[, j] != 0, j], na.rm = TRUE)
    # Replace zeros with the column mean
    X[X[, j] == 0, j] <- column_mean
  }
  return(X)
}

# Apply the mean imputation
X0 <- impute_with_mean(X0_hat)

###### Regression ###### 
# Convert the X0 matrix to a data frame
X0_df = as.data.frame(X0)
X0_df$Date <- funds$Date
X0_df$Date <- as.Date(X0_df$Date) 

# Merge fund and benchmark data
X0_merged <- merge(sorted_funds, X0_df, by = "Date", all.x = TRUE)

# Estimate the coefficients via OLS regression
results <- X0_merged %>%
  group_by(FundId) %>%
  do(tidy(lm(Gross_Excess_Return ~ V1 + index_1 + index_2 + index_3 + index_4 + index_5 + 
               index_6 + index_7 + index_8 + index_9 + index_10 + index_11 - 1 , data = .))) %>%
  ungroup()

# Build the coefficient data frame
fund_coefficients <- results %>%
  pivot_wider(names_from = term, values_from = estimate, id_cols = FundId) %>%
  rename_with(~ gsub("^\\(Intercept\\)$", "V1", .x), .cols = everything()) %>%
  dplyr::select(FundId, V1, starts_with("index_"))

# # Alternative procedure for estimating the coefficients
# fit_model <- function(returns) {
#   data_for_model <- data.frame(fund_return = returns, X0)
#   model <- lm(fund_return ~ . - 1, data = data_for_model) 
#   return(coefficients(model)) 
# }
# 
# # Estimate the coefficients
# models_list <- lapply(fund_return_list, fit_model)
# 
# # Build the coefficient data frame
# fund_coefficients <- data.frame(do.call(rbind, models_list))
# fund_coefficients$FundId <- names(models_list)
# colnames(fund_coefficients)[1] <- "intercept"
# new_names <- paste("index", 1:11, sep="_")
# colnames(fund_coefficients)[2:12] <- new_names

###### Benchmark Mapping ######
# Convert the matrix X0_hat to a data frame
X0_hat_data <- data.frame(X0_hat)
X0_hat_data <- X0_hat_data[,2:12]
X0_hat_data$Date <- funds$Date
X0_hat_data$Date <- as.Date(X0_hat_data$Date)

# Map orthogonalized benchmark returns and regression coefficients to each observation (FundId - Date)
fund_benchmark_returns <- md_funds[,1:2]
fund_benchmark_returns <- fund_benchmark_returns %>%
  left_join(fund_coefficients, by = "FundId")
fund_benchmark_returns$intercept <- NULL  # no-op here: the coefficient column is named "V1" and is dropped by the column selection below
fund_benchmark_returns <- merge(X0_hat_data, fund_benchmark_returns, by = "Date", all = TRUE)
fund_benchmark_returns <- fund_benchmark_returns[,c(13,1,2:12,15:25)]

# Compute the fund-specific benchmark returns (coefficients * orthogonalized benchmark returns) per observation
index_x_cols <- grep("index_[0-9]+\\.x", names(fund_benchmark_returns), value = TRUE)
index_y_cols <- gsub("\\.x$", ".y", index_x_cols)  
new_index_cols <- gsub("\\.x$", "", gsub("^index_", "", index_x_cols))  
for (i in seq_along(index_x_cols)) {
  x_col <- index_x_cols[i]
  y_col <- index_y_cols[i]
  new_col <- new_index_cols[i]
  if (y_col %in% names(fund_benchmark_returns)) {
    fund_benchmark_returns[[new_col]] <- fund_benchmark_returns[[x_col]] * fund_benchmark_returns[[y_col]]
  }
}
fund_benchmark_returns <- fund_benchmark_returns[,c(2,1, 25:35)]
fund_benchmark_returns <- fund_benchmark_returns %>%
  rename_with(~ paste("index", 1:11, sep = "_"), .cols = 3:13)
fund_benchmark_returns$benchmark_return <- rowSums(fund_benchmark_returns[, 3:13], na.rm = TRUE)

###### Gross Alpha ######

# Match benchmark returns to gross excess returns
adjusted_gross_alpha <- left_join(fund_benchmark_returns, sorted_funds[, c("FundId", "Date", "Gross_Excess_Return")], 
                                  by = c("FundId", "Date"))
adjusted_gross_alpha <- adjusted_gross_alpha[,c(1,2,15,14)]

# Compute the gross alphas (gross excess return minus the sum of the benchmark returns)
adjusted_gross_alpha <- adjusted_gross_alpha %>%
  filter(!is.na(Gross_Excess_Return))
adjusted_gross_alpha$Gross_Alpha <- adjusted_gross_alpha$Gross_Excess_Return - adjusted_gross_alpha$benchmark_return
adjusted_gross_alpha <- adjusted_gross_alpha[,c(1:2,5,3,4)]

The value-weighted figure goes wrong because the script never applies observation-count weights. Gross_Alpha is computed only at the observation level: for each FundId-Date pair, the benchmark return is subtracted from the gross excess return. Taking the mean of the per-FundId means then reproduces the equal-weighted number, since every fund enters that average with the same weight. The value-weighted number additionally requires scaling each fund's mean by its share of the total number of observations, and nothing in the script applies such weights.

To calculate a value-weighted Gross_Alpha, you would need to:

  1. Determine the weight for each FundId based on the number of observations or the size of the fund.
  2. Multiply the mean Gross_Alpha of each FundId by its corresponding weight.
  3. Sum these weighted Gross_Alpha values across all FundIds.
  4. Divide the sum by the total sum of weights across all FundIds to get the value-weighted average Gross_Alpha.

The script stops after computing the per-observation Gross_Alpha, so this weighting logic has to be added separately. Here is one way you might implement the value-weighted aggregate; the panel is collapsed to one row per FundId first, so that each fund's weight is counted only once:

# Collapse the panel to one row per FundId so that each fund contributes a single weight
fund_summary <- adjusted_gross_alpha %>%
  group_by(FundId) %>%
  summarise(number_of_observations = n(),
            mean_gross_alpha = mean(Gross_Alpha, na.rm = TRUE),
            .groups = "drop")

# Total number of observations across all FundIds (the sum of the weights)
total_weight <- sum(fund_summary$number_of_observations)

# Weighted contribution of each FundId: its mean Gross_Alpha scaled by its share of observations
fund_summary <- fund_summary %>%
  mutate(weighted_gross_alpha = (number_of_observations * mean_gross_alpha) / total_weight)

# Sum the contributions to get the value-weighted average Gross_Alpha
value_weighted_gross_alpha <- sum(fund_summary$weighted_gross_alpha, na.rm = TRUE)

This snippet first collapses the panel to one row per FundId with its observation count and mean Gross_Alpha, then weights each fund's mean by its share of the total number of observations, and finally sums the contributions to obtain the value-weighted average. Collapsing to fund level before summing matters: computing the weights and the sum on the uncollapsed panel would count each fund once per row and overweight funds with long histories.
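
As a quick cross-check (a sketch, assuming every remaining fund-month row has a non-missing Gross_Alpha), weighting each fund's mean by its observation count is algebraically the same as taking the pooled mean over all rows:

# Pooled mean over all fund-month observations; should match the value-weighted figure
pooled_mean_gross_alpha <- mean(adjusted_gross_alpha$Gross_Alpha, na.rm = TRUE)
all.equal(value_weighted_gross_alpha, pooled_mean_gross_alpha)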