Open LilMichelangelo2 opened 2 months ago
The task involves implementing a panel regression with fixed effects and clustered standard errors using the feols function from the fixest package in R. The user has provided the regression formula and the variables to be included in the model. The solution involves writing R code that correctly specifies the model, including the requested fixed effects and clustering.
The implementation will consist of the following steps in the panel_regression_analysis.R file:
1. Load the fixest library.
2. Make sure the dataset realised_regression_data is loaded into the R environment.
3. Specify the model with the feols function, including the fixed effects (Fund_Family + TIMExOBJECTIVE) and the clustering (cluster = c("FundId", "Date")).
4. Store the fitted model in realised_regression.
Here is the R code to implement the solution:
# Load the necessary library
library(fixest)
# Assuming the dataset 'realised_regression_data' is already loaded or available in the environment.
# Define and run the panel regression model with fixed effects and clustering
realised_regression <- feols(Transformed_Value_Added ~ L1_team
+ IMR + log(L1_Fund_Age) + L1_Expense_Ratio
+ L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow
+ L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
+ log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
data = realised_regression_data, cluster = c("FundId", "Date"))
# Output the results
summary(realised_regression)
This code should be placed in the panel_regression_analysis.R file, and it will execute the panel regression as requested. The summary call at the end will display the regression results, including the coefficients, standard errors, and statistical significance.
@autopilot is this a panel regression?
realised_regression <- feols(Transformed_Value_Added ~ L1_team
+ IMR + log(L1_Fund_Age) + L1_Expense_Ratio
+ L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow
+ L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
+ log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
data = realised_regression_data, cluster = c("FundId", "Date"))
Yes, the code you've provided is for a panel regression. The feols function from the fixest package in R is used to estimate fixed effects models, which are a common approach in panel data analysis. In your model, Fund_Family and TIMExOBJECTIVE are included as fixed effects, and the standard errors are clustered by FundId and Date, which is typical in panel data to account for within-group correlation.
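If you want to see how much the clustering matters, a small sketch (assuming the fitted realised_regression object from above) is to re-summarise the same estimate under different clustering choices and compare the standard errors:
library(fixest)
# Same point estimates, different variance-covariance assumptions
summary(realised_regression)                            # clustering as specified in the original call
summary(realised_regression, cluster = ~FundId)         # one-way clustering by fund
summary(realised_regression, cluster = ~FundId + Date)  # two-way clustering by fund and month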
@autopilot correct and improve on the following code: Research how the effect of the L1_team dummy differs for the lowest tercile L1_Fund_Size versus the highest tercile by L1_Fund_Size
realised_regression <- feols(Transformed_Value_Added ~ L1_team
+ log(L1_Fund_Age) + L1_Expense_Ratio
+ L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow
+ L1_Cash_Holdings + L1_Equity_Holdings + log(L1_Tenure) + L1_Female
+ log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
data = realised_regression_data, cluster = c("FundId", "Time"))
To research how the effect of the L1_team dummy differs for the lowest tercile of L1_Fund_Size versus the highest tercile, you will need to create interaction terms between L1_team and dummy variables that indicate whether L1_Fund_Size is in the lowest or highest tercile. Here's the corrected and improved code:
# Load the necessary library
library(fixest)
# Assuming the dataset 'realised_regression_data' is already loaded or available in the environment.
# Create tercile indicators for L1_Fund_Size
realised_regression_data$tercile_low <- with(realised_regression_data, as.numeric(L1_Fund_Size <= quantile(L1_Fund_Size, probs = 1/3)))
realised_regression_data$tercile_high <- with(realised_regression_data, as.numeric(L1_Fund_Size > quantile(L1_Fund_Size, probs = 2/3)))
# Define and run the panel regression model with interaction terms and clustering
realised_regression <- feols(Transformed_Value_Added ~ L1_team * tercile_low + L1_team * tercile_high
+ log(L1_Fund_Age) + L1_Expense_Ratio
+ L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow
+ L1_Cash_Holdings + L1_Equity_Holdings + log(L1_Tenure) + L1_Female
+ log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
data = realised_regression_data, cluster = c("FundId", "Time"))
# Output the results
summary(realised_regression)
This code will estimate the effect of L1_team separately for the lowest and highest terciles of L1_Fund_Size. The interaction terms L1_team * tercile_low and L1_team * tercile_high allow you to see whether the effect of L1_team differs significantly across these groups. The quantile function determines the cutoff points for the terciles, and the summary function displays the regression results, including the interaction terms.
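An alternative sketch (my suggestion, not something requested above) keeps all three size groups in one factor via dplyr::ntile and uses fixest's i() helper, which reports one L1_team slope per tercile directly:
library(dplyr)
library(fixest)
# Three-level tercile factor for lagged fund size
realised_regression_data <- realised_regression_data %>%
  mutate(size_tercile = factor(ntile(L1_Fund_Size, 3), labels = c("low", "mid", "high")))
# i(size_tercile, L1_team) yields a separate L1_team coefficient for each tercile
tercile_regression <- feols(Transformed_Value_Added ~ i(size_tercile, L1_team) + size_tercile
  + log(L1_Fund_Age) + L1_Expense_Ratio
  + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow
  + L1_Cash_Holdings + L1_Equity_Holdings + log(L1_Tenure) + L1_Female
  + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
  data = realised_regression_data, cluster = c("FundId", "Time"))
summary(tercile_regression)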
autopilot extend the code in R: add the column L1_team to every observation in scalability_full. L1_team takes the values of 0 or 1, depending on which value has the most observations per FundId per list element in innovator_list. Then, assign to each FundId the correct value. The Date per FundId is the list element name in innovator_list.
scalability_list <- lapply(innovator_list, function(df) {
df %>%
mutate(Neg_L1_Fund_Size_log = -L1_Fund_Size_log) %>%
group_by(FundId) %>%
do({
tryCatch(
tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = .)),
error = function(e) {
tibble(
term = NA_character_,
estimate = NA_real_,
std.error = NA_real_,
statistic = NA_real_,
p.value = NA_real_
)
}
)
}) %>%
ungroup()
})
scalability_full <- bind_rows(scalability_list, .id = "Date")
scalability_full$Date <- as.Date(scalability_full$Date)
To extend the code in R and add the L1_team column to every observation in scalability_full, you can follow these steps:
1. Loop over innovator_list and, for each FundId within each list element (which corresponds to a Date), determine the most common value of L1_team, giving a mapping from each FundId and Date to that value.
2. Rebuild scalability_full from scalability_list with your existing code.
3. Join the mapping with scalability_full to assign the correct L1_team value to each observation.
Here's the extended R code:
library(dplyr)
library(purrr)
# Step 1: Create a mapping of FundId and Date to the most common L1_team value
team_mapping_list <- lapply(names(innovator_list), function(date) {
df <- innovator_list[[date]]
df %>%
group_by(FundId) %>%
summarise(L1_team = as.numeric(names(which.max(table(L1_team))))) %>%  # most frequent value (0 or 1)
mutate(Date = as.Date(date))
})
# Combine the list into a single data frame
team_mapping <- bind_rows(team_mapping_list)
# Step 2: Your existing code that builds scalability_full stays the same
scalability_list <- lapply(innovator_list, function(df) {
df %>%
mutate(Neg_L1_Fund_Size_log = -L1_Fund_Size_log) %>%
group_by(FundId) %>%
do({
tryCatch(
tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = .)),
error = function(e) {
tibble(
term = NA_character_,
estimate = NA_real_,
std.error = NA_real_,
statistic = NA_real_,
p.value = NA_real_
)
}
)
}) %>%
ungroup()
})
scalability_full <- bind_rows(scalability_list, .id = "Date")
scalability_full$Date <- as.Date(scalability_full$Date)
# Step 3: Join the mapping with scalability_full to assign L1_team values
scalability_full <- scalability_full %>%
  left_join(team_mapping, by = c("FundId", "Date"))
This code adds the L1_team column to scalability_full with the correct values based on the most common L1_team value per FundId per Date in innovator_list. The expression as.numeric(names(which.max(table(L1_team)))) returns the most frequent value of L1_team for each FundId as a number (0 or 1); taking the name of the winning table entry rather than its position also gives the right answer when a fund shows only one of the two values. The join is performed after scalability_full has been rebuilt from scalability_list, so the assigned L1_team values are not overwritten.
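For intuition, here is a tiny standalone example (toy data, not from the thread) of how the modal value is extracted:
x <- c(1, 1, 0)
as.numeric(names(which.max(table(x))))  # 1, the most frequent value
y <- c(1, 1, 1)
as.numeric(names(which.max(table(y))))  # still 1, even though only one value occurs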
@autopilot i have the data frame md_final where FundId and Date uniquely identify an observation. The column "team" takes the value of 1 or 0. I want you to split FundId into two based on the value of md_final$team. If a FundId has observations with both md_final$team == 0 and md_final$team == 1, then create a new FundId with the md_final$team == 0 observations and put a "_Switch" after FundId. Remove these observations from the original FundId observations
To split FundId based on the value of md_final$team, creating a new FundId with a "_Switch" suffix for funds that have observations with both team == 0 and team == 1, you can use the following R code:
library(dplyr)
# Identify FundIds with both team == 0 and team == 1
funds_with_switch <- md_final %>%
group_by(FundId) %>%
filter(all(c(0, 1) %in% team)) %>%
pull(FundId) %>%
unique()
# Create a new dataframe with "_Switch" appended to FundIds that have both team values
md_final_switch <- md_final %>%
filter(FundId %in% funds_with_switch & team == 0) %>%
mutate(FundId = paste0(FundId, "_Switch"))
# Remove the team == 0 observations from the original FundIds that have both team values
md_final <- md_final %>%
filter(!(FundId %in% funds_with_switch & team == 0))
# Combine the original and the "_Switch" dataframes
md_final <- bind_rows(md_final, md_final_switch)
# The resulting md_final now has the split FundIds as requested
This code first identifies FundIds that have observations with both team == 0 and team == 1. It then creates a new dataframe for these FundIds, with a "_Switch" suffix, containing their team == 0 observations. After that, it removes the team == 0 observations from the original FundIds that have both team values. Finally, it combines the original dataframe with the new "_Switch" dataframe to produce the final result.
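As a quick check (a sketch, assuming md_final as defined above), you can verify that FundId and Date still uniquely identify an observation after the split:
library(dplyr)
# Count duplicated FundId-Date pairs; the result should have zero rows
md_final %>%
  count(FundId, Date) %>%
  filter(n > 1)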
@autopilot some FundId's have "_Switch" as postfix. ignore this postfix when checking for uniqueness with n_funds, i.e., count FSUSA00007_Switch and FSUSA00007 as only one FundId.
compute_additional_stats <- function(skill, value_creation_filtered) {
percentage_negative <- sprintf("%.2f%%", 100 * mean(skill$Average_Skill < 0, na.rm = TRUE))
n_funds <- format(length(unique(skill$FundId)), big.mark = ",", scientific = FALSE)
obs <- format(nrow(value_creation_filtered), big.mark = ",", scientific = FALSE)
list("Percent with less than zero" = percentage_negative, "Total number of funds" = n_funds,
"No. of obs." = obs)
}
To modify the compute_additional_stats function so that it ignores the "_Switch" postfix when counting unique FundIds, you can use the gsub function to strip the "_Switch" part from the FundId before counting unique values. Here's the updated function:
compute_additional_stats <- function(skill, value_creation_filtered) {
percentage_negative <- sprintf("%.2f%%", 100 * mean(skill$Average_Skill < 0, na.rm = TRUE))
# Remove "_Switch" from FundId and then count unique values
n_funds <- format(length(unique(gsub("_Switch", "", skill$FundId))), big.mark = ",", scientific = FALSE)
obs <- format(nrow(value_creation_filtered), big.mark = ",", scientific = FALSE)
list("Percent with less than zero" = percentage_negative, "Total number of funds" = n_funds,
"No. of obs." = obs)
}
This function will now correctly count FSUSA00007_Switch and FSUSA00007 as a single FundId.
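One small refinement to consider (a suggestion, not part of the original request): anchoring the pattern to the end of the string avoids rewriting a FundId that happens to contain "_Switch" elsewhere. The identifier FSUSA_Switch00042 below is made up purely to illustrate the difference:
ids <- c("FSUSA00007", "FSUSA00007_Switch", "FSUSA_Switch00042")
unique(gsub("_Switch", "", ids))   # collapses the first two, but also rewrites the third
unique(gsub("_Switch$", "", ids))  # "$" anchors the match to the end, leaving the third intact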
@autopilot some FundId's have "_Switch" as postfix. ignore this postfix when matching FundId's, i.e., count FSUSA00007_Switch and FSUSA00007 as only one FundId.
md_data <- md_data %>%
left_join(md_obj_unique, by = c("FundId", "Date"))
To modify the join operation so that it ignores the "_Switch" postfix when matching FundIds, you can create a temporary key in both data frames that strips off the "_Switch" part before performing the join. Here's the updated code:
library(dplyr)
# Create a temporary key in both data frames that excludes the "_Switch" postfix
md_data <- md_data %>%
mutate(FundId_key = gsub("_Switch", "", FundId))
md_obj_unique <- md_obj_unique %>%
mutate(FundId_key = gsub("_Switch", "", FundId))
# Perform the left join using the temporary key and Date
md_data <- md_data %>%
left_join(md_obj_unique, by = c("FundId_key" = "FundId_key", "Date"))
# Remove the temporary key column if it's no longer needed
md_data <- md_data %>%
select(-FundId_key)
This code treats FSUSA00007_Switch and FSUSA00007 as the same FundId when performing the join operation. After the join, the temporary key column is removed.
@autopilot some FundId's in md_data have "_Switch" as postfix. ignore this postfix when matching FundId's from md_obj_unique, i.e., count FSUSA00007_Switch and FSUSA00007 as only one FundId.
md_data <- md_data %>%
left_join(md_obj_unique, by = c("FundId", "Date"))
To modify the join operation so that it ignores the "_Switch" postfix when matching FundIds from md_obj_unique, you can create a temporary key in md_data that strips off the "_Switch" part and join it against the FundId column of md_obj_unique. Here's the updated code:
library(dplyr)
# Create a temporary key in md_data that excludes the "_Switch" postfix
md_data <- md_data %>%
mutate(FundId_key = gsub("_Switch", "", FundId))
# Perform the left join using the temporary key and Date
md_data <- md_data %>%
left_join(md_obj_unique, by = c("FundId_key" = "FundId", "Date"))
# Remove the temporary key column if it's no longer needed
md_data <- md_data %>%
select(-FundId_key)
This code treats FSUSA00007_Switch and FSUSA00007 as the same FundId when performing the join operation. After the join, the temporary key column is removed.
@autopilot instead of creating a list with 60-month observation windows, use the entire panel dataset. proceed with calculating scalability estimates per Date / FundId by regressing net_alpha ~ Neg_L1_Fund_Size_log + size_innovation on the prior 60 months of data per FundId. There must be at least 36 observations in the last 60 months per Date.
log_size <- adjusted_net_alpha
log_size <- log_size %>%
arrange(Date, FundId)
log_size$Fund_Size_log <- log(log_size$Fund_Size)
log_size$L1_Fund_Size_log <- log(log_size$L1_Fund_Size)
log_size$Fund_Size <- NULL
log_size$L1_Fund_Size <- NULL
log_size <- log_size %>% arrange(Date)
# Format the 60-month estimation window
min_valid_date <- min(log_size$Date) %m+% months(59)
valid_dates <- unique(log_size$Date[log_size$Date >= min_valid_date])
log_size_list <- lapply(valid_dates, function(current_date) {
start_date <- current_date %m-% months(59)
subset_data <- log_size %>%
filter(Date >= start_date & Date <= current_date)
return(subset_data)
})
# Remove funds with fewer than 36 observations in the window
log_size_list <- lapply(log_size_list, function(df) {
df %>%
group_by(FundId) %>%
mutate(Count = n()) %>%
filter(Count >= 36) %>%
dplyr::select(-Count) %>%
ungroup()
})
# Format the data
names(log_size_list) <- as.character(ceiling_date(valid_dates %m+% months(1), "month") - days(1))
log_size_list[[385]] <- NULL
# Compute the expectation of the scalability parameter following Van Binsbergen et al. (2023), pp. 30 f.
scalability_list <- lapply(log_size_list, function(df) {
df %>%
mutate(Neg_L1_Fund_Size_log = -L1_Fund_Size_log) %>%
group_by(FundId) %>%
do({
tryCatch(
tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = .)),
error = function(e) {
tibble(
term = NA_character_,
estimate = NA_real_,
std.error = NA_real_,
statistic = NA_real_,
p.value = NA_real_
)
}
)
}) %>%
ungroup()
})
scalability_full <- bind_rows(scalability_list, .id = "Date")
scalability_full$Date <- as.Date(scalability_full$Date)
To calculate scalability estimates per Date/FundId using the entire panel dataset, regressing net_alpha ~ Neg_L1_Fund_Size_log + size_innovation on the prior 60 months of data per FundId, with at least 36 observations in the last 60 months per Date, you can use the following R code:
library(dplyr)
library(lubridate)
library(broom)
library(purrr)  # for map_df()
# Assuming log_size is already defined and contains the necessary columns
log_size <- log_size %>%
arrange(Date, FundId) %>%
mutate(
Fund_Size_log = log(Fund_Size),
L1_Fund_Size_log = log(L1_Fund_Size),
Neg_L1_Fund_Size_log = -L1_Fund_Size_log
) %>%
select(-Fund_Size, -L1_Fund_Size)
# Define a function to perform the rolling regression
perform_rolling_regression <- function(df, current_date) {
start_date <- current_date %m-% months(59)
subset_data <- df %>%
filter(Date >= start_date & Date <= current_date) %>%
group_by(FundId) %>%
filter(n() >= 36) # Ensure at least 36 observations in the last 60 months
# Perform the regression for each FundId
regression_results <- subset_data %>%
do({
tryCatch(
tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = .)),
error = function(e) {
tibble(
term = NA_character_,
estimate = NA_real_,
std.error = NA_real_,
statistic = NA_real_,
p.value = NA_real_
)
}
)
}) %>%
ungroup() %>%
mutate(Date = current_date) # Add the current date to the results
return(regression_results)
}
# Calculate scalability estimates for each Date
scalability_estimates <- map_df(unique(log_size$Date), ~perform_rolling_regression(log_size, .x))
# The resulting scalability_estimates dataframe contains the regression results per Date/FundId
This code defines a function perform_rolling_regression that takes a dataframe and a date, filters the data to the 60 months leading up to that date, ensures there are at least 36 observations for each FundId, and then performs the regression. The map_df function from the purrr package applies this function to each unique date in the dataset and combines the results into a single dataframe, scalability_estimates.
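If only the scalability slope itself is needed downstream, a short sketch (assuming the column names produced by broom::tidy above) keeps one row per FundId and Date for the Neg_L1_Fund_Size_log term:
library(dplyr)
# Keep the scalability coefficient and its precision per FundId and Date
scalability_slopes <- scalability_estimates %>%
  filter(term == "Neg_L1_Fund_Size_log") %>%
  select(FundId, Date, estimate, std.error, p.value)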
@autopilot is this the right approach to ensure that size_innovation has a zero mean for FundId over all observations in each list element?
names(pSize_list) <- names(log_size_list)
innovator_list <- lapply(pSize_list, function(df) {
df <- df %>%
mutate(initial_innovation = Fund_Size_log - (PMUE * L1_Fund_Size_log))
return(df)
})
innovator_list <- lapply(innovator_list, function(df) {
df <- df %>%
group_by(FundId) %>%
mutate(mean_innovation = mean(initial_innovation, na.rm = TRUE)) %>%
ungroup()
return(df)
})
innovator_list <- lapply(innovator_list, function(df) {
df <- df %>%
mutate(size_innovation = initial_innovation - mean_innovation)
return(df)
})
The approach you've taken in the R code is correct for ensuring that size_innovation has a zero mean for each FundId over all observations within each list element. You calculate an initial innovation metric, compute the mean of this metric for each FundId, and then center the initial innovation by subtracting that mean to get size_innovation. This process results in size_innovation having a mean of zero for each FundId within each list element.
Here's a breakdown of the steps you've taken:
1. Compute initial_innovation for each observation as the difference between Fund_Size_log and PMUE * L1_Fund_Size_log.
2. Group by FundId and calculate the mean of initial_innovation for each FundId.
3. Subtract mean_innovation from initial_innovation to get size_innovation, which will have a mean of zero for each FundId.
The code is well-structured and uses the dplyr package effectively to perform the necessary transformations. If pSize_list and log_size_list have the same structure and PMUE is defined correctly, your approach should work as intended.
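As a quick numerical check (a sketch, assuming innovator_list as constructed above), the per-fund means of size_innovation in any list element should all be zero up to floating-point error:
library(dplyr)
# Per-FundId means of size_innovation in the first list element, largest deviations first
innovator_list[[1]] %>%
  group_by(FundId) %>%
  summarise(mean_size_innovation = mean(size_innovation, na.rm = TRUE)) %>%
  arrange(desc(abs(mean_size_innovation)))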
@autopilot extend and adjust the code 2 in R to integrate the code 1, i.e., the size_innovation calculation
Code 2:
log_size <- log_size %>%
arrange(Date, FundId) %>%
mutate(
Fund_Size_log = log(Fund_Size),
L1_Fund_Size_log = log(L1_Fund_Size),
Neg_L1_Fund_Size_log = -L1_Fund_Size_log
) %>%
select(-Fund_Size, -L1_Fund_Size)
# Define a function to perform the rolling regression
perform_rolling_regression <- function(df, current_date) {
start_date <- current_date %m-% months(59)
subset_data <- df %>%
filter(Date >= start_date & Date <= current_date) %>%
group_by(FundId) %>%
filter(n() >= 36) # Ensure at least 36 observations in the last 60 months
# Perform the regression for each FundId
regression_results <- subset_data %>%
do({
tryCatch(
tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = .)),
error = function(e) {
tibble(
term = NA_character_,
estimate = NA_real_,
std.error = NA_real_,
statistic = NA_real_,
p.value = NA_real_
)
}
)
}) %>%
ungroup() %>%
mutate(Date = current_date) # Add the current date to the results
return(regression_results)
}
# Calculate scalability estimates for each Date
scalability_estimates <- map_df(unique(log_size$Date), ~perform_rolling_regression(log_size, .x))
Code 1:
innovator_list <- lapply(pSize_list, function(df) {
df <- df %>%
mutate(initial_innovation = Fund_Size_log - (PMUE * L1_Fund_Size_log))
return(df)
})
innovator_list <- lapply(innovator_list, function(df) {
df <- df %>%
group_by(FundId) %>%
mutate(mean_innovation = mean(initial_innovation, na.rm = TRUE)) %>%
ungroup()
return(df)
})
innovator_list <- lapply(innovator_list, function(df) {
df <- df %>%
mutate(size_innovation = initial_innovation - mean_innovation)
return(df)
})
To integrate the size_innovation calculation from Code 1 into Code 2, you need to adjust the perform_rolling_regression function to include the steps for calculating initial_innovation, mean_innovation, and size_innovation. Here's the extended and adjusted Code 2:
library(dplyr)
library(lubridate)
library(purrr)
library(broom)
log_size <- log_size %>%
arrange(Date, FundId) %>%
mutate(
Fund_Size_log = log(Fund_Size),
L1_Fund_Size_log = log(L1_Fund_Size),
Neg_L1_Fund_Size_log = -L1_Fund_Size_log
) %>%
select(-Fund_Size, -L1_Fund_Size)
# Define a function to perform the rolling regression with size_innovation calculation
perform_rolling_regression <- function(df, current_date, PMUE) {
start_date <- current_date %m-% months(59)
subset_data <- df %>%
filter(Date >= start_date & Date <= current_date) %>%
group_by(FundId) %>%
filter(n() >= 36) # Ensure at least 36 observations in the last 60 months
# Calculate size_innovation within the rolling window
subset_data <- subset_data %>%
mutate(initial_innovation = Fund_Size_log - (PMUE * L1_Fund_Size_log)) %>%
group_by(FundId) %>%
mutate(mean_innovation = mean(initial_innovation, na.rm = TRUE)) %>%
mutate(size_innovation = initial_innovation - mean_innovation) %>%
ungroup()
# Perform the regression for each FundId
regression_results <- subset_data %>%
group_by(FundId) %>%   # regroup so that do() fits one regression per fund
do({
tryCatch(
tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = .)),
error = function(e) {
tibble(
term = NA_character_,
estimate = NA_real_,
std.error = NA_real_,
statistic = NA_real_,
p.value = NA_real_
)
}
)
}) %>%
ungroup() %>%
mutate(Date = current_date) # Add the current date to the results
return(regression_results)
}
# Assuming PMUE is defined and available
# Calculate scalability estimates for each Date
scalability_estimates <- map_df(unique(log_size$Date), ~perform_rolling_regression(log_size, .x, PMUE))
In this adjusted code, the perform_rolling_regression function now includes the steps to calculate size_innovation for each FundId within the rolling window of 60 months. The PMUE parameter is passed to the function and used in the calculation of initial_innovation. After calculating size_innovation, the function proceeds with the regression analysis as before. The map_df function is used to apply this updated rolling regression function to each unique date in the dataset.
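For debugging, a small sketch (assuming log_size, PMUE, and the function above are defined) is to run a single window before launching the full loop over all dates:
# Inspect one 60-month window, e.g. the most recent month in the sample
one_date <- max(log_size$Date)
one_window <- perform_rolling_regression(log_size, one_date, PMUE)
head(one_window)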
@autopilot scalability is to be calculated for each FundId Date combination, not just Date.
library(dplyr)
library(lubridate)
library(purrr)
library(broom)
log_size <- log_size %>%
arrange(Date, FundId) %>%
mutate(
Fund_Size_log = log(Fund_Size),
L1_Fund_Size_log = log(L1_Fund_Size),
Neg_L1_Fund_Size_log = -L1_Fund_Size_log
) %>%
select(-Fund_Size, -L1_Fund_Size)
# Define a function to perform the rolling regression with size_innovation calculation
perform_rolling_regression <- function(df, current_date, PMUE) {
start_date <- current_date %m-% months(59)
subset_data <- df %>%
filter(Date >= start_date & Date <= current_date) %>%
group_by(FundId) %>%
filter(n() >= 36) # Ensure at least 36 observations in the last 60 months
# Calculate size_innovation within the rolling window
subset_data <- subset_data %>%
mutate(initial_innovation = Fund_Size_log - (PMUE * L1_Fund_Size_log)) %>%
group_by(FundId) %>%
mutate(mean_innovation = mean(initial_innovation, na.rm = TRUE)) %>%
mutate(size_innovation = initial_innovation - mean_innovation) %>%
ungroup()
# Perform the regression for each FundId
regression_results <- subset_data %>%
do({
tryCatch(
tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = .)),
error = function(e) {
tibble(
term = NA_character_,
estimate = NA_real_,
std.error = NA_real_,
statistic = NA_real_,
p.value = NA_real_
)
}
)
}) %>%
ungroup() %>%
mutate(Date = current_date) # Add the current date to the results
return(regression_results)
}
# Assuming PMUE is defined and available
# Calculate scalability estimates for each Date
scalability_estimates <- map_df(unique(log_size$Date), ~perform_rolling_regression(log_size, .x, PMUE))
To calculate scalability for each FundId and Date combination, you need to adjust the code to iterate over each FundId and Date pair rather than just Date. Here's the updated code:
library(dplyr)
library(lubridate)
library(purrr)
library(broom)
log_size <- log_size %>%
arrange(Date, FundId) %>%
mutate(
Fund_Size_log = log(Fund_Size),
L1_Fund_Size_log = log(L1_Fund_Size),
Neg_L1_Fund_Size_log = -L1_Fund_Size_log
) %>%
select(-Fund_Size, -L1_Fund_Size)
# Define a function to perform the rolling regression with size_innovation calculation
perform_rolling_regression <- function(df, fund_id, current_date, PMUE) {
start_date <- current_date %m-% months(59)
subset_data <- df %>%
filter(FundId == fund_id & Date >= start_date & Date <= current_date) %>%
filter(n() >= 36) # Ensure at least 36 observations in the last 60 months
# Calculate size_innovation within the rolling window
subset_data <- subset_data %>%
mutate(initial_innovation = Fund_Size_log - (PMUE * L1_Fund_Size_log)) %>%
mutate(mean_innovation = mean(initial_innovation, na.rm = TRUE)) %>%
mutate(size_innovation = initial_innovation - mean_innovation)
# Perform the regression
regression_results <- tryCatch(
tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = subset_data)),
error = function(e) {
tibble(
term = NA_character_,
estimate = NA_real_,
std.error = NA_real_,
statistic = NA_real_,
p.value = NA_real_
)
}
)
return(regression_results %>% mutate(FundId = fund_id, Date = current_date))
}
# Assuming PMUE is defined and available
# Calculate scalability estimates for each FundId and Date combination
fund_dates <- unique(log_size %>% select(FundId, Date))
scalability_estimates <- map_df(1:nrow(fund_dates), function(i) {
with(fund_dates[i,], perform_rolling_regression(log_size, FundId, Date, PMUE))
})
# The resulting scalability_estimates dataframe contains the regression results per FundId and Date
In this updated code, we first create a data frame fund_dates that contains all unique combinations of FundId and Date. Then we use map_df to apply the perform_rolling_regression function to each row of fund_dates. The function now takes an additional parameter, fund_id, to filter the data to the specific FundId and Date combination. The regression is performed for each combination, and the results are combined into the scalability_estimates data frame.
@autopilot make this code more efficient:
library(dplyr)
library(lubridate)
library(purrr)
library(broom)
log_size <- log_size %>%
arrange(Date, FundId) %>%
mutate(
Fund_Size_log = log(Fund_Size),
L1_Fund_Size_log = log(L1_Fund_Size),
Neg_L1_Fund_Size_log = -L1_Fund_Size_log
) %>%
select(-Fund_Size, -L1_Fund_Size)
# Define a function to perform the rolling regression with size_innovation calculation
perform_rolling_regression <- function(df, fund_id, current_date, PMUE) {
start_date <- current_date %m-% months(59)
subset_data <- df %>%
filter(FundId == fund_id & Date >= start_date & Date <= current_date) %>%
filter(n() >= 36) # Ensure at least 36 observations in the last 60 months
# Calculate size_innovation within the rolling window
subset_data <- subset_data %>%
mutate(initial_innovation = Fund_Size_log - (PMUE * L1_Fund_Size_log)) %>%
mutate(mean_innovation = mean(initial_innovation, na.rm = TRUE)) %>%
mutate(size_innovation = initial_innovation - mean_innovation)
# Perform the regression
regression_results <- tryCatch(
tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = subset_data)),
error = function(e) {
tibble(
term = NA_character_,
estimate = NA_real_,
std.error = NA_real_,
statistic = NA_real_,
p.value = NA_real_
)
}
)
return(regression_results %>% mutate(FundId = fund_id, Date = current_date))
}
# Assuming PMUE is defined and available
# Calculate scalability estimates for each FundId and Date combination
fund_dates <- unique(log_size %>% select(FundId, Date))
scalability_estimates <- map_df(1:nrow(fund_dates), function(i) {
with(fund_dates[i,], perform_rolling_regression(log_size, FundId, Date, PMUE))
})
To make the code more efficient, consider the following optimizations:
1. Convert the data to a data.table for faster filtering and in-place computation.
2. Instead of using map_df with row indices, directly iterate over the unique FundId and Date combinations, and skip any combination with fewer than 36 observations before fitting the regression.
Here's the optimized code:
library(data.table)
library(lubridate)
library(broom)  # for tidy()
# Convert log_size to a data.table for efficient operations
setDT(log_size)
log_size[, `:=`(
Fund_Size_log = log(Fund_Size),
L1_Fund_Size_log = log(L1_Fund_Size),
Neg_L1_Fund_Size_log = -L1_Fund_Size_log
)]
log_size[, c("Fund_Size", "L1_Fund_Size") := NULL]
# Define a function to perform the rolling regression with size_innovation calculation
perform_rolling_regression <- function(subset_data, PMUE) {
# Calculate size_innovation within the rolling window
subset_data[, `:=`(
initial_innovation = Fund_Size_log - (PMUE * L1_Fund_Size_log),
mean_innovation = mean(initial_innovation, na.rm = TRUE),
size_innovation = initial_innovation - mean_innovation
)]
# Perform the regression
regression_results <- tryCatch(
as.data.table(tidy(lm(net_alpha ~ Neg_L1_Fund_Size_log + size_innovation, data = subset_data))),  # data.table so := works on the result
error = function(e) {
data.table(
term = NA_character_,
estimate = NA_real_,
std.error = NA_real_,
statistic = NA_real_,
p.value = NA_real_
)
}
)
return(regression_results)
}
# Assuming PMUE is defined and available
# Calculate scalability estimates for each FundId and Date combination
fund_dates <- unique(log_size[, .(FundId, Date)])
scalability_estimates <- rbindlist(lapply(seq_len(nrow(fund_dates)), function(i) {
current_date <- fund_dates$Date[i]
fund_id <- fund_dates$FundId[i]
start_date <- current_date %m-% months(59)
# Efficient rolling join to get the 60-month window for the current FundId and Date
subset_data <- log_size[FundId == fund_id & Date >= start_date & Date <= current_date]
if (nrow(subset_data) >= 36) {
regression_results <- perform_rolling_regression(subset_data, PMUE)
regression_results[, `:=`(FundId = fund_id, Date = current_date)]
return(regression_results)
} else {
return(NULL)
}
}), fill = TRUE)
# Convert scalability_estimates back to a data.frame if needed
scalability_estimates <- as.data.frame(scalability_estimates)
This optimized code uses data.table for faster data manipulation and performs the rolling regression only if there are at least 36 observations in the 60-month window. The rbindlist function combines the results, and the fill = TRUE argument ensures that all columns are present even if some regressions return NULL (due to insufficient observations).
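A further micro-optimisation to consider (a sketch, assuming the data.table log_size from above) is keying the table on FundId and Date, so that the repeated per-fund window subsets use the sorted key instead of scanning every row:
library(data.table)
# setkey() sorts log_size by FundId and Date and marks these columns as the key
setkey(log_size, FundId, Date)
# Inside the loop, the per-fund window can then be taken in two fast steps:
# fund_rows   <- log_size[.(fund_id)]                              # keyed lookup on FundId
# fund_window <- fund_rows[Date >= start_date & Date <= current_date]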
@autopilot the outcome Gross_Alpha for the following code is correct for equal-weighting (the mean of the mean of every FundId) but wrong for value-weighting (number of observations per FundId * mean Gross_Alpha of FundId / Sum of observations across all FundIds). Comment on why that is:
library(tidyr)
library(broom)
library(dplyr)
library(stringr)
library(purrr)
library(lubridate)
library(zoo)
library(lmtest)
md_final <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/MS/md_final.csv")
md_final$Date <- as.Date(md_final$Date)
md_final <- md_final[,2:ncol(md_final)]
md_funds <- md_final
md_funds$Date <- as.Date(md_funds$Date)
md_funds$Date <- ceiling_date(md_funds$Date, "month") - days(1)
md_funds <- md_funds[,c(1,2,18)]
#WRDS Fama-French 5 Factors Plus Momentum - Monthly Frequency
rf <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/External Data/rf.csv")
rf$date <- as.Date(rf$date)
rf$date <- ceiling_date(rf$date, "month") - days(1)
rf$dateff <- NULL
rf <- rf %>%
distinct(date, .keep_all = TRUE)
sorted_funds <- md_funds %>%
arrange(Date, FundId)
sorted_funds <- sorted_funds %>%
left_join(rf, by = c("Date" = "date"))
sorted_funds <- sorted_funds %>%
mutate(Gross_Excess_Return = Gross_Return - rf)
sorted_funds$Gross_Return <- NULL
sorted_funds$rf <- NULL
#####
sorted_funds <- sorted_funds %>%
filter(year(Date) <= 2014)
#####
funds <- sorted_funds %>%
pivot_wider(names_from = FundId, values_from = Gross_Excess_Return)
funds$Date <- as.character(funds$Date)
fund_return <- as.matrix(funds[,2:ncol(funds)])
#WRDS CRSP Mutual Fund Database
index <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/External Data/Vanguard.csv")
index$Index_Return <- index$mret
index$mret <- NULL
index$caldt <- as.Date(index$caldt)
index$caldt <- ceiling_date(index$caldt, "month") - days(1)
index <- index[,c(1,2,4)]
sorted_index <- index %>%
arrange(caldt, ticker)
index_mapping <- setNames(c("index_1", "index_2", "index_3", "index_4", "index_5", "index_6",
"index_7", "index_8", "index_9", "index_10", "index_11"),
c("VFINX", "VEXMX", "NAESX", "VEURX", "VPACX", "VVIAX",
"VBINX", "VEIEX", "VIMSX", "VISGX", "VISVX"))
sorted_index <- sorted_index %>%
mutate(index_no = index_mapping[as.character(ticker)])
sorted_index <- sorted_index %>%
left_join(rf, by = c("caldt" = "date"))
sorted_index$Index_Return <- as.numeric(sorted_index$Index_Return)
sorted_index <- sorted_index %>%
mutate(Excess_Index = Index_Return - rf)
sorted_index <- sorted_index[,c(4,2,6)]
#####
sorted_index <- sorted_index %>%
filter(year(caldt) <= 2014)
#####
sorted_index <- na.omit(sorted_index)
benchmark <- sorted_index %>%
pivot_wider(names_from = index_no, values_from = Excess_Index)
benchmark[, 2:12] <- lapply(benchmark[, 2:12], as.numeric)
fund_return_list <- lapply(seq_len(ncol(fund_return)), function(i) {
return_vector <- fund_return[, i]
names(return_vector) <- rownames(fund_return)
return(return_vector)
})
names(fund_return_list) <- colnames(fund_return)
# Create a matrix "X" with 12 columns (intercept, index_1 to index_11)
X <- as.matrix(cbind(1, benchmark[2:12]))
###### X0_hat ######
# Orthogonalize without the intercept
X0_hat <- X[, 2:12]
# Orthogonalization procedure
orthogonalize_variable <- function(variable, previous_variables_matrix) {
# Index of the first non-NA value
first_obs_variable <- which(!is.na(variable))[1]
# Time-series index (first to last observation of the current variable n)
valid_length_variable <- first_obs_variable:length(variable)
# Time-series index (first to last observation of the previous, already orthogonalized variables n-1)
valid_length_previous_vars <- first_obs_variable:nrow(previous_variables_matrix)
# Restrict the subsample to the period of n
if (length(valid_length_previous_vars) > length(valid_length_variable)) {
valid_length_previous_vars <- valid_length_previous_vars[1:length(valid_length_variable)]
}
# Linear projection of n on n-1
fit <- lm(variable[valid_length_variable] ~ previous_variables_matrix[valid_length_previous_vars, , drop = FALSE])
# The sum of the residuals and the constant (intercept) gives the values of the nth orthogonalized fund (BvB, p. 19)
intercept <- coef(fit)[1]
residuals_with_mean <- residuals(fit) + intercept
# Replace the old values with the new ones
variable[valid_length_variable] <- residuals_with_mean
return(variable)
}
# Run the sequential orthogonalization
for (j in 2:ncol(X0_hat)) {
previous_variables_matrix <- X0_hat[, 1:(j - 1), drop = FALSE]
X0_hat[, j] <- orthogonalize_variable(X0_hat[, j], previous_variables_matrix)
}
# Replace NAs with 0
X0_hat[is.na(X0_hat)] <- 0
# Add the intercept
X0_hat <- cbind(1, X0_hat)
###### X0 ######
# Procedure to replace the NAs with the means of the orthogonalized returns
impute_with_mean <- function(X) {
for (j in 1:ncol(X)) {
# Calculate mean excluding zeros and NAs
column_mean <- mean(X[X[, j] != 0, j], na.rm = TRUE)
# Replace zeros with the column mean
X[X[, j] == 0, j] <- column_mean
}
return(X)
}
# Apply the mean imputation
X0 <- impute_with_mean(X0_hat)
###### Regression ######
# Convert the X0 matrix to a dataframe
X0_df = as.data.frame(X0)
X0_df$Date <- funds$Date
X0_df$Date <- as.Date(X0_df$Date)
# Merge fund and benchmark data
X0_merged <- merge(sorted_funds, X0_df, by = "Date", all.x = TRUE)
# Compute the coefficients via OLS regression
results <- X0_merged %>%
group_by(FundId) %>%
do(tidy(lm(Gross_Excess_Return ~ V1 + index_1 + index_2 + index_3 + index_4 + index_5 +
index_6 + index_7 + index_8 + index_9 + index_10 + index_11 - 1 , data = .))) %>%
ungroup()
# Build the coefficient dataframe
fund_coefficients <- results %>%
pivot_wider(names_from = term, values_from = estimate, id_cols = FundId) %>%
rename_with(~ gsub("^\\(Intercept\\)$", "V1", .x), .cols = everything()) %>%
dplyr::select(FundId, V1, starts_with("index_"))
# # Procedure for computing the coefficients
# fit_model <- function(returns) {
# data_for_model <- data.frame(fund_return = returns, X0)
# model <- lm(fund_return ~ . - 1, data = data_for_model)
# return(coefficients(model))
# }
#
# # Compute the coefficients
# models_list <- lapply(fund_return_list, fit_model)
#
# # Build the coefficient dataframe
# fund_coefficients <- data.frame(do.call(rbind, models_list))
# fund_coefficients$FundId <- names(models_list)
# colnames(fund_coefficients)[1] <- "intercept"
# new_names <- paste("index", 1:11, sep="_")
# colnames(fund_coefficients)[2:12] <- new_names
###### Benchmark-Mapping ######
# Convert the matrix X0_hat into a dataframe
X0_hat_data <- data.frame(X0_hat)
X0_hat_data <- X0_hat_data[,2:12]
X0_hat_data$Date <- funds$Date
X0_hat_data$Date <- as.Date(X0_hat_data$Date)
# Assign orthogonalized benchmark returns and regression coefficients to each observation (FundId - Date)
fund_benchmark_returns <- md_funds[,1:2]
fund_benchmark_returns <- fund_benchmark_returns %>%
left_join(fund_coefficients, by = "FundId")
fund_benchmark_returns$intercept <- NULL
fund_benchmark_returns <- merge(X0_hat_data, fund_benchmark_returns, by = "Date", all = TRUE)
fund_benchmark_returns <- fund_benchmark_returns[,c(13,1,2:12,15:25)]
# Compute the specific benchmark returns (coefficients * orthogonalized benchmark returns) per observation
index_x_cols <- grep("index_[0-9]+\\.x", names(fund_benchmark_returns), value = TRUE)
index_y_cols <- gsub("\\.x$", ".y", index_x_cols)
new_index_cols <- gsub("\\.x$", "", gsub("^index_", "", index_x_cols))
for (i in seq_along(index_x_cols)) {
x_col <- index_x_cols[i]
y_col <- index_y_cols[i]
new_col <- new_index_cols[i]
if (y_col %in% names(fund_benchmark_returns)) {
fund_benchmark_returns[[new_col]] <- fund_benchmark_returns[[x_col]] * fund_benchmark_returns[[y_col]]
}
}
fund_benchmark_returns <- fund_benchmark_returns[,c(2,1, 25:35)]
fund_benchmark_returns <- fund_benchmark_returns %>%
rename_with(~ paste("index", 1:11, sep = "_"), .cols = 3:13)
fund_benchmark_returns$benchmark_return <- rowSums(fund_benchmark_returns[, 3:13], na.rm = TRUE)
###### Benchmark-Mapping ######
# Match benchmark returns to gross excess returns
adjusted_gross_alpha <- left_join(fund_benchmark_returns, sorted_funds[, c("FundId", "Date", "Gross_Excess_Return")],
by = c("FundId", "Date"))
adjusted_gross_alpha <- adjusted_gross_alpha[,c(1,2,15,14)]
# Compute the gross alphas (gross excess return minus the sum of the benchmark returns)
adjusted_gross_alpha <- adjusted_gross_alpha %>%
filter(!is.na(Gross_Excess_Return))
adjusted_gross_alpha$Gross_Alpha <- adjusted_gross_alpha$Gross_Excess_Return - adjusted_gross_alpha$benchmark_return
adjusted_gross_alpha <- adjusted_gross_alpha[,c(1:2,5,3,4)]
The issue with the value-weighting calculation of Gross_Alpha in the provided code is that the code never performs a value-weighted average based on the number of observations per FundId. It only calculates Gross_Alpha by subtracting the benchmark return from the gross excess return for each FundId and Date combination. Averaging those per-fund means directly treats each FundId equally, regardless of its number of observations or its size, and therefore yields an equal-weighted measure.
To calculate a value-weighted Gross_Alpha, you would need to:
1. Compute a weight for each FundId based on its number of observations (or the size of the fund).
2. Multiply the mean Gross_Alpha of each FundId by its corresponding weight.
3. Sum the weighted Gross_Alpha values across all FundIds.
4. Divide by the total weight across all FundIds to get the value-weighted average Gross_Alpha.
The code does not include these steps, which is why the value-weighting calculation is incorrect. To correct this, you would need to add the appropriate weighting logic on top of the observation-level Gross_Alpha. Here is an example of how you might implement the value-weighting calculation:
# Collapse to one row per FundId: its number of observations and its mean Gross_Alpha
fund_stats <- adjusted_gross_alpha %>%
group_by(FundId) %>%
summarise(number_of_observations = n(),
mean_gross_alpha = mean(Gross_Alpha, na.rm = TRUE))
# Calculate the total number of observations across all FundIds
total_weight <- sum(fund_stats$number_of_observations, na.rm = TRUE)
# Weight each fund's mean Gross_Alpha by its share of total observations
fund_stats <- fund_stats %>%
mutate(weighted_gross_alpha = (number_of_observations * mean_gross_alpha) / total_weight)
# Sum the weighted values to get the value-weighted average Gross_Alpha
value_weighted_gross_alpha <- sum(fund_stats$weighted_gross_alpha, na.rm = TRUE)
This code snippet adds the necessary steps to calculate the value-weighted average Gross_Alpha. It first collapses the data to one row per FundId, containing that fund's number of observations and its mean Gross_Alpha, then weights each fund's mean by its share of the total number of observations, and finally sums these weighted values to obtain the overall value-weighted average.
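Equivalently, a compact cross-check (a sketch, assuming the fund_stats frame built above) is base R's weighted.mean, which also makes the contrast with the equal-weighted figure explicit:
# Equal-weighted: a simple mean of the fund-level means
equal_weighted_gross_alpha <- mean(fund_stats$mean_gross_alpha, na.rm = TRUE)
# Value-weighted: fund-level means weighted by each fund's number of observations
value_weighted_check <- weighted.mean(fund_stats$mean_gross_alpha,
                                      w = fund_stats$number_of_observations,
                                      na.rm = TRUE)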
@autopilot implement a panel regression approach for the following regression. Keep the fixed effects and the clustering: