LilMichelangelo2 / Amanta

test #6

Open LilMichelangelo2 opened 2 months ago

LilMichelangelo2 commented 2 months ago

Hey @autopilot, create an empty matrix in R

codeautopilot[bot] commented 2 months ago

🤖 I'm working on a solution for this issue. Please don't create new issues or edit this one until I reply back.

LilMichelangelo2 commented 2 months ago

@autopilot i want the fixed effects for both Fund_Family and Date_Prospectus_Objective

expected_regression <- feols(Transformed_estimated_value_ddded ~ L1_team + IMR + L1_Fund_Age + L1_Expense_Ratio

codeautopilot[bot] commented 2 months ago

To extract the fixed effects for both Fund_Family and Date_Prospectus_Objective from your feols regression model in R, you can use the fixef function from the {fixest} package, which is likely the package you're using given the syntax of your feols function. Here's how you can do it:

# Assuming your model is already fitted and named expected_regression
fixed_effects <- fixef(expected_regression)

# The fixef function returns a list of fixed effects, you can access the fixed effects for Fund_Family and Date_Prospectus_Objective like this:
fund_family_effects <- fixed_effects$Fund_Family
date_prospectus_objective_effects <- fixed_effects$Date_Prospectus_Objective

# To view the fixed effects
print(fund_family_effects)
print(date_prospectus_objective_effects)

If you want to save these fixed effects to a CSV file, you can use the write.csv function:

write.csv(fund_family_effects, "fund_family_effects.csv")
write.csv(date_prospectus_objective_effects, "date_prospectus_objective_effects.csv")

This will create two CSV files with the fixed effects for Fund_Family and Date_Prospectus_Objective, respectively.

LilMichelangelo2 commented 2 months ago

@autopilot correct and improve: implement Fund_Family fixed effects and Date x Prospectus..Objective interaction fixed effects

expected_regression <- feols(Transformed_estimated_value_ddded ~ L1_team + IMR + L1_Fund_Age + L1_Expense_Ratio
+ L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow
+ L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
| Fund_Family + Date_Prospectus_Objective,
data = expected_regression_data,
cluster = c("FundId", "Date"))
codeautopilot[bot] commented 2 months ago

To implement fixed effects for Fund_Family and for the Date x Prospectus_Objective combination, you need to modify the fixed-effects part of your feols call (the part after |). In {fixest}, two variables are combined into a single fixed effect with the ^ operator, not with :. Prospectus_Objective below stands for the objective column in your data; use its actual name. Here's the corrected and improved code:

expected_regression <- feols(Transformed_estimated_value_ddded ~ L1_team + IMR + L1_Fund_Age + L1_Expense_Ratio
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
                             | Fund_Family + Date^Prospectus_Objective,
                             data = expected_regression_data,
                             cluster = c("FundId", "Date"))

In this code, Fund_Family is included as a fixed effect, and Date^Prospectus_Objective adds one fixed effect for each combination of Date and Prospectus_Objective. If your existing Date_Prospectus_Objective column already holds one distinct value per combination, it is equivalent.
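As a rough illustration of what "one fixed effect per combination" means, the sketch below builds the combined key by hand in base R. The data frame toy and its columns are made up for the example and are not part of your data.

```r
# Toy data: two dates, two objectives, but only three distinct pairs occur
toy <- data.frame(
  Date = as.Date(c("2020-01-31", "2020-01-31", "2020-02-29", "2020-02-29")),
  Objective = c("Growth", "Value", "Growth", "Growth")
)

# One factor level per observed (Date, Objective) pair; drop = TRUE removes unused pairs
toy$date_by_objective <- interaction(toy$Date, toy$Objective, drop = TRUE)

# Number of distinct groups that would each get their own fixed effect
n_groups <- nlevels(toy$date_by_objective)
print(n_groups)
print(table(toy$date_by_objective))
```

Each level of the combined factor is one group that would receive its own constant in the regression.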

LilMichelangelo2 commented 2 months ago

@autopilot add two fund fixed effects, A and B

expected_regression <- feols(Transformed_estimated_value_ddded ~ L1_team + IMR + L1_Fund_Age + L1_Expense_Ratio

codeautopilot[bot] commented 2 months ago

To add two fund fixed effects, A and B, to your feols regression model, you can include them in the fixed effects part of the formula using the + operator. Here's how you can modify your code:

expected_regression <- feols(Transformed_estimated_value_ddded ~ L1_team + IMR + L1_Fund_Age + L1_Expense_Ratio
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
                             | Fund_Family + Date^Prospectus_Objective + A + B,
                             data = expected_regression_data,
                             cluster = c("FundId", "Date"))

In this updated code, A and B are added as additional fixed effects. Make sure that A and B are columns in your expected_regression_data dataframe that represent the fund fixed effects you want to control for.

LilMichelangelo2 commented 2 months ago

@autopilot include an intercept


expected_regression <- feols(Transformed_estimated_value_ddded ~ L1_team + IMR + L1_Fund_Age + L1_Expense_Ratio 
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
                             | Fund_Family + TIMExOBJECTIVE,
                             data = expected_regression_data,
                             cluster = c("FundId", "Date"), se = "cluster")
codeautopilot[bot] commented 2 months ago

In the `feols` function from the `{fixest}` package, an intercept is included by default only when the model has no fixed effects. Your formula has fixed effects after the `|` (`Fund_Family` and `TIMExOBJECTIVE`), and those absorb the intercept: every fixed-effect level acts as its own constant, so a separate overall intercept is not identified and is not reported. Adding a `1` to the formula does not change this. Adding `-1` or `0` to exclude the intercept only matters in a model without fixed effects.

Here is your model with the `1` written explicitly. It is optional and does not add an intercept to the output:

```r
expected_regression <- feols(Transformed_estimated_value_ddded ~ 1 + L1_team + IMR + L1_Fund_Age + L1_Expense_Ratio 
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
                             | Fund_Family + TIMExOBJECTIVE,
                             data = expected_regression_data,
                             cluster = c("FundId", "Date"), se = "cluster")
```

If you need the absorbed constants, you can retrieve the estimated fixed effects with `fixef(expected_regression)`.
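A small base R sketch of why a separate intercept carries no extra information once a full set of group constants is in the model. It uses lm with simulated data rather than feols, and every name in it is made up for the illustration.

```r
set.seed(1)

# Simulated data: three groups with different levels and a common slope of 2
g <- factor(rep(c("a", "b", "c"), each = 20))
group_level <- c(a = 1, b = 3, c = 5)
x <- rnorm(60)
y <- 2 * x + unname(group_level[as.character(g)]) + rnorm(60, sd = 0.1)

# Same model written with and without an explicit intercept
with_intercept <- lm(y ~ x + g)
without_intercept <- lm(y ~ 0 + x + g)

# The slope and the fitted values are identical; only the labelling of the constants differs
slope_with <- unname(coef(with_intercept)["x"])
slope_without <- unname(coef(without_intercept)["x"])
print(c(slope_with, slope_without))
```

The two fits differ only in how the group constants are parameterised, which is why a fixed-effects estimator has nothing separate to report as "the" intercept.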

LilMichelangelo2 commented 2 months ago

@autopilot in R: for the date variable md$Inception, turn it to the last day of the last month, if the current day < 15, turn it to the last day of the current month if day >= 15

codeautopilot[bot] commented 2 months ago

To adjust the md$Inception date variable in R according to your specifications, you can use the lubridate package, which makes date manipulation easier. Here's how you can do it:

library(lubridate)

# Function to adjust a single date according to its day of the month
adjust_date <- function(date) {
  # First day of the month that contains the date
  month_start <- floor_date(date, "month")
  if (day(date) < 15) {
    # If the day is less than 15, set to the last day of the previous month
    return(month_start - days(1))
  } else {
    # If the day is 15 or more, set to the last day of the current month
    return(month_start + months(1) - days(1))
  }
}

# Apply the function to the Inception column.
# sapply() returns plain day counts, so convert the result back to Date.
md$Inception <- as.Date(sapply(md$Inception, adjust_date), origin = "1970-01-01")

This code defines a function adjust_date that takes a date and checks the day of the month. If the day is less than 15, it returns the last day of the previous month. If the day is 15 or greater, it returns the last day of the current month. Both results are computed from the first day of the month, because subtracting months(1) from a month-end date returns NA whenever the target day does not exist (for example 31 March minus one month). The sapply function applies the adjustment to each date in the md$Inception column, and as.Date restores the Date class that sapply drops.
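One thing worth checking with month arithmetic in lubridate: subtracting a months() period from a month-end date can land on a day that does not exist. The sketch below shows that case and a way around it that starts from the first day of the month. The sample dates are made up.

```r
library(lubridate)

# 31 March minus one month would be 31 February, which does not exist
month_end <- as.Date("2024-03-31")
shifted <- month_end - months(1)
print(shifted)

# Starting from the first of the month avoids the problem
d <- as.Date(c("2024-03-10", "2024-03-15", "2024-01-05", NA))
month_start <- floor_date(d, "month")
prev_month_end <- month_start - days(1)              # last day of the previous month
curr_month_end <- month_start + months(1) - days(1)  # last day of the current month
print(data.frame(d, prev_month_end, curr_month_end))
```

Missing dates simply stay NA in both derived columns.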

LilMichelangelo2 commented 2 months ago

@autopilot ignore NA data:


library(lubridate)

# Function to adjust the date according to the day of the month
adjust_date <- function(date) {
  if (day(date) < 15) {
    # If the day is less than 15, set to the last day of the previous month
    return(ceiling_date(date, "month") - days(1) - months(1))
  } else {
    # If the day is 15 or more, set to the last day of the current month
    return(ceiling_date(date, "month") - days(1))
  }
}

# Apply the function to the Inception column
md$Inception <- sapply(md$Inception, adjust_date)
codeautopilot[bot] commented 2 months ago

To modify the `adjust_date` function to ignore `NA` values, you can add a condition to check for `NA` before performing any operations. Here's the updated code:

```r
library(lubridate)

# Function to adjust a single date according to its day of the month, passing NA through
adjust_date <- function(date) {
  if (is.na(date)) {
    # Return a missing Date without any changes
    return(as.Date(NA))
  }
  # First day of the month that contains the date
  month_start <- floor_date(date, "month")
  if (day(date) < 15) {
    # If the day is less than 15, set to the last day of the previous month
    return(month_start - days(1))
  } else {
    # If the day is 15 or more, set to the last day of the current month
    return(month_start + months(1) - days(1))
  }
}

# Apply the function to the Inception column; sapply() drops the Date class, so restore it
md$Inception <- as.Date(sapply(md$Inception, adjust_date), origin = "1970-01-01")
```

This code handles NA values by returning them as missing dates without attempting to adjust them. The month ends are computed from the first day of the month, which avoids the NA that lubridate returns when months(1) is subtracted from a month-end date such as 31 March.
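A side effect to be aware of when applying a function to dates with sapply: the result comes back as plain numbers (days since 1970-01-01) rather than as Date values. A minimal base R sketch with made-up dates:

```r
dates <- as.Date(c("2024-03-10", NA, "2024-03-20"))

# sapply() simplifies the list of results and drops the Date class
via_sapply <- sapply(dates, function(d) d)
print(class(via_sapply))

# Option 1: convert the day counts back to Date
restored <- as.Date(via_sapply, origin = "1970-01-01")

# Option 2: keep a list and combine it with c(), which preserves the class
via_lapply <- do.call(c, lapply(dates, function(d) d))
print(restored)
print(via_lapply)
```

Either option gives back a proper Date column that can be used in later date calculations.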

LilMichelangelo2 commented 2 months ago

@autopilot is there a more efficient implementation?


adjust_date <- function(date) {
  if (is.na(date)) {
    return(NA)
  } else if (day(date) < 15) {
    return(ceiling_date(date, "month") - days(1) - months(1))
  } else {
    return(ceiling_date(date, "month") - days(1))
  }
}

md$Inception <- sapply(md$Inception, adjust_date)
codeautopilot[bot] commented 2 months ago

Yes, there is a more efficient way to implement this by vectorizing the function with `lubridate` functions and logical indexing, which avoids the need for `sapply`. Base `ifelse` is not used here because it drops the Date class and returns plain numbers. Here's an improved version:

```r
library(lubridate)

# Vectorized function to adjust the dates according to the day of the month
adjust_date_vectorized <- function(dates) {
  # First day of each date's month; NA stays NA
  month_start <- floor_date(dates, "month")
  last_day_prev_month <- month_start - days(1)
  last_day_curr_month <- month_start + months(1) - days(1)

  # Start from the current month end, then overwrite the dates that fall before the 15th
  adjusted_dates <- last_day_curr_month
  early <- !is.na(dates) & day(dates) < 15
  adjusted_dates[early] <- last_day_prev_month[early]

  return(adjusted_dates)
}

# Apply the vectorized function to the Inception column
md$Inception <- adjust_date_vectorized(md$Inception)
```

This implementation should be faster, especially for larger datasets, because it avoids the loop-like behavior of sapply and operates on the entire vector of dates at once. NA values pass through unchanged and the result keeps the Date class.
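When vectorizing a date rule, note that base ifelse returns plain numbers when its branches are Date values. The sketch below shows the effect and a logical-indexing alternative that keeps the class. The dates and the two month-end values are made up for the example.

```r
dates <- as.Date(c("2024-03-10", "2024-03-20", NA))
prev_end <- as.Date("2024-02-29")
curr_end <- as.Date("2024-03-31")
day_of_month <- as.integer(format(dates, "%d"))

# ifelse() builds its result from the logical test vector, so the Date class is lost
via_ifelse <- ifelse(day_of_month < 15, prev_end, curr_end)
print(class(via_ifelse))

# Logical indexing on a Date vector keeps the class
adjusted <- rep(curr_end, length(dates))
early <- !is.na(dates) & day_of_month < 15
adjusted[early] <- prev_end
adjusted[is.na(dates)] <- NA
print(adjusted)
```

dplyr::if_else is another option if dplyr is already loaded, since it preserves the type of its branches.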

LilMichelangelo2 commented 2 months ago

@autopilot i want to have the month difference between two dates

md$Fund_Age <- difftime(md$Date, md$Inception, units = "months") / 30.44 
md$Fund_Age <- round(md$Fund_Age, digits = 0)  
codeautopilot[bot] commented 2 months ago

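To calculate the month difference between two dates in R, you can use the lubridate package, which handles date differences in calendar months. Note that difftime() does not accept units = "months" (its largest unit is "weeks"), so the call in your snippet fails before the division by 30.44 is reached. Here's how you can do it: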

library(lubridate)

# Calculate the difference in months
md$Fund_Age <- interval(md$Inception, md$Date) / months(1)

# Round the result to the nearest whole number
md$Fund_Age <- round(md$Fund_Age, digits = 0)

This code uses the interval function to create a time interval between the Inception and Date columns, and then divides this interval by months(1) to get the difference in months. The result is then rounded to the nearest whole number. This method does not rely on an approximate number of days per month and should provide a more accurate result.
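If only whole calendar months are needed, for instance because both dates are month ends, the difference can also be computed from the year and month fields in base R. The helper months_between below is a hypothetical name introduced for this sketch, and the sample dates are made up.

```r
# Hypothetical helper: whole calendar months from start to end, ignoring the day of the month
months_between <- function(start, end) {
  s <- as.POSIXlt(start)
  e <- as.POSIXlt(end)
  12 * (e$year - s$year) + (e$mon - s$mon)
}

inception <- as.Date(c("2019-12-31", "2020-01-31", NA))
obs_date <- as.Date(c("2020-03-31", "2020-01-31", "2020-06-30"))

fund_age <- months_between(inception, obs_date)
print(fund_age)
```

Because the day of the month is ignored, this counts month boundaries crossed and needs no rounding step. A missing date on either side gives NA.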

LilMichelangelo2 commented 2 months ago

@autopilot throw out all observations before md$Date >= md$Inception

md$Inception..Date <- as.Date(md$Inception..Date)
md$Inception <- as.Date(md$Inception..Date)

md$Fund_Age <- interval(md$Inception, md$Date) / months(1)
md$Fund_Age <- round(md$Fund_Age, digits = 0)
codeautopilot[bot] commented 2 months ago

To filter out all observations where md$Date is before md$Inception, you can use the subset function or the dplyr package. Here's how you can do it with base R:

# Convert to Date class if not already
md$Inception <- as.Date(md$Inception)
md$Date <- as.Date(md$Date)

# Filter out observations where Date is before Inception
md <- subset(md, Date >= Inception)

And here's how you can do it using dplyr:

library(dplyr)

# Ensure columns are of Date class
md <- md %>%
  mutate(Inception = as.Date(Inception),
         Date = as.Date(Date))

# Filter out observations where Date is before Inception
md <- md %>%
  filter(Date >= Inception)

Both of these methods remove the rows of md where Date is earlier than Inception. They also drop rows where either date is NA, because the comparison is then NA rather than TRUE.
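How missing dates are treated depends on the filtering method. The base R sketch below uses a made-up data frame md_toy to compare subset with plain bracket indexing.

```r
md_toy <- data.frame(
  Date = as.Date(c("2020-01-31", "2020-02-29", "2020-03-31")),
  Inception = as.Date(c("2020-02-29", "2020-01-31", NA))
)

# subset() treats an NA comparison as "do not keep"
kept_subset <- subset(md_toy, Date >= Inception)

# Bracket indexing with an NA condition returns an extra all-NA row
kept_index <- md_toy[md_toy$Date >= md_toy$Inception, ]

# Wrapping the condition in which() drops the NA case as well
kept_which <- md_toy[which(md_toy$Date >= md_toy$Inception), ]

print(nrow(kept_subset))
print(nrow(kept_index))
print(nrow(kept_which))
```

If rows with a missing Inception should be kept instead, the condition would need an explicit is.na(Inception) branch.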

LilMichelangelo2 commented 2 months ago

@autopilot why cant i get estimates for the intercept?

expected_regression <- feols(Transformed_estimated_value_ddded ~ 1 + L1_team + IMR + log(L1_Fund_Age) + L1_Expense_Ratio 
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female 
                             + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                             data = expected_regression_data, cluster = c("FundId", "Date"))
codeautopilot[bot] commented 2 months ago

feols does not report an intercept for this model because of the fixed effects:

  1. Fixed Effects Absorbing the Intercept: When you include fixed effects in your model, such as Fund_Family and TIMExOBJECTIVE, the intercept is absorbed by them. Each fixed-effect level acts as its own intercept, so the overall intercept is omitted from the output.

  2. Model Specification: Writing 1 + in the formula does not change this. A constant is perfectly collinear with the full set of Fund_Family (or TIMExOBJECTIVE) dummies, so it cannot be estimated separately from them.

  3. Software Output: {fixest} therefore leaves the intercept out of the coefficient table whenever fixed effects are present. This is expected behaviour, not a sign that the estimation failed.

  4. Data Issues: A missing intercept on its own does not point to a data problem. Collinearity among the regressors is reported separately, in a message that names the removed variables.

To look at the absorbed constants, you can extract the estimated fixed effects with fixef(expected_regression). If you need a single reported intercept, the model has to be estimated without fixed effects, which changes the specification.

If you need further assistance, please provide additional details about the output you are receiving and any error messages or warnings that may be occurring.

LilMichelangelo2 commented 2 months ago

@autopilot in R, add three dummy variables to md_data: small_cap = 1 when in the lowest F1_Fund_Size tercile, otherwise 0. mid_cap = 1 when in the second F1_Fund_Size tercile, otherwise 0. large_cap = 1 when in the highest F1_Fund_Size tercile, otherwise 0.

codeautopilot[bot] commented 2 months ago

To create three dummy variables (small_cap, mid_cap, large_cap) based on the terciles of the F1_Fund_Size variable in the md_data dataframe in R, you can use the quantile function to find the tercile thresholds and then use logical comparisons to create the dummy variables. Here's how you can do it:

# Calculate the terciles for F1_Fund_Size
terciles <- quantile(md_data$F1_Fund_Size, probs = c(1/3, 2/3), na.rm = TRUE)

# Create dummy variables based on the terciles
md_data$small_cap <- as.integer(md_data$F1_Fund_Size <= terciles[1])
md_data$mid_cap <- as.integer(md_data$F1_Fund_Size > terciles[1] & md_data$F1_Fund_Size <= terciles[2])
md_data$large_cap <- as.integer(md_data$F1_Fund_Size > terciles[2])

# View the first few rows of the dataframe to confirm the dummy variables are added
head(md_data)

This code will add the small_cap, mid_cap, and large_cap columns to the md_data dataframe, with values of 1 or 0 according to the tercile in which each observation's F1_Fund_Size falls. The as.integer function is used to ensure that the logical comparisons result in integer values (1 or 0) rather than logical values (TRUE or FALSE).
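A quick way to sanity-check tercile dummies is to confirm that every observation falls into exactly one group. The sketch below does this on a made-up size vector and also shows the same split as a single factor via cut.

```r
size <- c(5, 1, 9, 3, 7, 2)

# Tercile thresholds (33.3% and 66.7% quantiles)
cuts <- quantile(size, probs = c(1/3, 2/3), na.rm = TRUE)

small <- as.integer(size <= cuts[1])
mid   <- as.integer(size > cuts[1] & size <= cuts[2])
large <- as.integer(size > cuts[2])

# Each observation should belong to exactly one tercile
dummy_sum <- small + mid + large
print(dummy_sum)

# Same split as one factor; intervals are closed on the right, matching the <= comparisons
size_group <- cut(size, breaks = c(-Inf, cuts, Inf), labels = c("small", "mid", "large"))
print(table(size_group))
```

Observations with a missing size get NA in all three dummies and in the factor, so they drop out of any regression that uses them.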

LilMichelangelo2 commented 2 months ago

@autopilot i want to check the effect of the L1_team dummy for different fund sizes (ideally, per size tercile). Implement the interaction term in R:


realised_regression <- feols(Transformed_Value_Added ~ L1_team + small_cap + mid_cap + large_cap 
                             + IMR + log(L1_Fund_Age) + L1_Expense_Ratio 
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
                             + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                             data = realised_regression_data, cluster = c("FundId", "Date"))
codeautopilot[bot] commented 2 months ago

To check the effect of the L1_team dummy for different fund sizes, you can interact L1_team with the fund size dummy variables (small_cap, mid_cap, large_cap) in your feols regression model. Because the three dummies sum to 1, they cannot all enter next to the fixed effects, and L1_team cannot enter next to all three of its interactions: one term in each set would be dropped as collinear. One way to write the model so that each tercile gets its own L1_team coefficient is:

realised_regression <- feols(Transformed_Value_Added ~ small_cap + mid_cap
                             + L1_team:small_cap + L1_team:mid_cap + L1_team:large_cap
                             + IMR + log(L1_Fund_Age) + L1_Expense_Ratio
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
                             + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                             data = realised_regression_data, cluster = c("FundId", "Date"))

In this code, the : operator adds only the products of L1_team with each size dummy, without an L1_team main effect. The three interaction coefficients are then the L1_team effects in the small, mid and large tercile, and large_cap is the omitted reference category for the size level. If you would rather read differences against a reference tercile, use L1_team * (small_cap + mid_cap) instead: the L1_team coefficient is then the large-cap effect and the interaction coefficients are the differences from it.
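The relation between "one team coefficient per tercile" and "a reference tercile plus differences" can be checked on simulated data. The sketch below uses base R lm without fixed effects, and all variables are made up for the illustration. The interaction columns are built by hand so that the coefficient names are unambiguous.

```r
set.seed(42)
n <- 300
d <- data.frame(team = rbinom(n, 1, 0.5), tercile = sample(1:3, n, replace = TRUE))
d$small <- as.integer(d$tercile == 1)
d$mid   <- as.integer(d$tercile == 2)
d$large <- as.integer(d$tercile == 3)

# Hand-built interaction columns
d$team_small <- d$team * d$small
d$team_mid   <- d$team * d$mid
d$team_large <- d$team * d$large

# True team effects: 1 (small), 2 (mid), 3 (large)
d$y <- 0.5 * d$small + 1 * d$team_small + 2 * d$team_mid + 3 * d$team_large + rnorm(n)

# (a) one team coefficient per tercile, no team main effect
per_group <- lm(y ~ small + mid + team_small + team_mid + team_large, data = d)

# (b) large tercile as reference, interactions are differences from it
reference <- lm(y ~ team + small + mid + team_small + team_mid, data = d)

print(round(coef(per_group), 3))
print(round(coef(reference), 3))
```

Both models have the same fit. Version (a) reports the tercile effects directly, while version (b) gives a direct test of whether a tercile differs from the reference one.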

LilMichelangelo2 commented 2 months ago

@autopilot again, i want to check the effect of the L1_team dummy for different fund sizes (ideally, per size tercile or quintile). I.e., i want to be able to compare the L1_team coefficients for different size categories. improve and correct my code:

md_1992 <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/MS/md_1992.csv")
md_1992$Date <- as.Date(md_1992$Date)
md_1992 <- md_1992[,2:ncol(md_1992)]

md_obj <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/MS/md_clean.csv")
md_obj$Date <- as.Date(md_obj$Date)

md_obj_unique <- md_obj %>%
  group_by(FundId, Date) %>%
  summarise(Prospectus..Objective = first(Prospectus..Objective), .groups = 'drop')

IMR <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/IMR.csv")
IMR <- IMR[,2:ncol(IMR)]
IMR$Date <- as.Date(IMR$Date)

md_data <- md_1992 %>%
  dplyr::select(
    FundId, Date, team, L1_team, L1_Family_Size, L1_Fund_Size, L1_Raw_Return_Volatility,
    L1_Fund_Age, L1_Net_Fund_Flow, L1_Turnover_Ratio, L1_Expense_Ratio, 
    L1_Equity_Holdings, L1_Cash_Holdings, L1_Tenure, L1_Female, Fund_Family
  )

md_data$L1_Raw_Return_Volatility <- md_data$L1_Raw_Return_Volatility*100
md_data$L1_Net_Fund_Flow <- md_data$L1_Net_Fund_Flow*100
md_data$L1_Expense_Ratio <- md_data$L1_Expense_Ratio*100
md_data$L1_Equity_Holdings <- md_data$L1_Equity_Holdings*100
md_data$L1_Cash_Holdings <- md_data$L1_Cash_Holdings*100
md_data$L1_Turnover_Ratio <- md_data$L1_Turnover_Ratio*100
md_data$L1_Female <- md_data$L1_Female*100 

md_data <- md_data %>%
  left_join(md_obj_unique, by = c("FundId", "Date"))

md_data <- md_data %>%
  mutate(TIMExOBJECTIVE = interaction(Date, Prospectus..Objective))

md_data$L1_Fund_Age <- md_data$L1_Fund_Age + 1

md_data$L1_Tenure <- md_data$L1_Tenure / 12
md_data$L1_Tenure <- round(md_data$L1_Tenure, digits = 0)
md_data$L1_Tenure <- md_data$L1_Tenure

md_data <- md_data %>%
  mutate(
    small_cap = ifelse(ntile(L1_Fund_Size, 3) == 1, 1, 0),
    mid_cap = ifelse(ntile(L1_Fund_Size, 3) == 2, 1, 0),
    large_cap = ifelse(ntile(L1_Fund_Size, 3) == 3, 1, 0)
  )

########### R E A L I S E D ########### 

realised_value_added <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/realised_value_added.csv")
realised_value_added <- realised_value_added[,2:ncol(realised_value_added)]
realised_value_added <- realised_value_added[,c(1,2,6)]
realised_value_added$Date <- as.Date(realised_value_added$Date)

realised_regression_data <- realised_value_added %>%
  left_join(md_data, by = c("FundId", "Date"))

realised_regression_data <- realised_regression_data %>%
  left_join(IMR, by = c("FundId", "Date"))

realised_regression_data <- na.omit(realised_regression_data)

signed_log <- function(x) {
  sign(x) * log1p(abs(x))
}

realised_regression_data$Transformed_Value_Added <- signed_log(realised_regression_data$Value_Added)

realised_regression <- feols(Transformed_Value_Added ~ L1_team * (small_cap + mid_cap + large_cap)
                             + IMR + log(L1_Fund_Age) + L1_Expense_Ratio 
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
                             + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                             data = realised_regression_data, cluster = c("FundId", "Date"))

summary(realised_regression)
codeautopilot[bot] commented 2 months ago

Your code runs, but two things stand in the way of comparing the L1_team coefficients across size categories. First, L1_team * (small_cap + mid_cap + large_cap) is over-parameterised: the three dummies sum to 1, so one size dummy and one interaction will be dropped as collinear. Second, ntile(L1_Fund_Size, 3) ranks all fund-date observations pooled together. If you want a fund's size relative to its peers at each point in time, compute the terciles within each Date. That grouping is an assumption about what you want; remove the group_by(Date) line to keep pooled terciles.

Here's an improved version of your code with comments on the changes:

library(dplyr)
library(fixest)

# Read and prepare the data
md_1992 <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/MS/md_1992.csv")
md_1992$Date <- as.Date(md_1992$Date)
md_1992 <- md_1992[, -1] # Assuming the first column is an index column to be removed

md_obj <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/MS/md_clean.csv")
md_obj$Date <- as.Date(md_obj$Date)

md_obj_unique <- md_obj %>%
  group_by(FundId, Date) %>%
  summarise(Prospectus..Objective = first(Prospectus..Objective), .groups = 'drop')

IMR <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/IMR.csv")
IMR <- IMR[, -1] # Assuming the first column is an index column to be removed
IMR$Date <- as.Date(IMR$Date)

# Only these variables are rescaled by 100, as in your original code
pct_vars <- c("L1_Raw_Return_Volatility", "L1_Net_Fund_Flow", "L1_Expense_Ratio",
              "L1_Equity_Holdings", "L1_Cash_Holdings", "L1_Turnover_Ratio", "L1_Female")

md_data <- md_1992 %>%
  dplyr::select(FundId, Date, team, L1_team, L1_Family_Size, L1_Fund_Size, L1_Raw_Return_Volatility,
                L1_Fund_Age, L1_Net_Fund_Flow, L1_Turnover_Ratio, L1_Expense_Ratio,
                L1_Equity_Holdings, L1_Cash_Holdings, L1_Tenure, L1_Female,
                Fund_Family) %>% # Fund_Family is kept because it is a fixed effect below
  mutate(across(all_of(pct_vars), ~ .x * 100),
         L1_Tenure = round(L1_Tenure / 12, 0),
         L1_Fund_Age = L1_Fund_Age + 1) %>%
  left_join(md_obj_unique, by = c("FundId", "Date")) %>%
  mutate(TIMExOBJECTIVE = interaction(Date, Prospectus..Objective)) %>%
  group_by(Date) %>% # Rank each fund against its peers on the same date
  mutate(size_tercile = ntile(L1_Fund_Size, 3),
         small_cap = as.integer(size_tercile == 1),
         mid_cap = as.integer(size_tercile == 2),
         large_cap = as.integer(size_tercile == 3)) %>%
  ungroup()

# Prepare realised regression data
realised_value_added <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/realised_value_added.csv")
realised_value_added <- realised_value_added[, -1] # Drop the index column first, as in your script
realised_value_added <- realised_value_added[, c(1, 2, 6)]
realised_value_added$Date <- as.Date(realised_value_added$Date)

realised_regression_data <- realised_value_added %>%
  left_join(md_data, by = c("FundId", "Date")) %>%
  left_join(IMR, by = c("FundId", "Date")) %>%
  na.omit()

# Define the signed log transformation
signed_log <- function(x) {
  sign(x) * log1p(abs(x))
}

realised_regression_data$Transformed_Value_Added <- signed_log(realised_regression_data$Value_Added)

# One L1_team coefficient per size tercile; large_cap is the reference for the size level
realised_regression <- feols(Transformed_Value_Added ~ small_cap + mid_cap
                             + L1_team:small_cap + L1_team:mid_cap + L1_team:large_cap
                             + IMR + log(L1_Fund_Age) + L1_Expense_Ratio 
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
                             + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                             data = realised_regression_data, cluster = c("FundId", "Date"))

# View the summary of the regression
summary(realised_regression)

The three interaction coefficients are the L1_team effects in the small, mid and large tercile and can be compared directly. The code assumes that the first column in each CSV file is an index column that should be removed, as in your original script, and it keeps columns 1, 2 and 6 of realised_value_added after that removal. Adjust the column indices and names if your files are laid out differently. For quintiles, replace ntile(L1_Fund_Size, 3) with ntile(L1_Fund_Size, 5) and create five dummies in the same way.
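Whether size groups are formed over the pooled sample or within each date can change the classification a lot when fund sizes grow over time. The base R sketch below contrasts the two on a made-up panel. The helper tercile_rank is hypothetical, assumes no missing values, and is a simple rank-based stand-in rather than a reimplementation of dplyr::ntile.

```r
# Hypothetical helper: rank-based tercile (1 = smallest third), assumes no NA values
tercile_rank <- function(x) {
  ceiling(3 * rank(x, ties.method = "first") / length(x))
}

# Toy panel: three funds observed on two dates, all funds larger on the later date
panel <- data.frame(
  date = rep(c("2000-01-31", "2010-01-31"), each = 3),
  size = c(1, 2, 3, 10, 20, 30)
)

# Pooled: the later date fills the upper groups regardless of relative size
panel$pooled_tercile <- tercile_rank(panel$size)

# Within date: each fund is ranked against its peers on the same date
panel$by_date_tercile <- ave(panel$size, panel$date, FUN = tercile_rank)

print(panel)
```

In the pooled version the grouping partly picks up the time trend in fund size, which the date fixed effects already capture.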

LilMichelangelo2 commented 2 months ago

@autopilot


> realised_regression <- feols(Transformed_Value_Added ~ L1_team * (small_cap + mid_cap + large_cap)
+                              + IMR + log(L1_Fund_Age) + L1_Expense_Ratio 
+                              + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
+                              + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
+                              + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
+                              data = realised_regression_data, cluster = c("FundId", "Date"))
The variable 'L1_team:large_cap' has been removed because of collinearity (see $collin.var).
codeautopilot[bot] commented 2 months ago

The message means that fixest dropped the interaction term L1_team:large_cap because it is collinear with other regressors. Collinearity occurs when one predictor is (almost) a linear combination of other predictors, so its coefficient cannot be estimated separately.

In your model the collinearity is most likely exact rather than approximate. small_cap, mid_cap and large_cap come from the same tercile split, so they sum to 1 for every observation with a non-missing L1_Fund_Size. As a result, L1_team:small_cap + L1_team:mid_cap + L1_team:large_cap equals L1_team, and one of these four terms is redundant. The {fixest} package detects this and removes a redundant variable so that the model remains estimable.
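The exact dependence can be checked on simulated data. The sketch below is a minimal base-R illustration, not your actual data: the data frame toy, its sample size, and the rank-based tercile helper (a stand-in for dplyr::ntile) are assumptions made for the example.

```r
# Toy data (hypothetical): a team indicator and a positive fund size
set.seed(1)
n <- 300
toy <- data.frame(
  L1_team      = rbinom(n, 1, 0.5),
  L1_Fund_Size = rlnorm(n, meanlog = 5, sdlog = 1)
)

# Base-R stand-in for dplyr::ntile(x, 3): equal-count terciles by rank
tercile <- ceiling(3 * rank(toy$L1_Fund_Size, ties.method = "first") / n)
toy$small_cap <- as.integer(tercile == 1)
toy$mid_cap   <- as.integer(tercile == 2)
toy$large_cap <- as.integer(tercile == 3)

# The three dummies sum to one in every row ...
dummy_sum <- toy$small_cap + toy$mid_cap + toy$large_cap

# ... so the three interaction terms add up to L1_team itself
interaction_sum <- toy$L1_team * dummy_sum

# Design matrix of the full specification and its numerical rank
X <- model.matrix(~ L1_team * (small_cap + mid_cap + large_cap), data = toy)
c(columns = ncol(X), rank = qr(X)$rank)
```

In this toy design the rank is lower than the number of columns, which is the same kind of redundancy that the fixest message reports.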

To address this issue, you can:

  1. Check for Collinearity: Inspect the $collin.var element of your model object to see which variables were removed because of collinearity. This can help you understand the relationships between your variables.

  2. Reconsider Model Specification: Think about whether all the interaction terms are necessary or if the model can be simplified. Sometimes, collinearity arises because the model is over-specified.

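  3. Centering Variables: Centering the variables involved in the interaction (subtracting the mean) can reduce the correlation between a main effect and its interaction term. It does not help when the dependence is exact, as it is with a full set of category dummies.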

  4. Variance Inflation Factor (VIF): Calculate the VIF for your variables to quantify the severity of collinearity. Variables with a high VIF may need to be removed or modified.

  5. Examine Data: Ensure that the dummy variables for fund size are correctly specified and that there is enough variation across the L1_team and large_cap categories.

  6. Alternative Approaches: If the interaction term is essential for your analysis, consider using other methods such as stratified models where you run separate regressions for each fund size category.
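For item 4, a variance inflation factor can be computed by hand in base R as 1 / (1 - R^2), where R^2 comes from regressing one predictor on all the others. The sketch below uses made-up variables x1, x2 and x3 (assumptions for illustration, not columns of your data). A VIF is only finite when the collinearity is not exact, so it is a diagnostic for the remaining regressors rather than for the dropped term.

```r
# Hypothetical predictors: x2 is built to be strongly correlated with x1
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
preds <- data.frame(x1, x2, x3)

# VIF of each predictor: regress it on all the others and use 1 / (1 - R^2)
vif <- sapply(names(preds), function(v) {
  others <- setdiff(names(preds), v)
  r2 <- summary(lm(reformulate(others, response = v), data = preds))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```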

Here's an example of how you can check for collinearity using the collin.var element:

# Run the regression
realised_regression <- feols(Transformed_Value_Added ~ L1_team * (small_cap + mid_cap + large_cap)
                             + IMR + log(L1_Fund_Age) + L1_Expense_Ratio 
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
                             + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                             data = realised_regression_data, cluster = c("FundId", "Date"))

# Check for collinearity
print(realised_regression$collin.var)

This prints the variables that were removed because of collinearity. Use this information to decide how to adjust your model.

LilMichelangelo2 commented 2 months ago

@autopilot I can't use the cap dummies because of collinearity (large_cap). How would you proceed? I want to know how the size of the fund affects the coefficient of L1_team:

md_data <- md_data %>%
  mutate(small_cap = as.integer(ntile(L1_Fund_Size, 3) == 1),
         mid_cap = as.integer(ntile(L1_Fund_Size, 3) == 2),
         large_cap = as.integer(ntile(L1_Fund_Size, 3) == 3))

########### R E A L I S E D ########### 

realised_value_added <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/realised_value_added.csv")
realised_value_added <- realised_value_added[,2:ncol(realised_value_added)]
realised_value_added <- realised_value_added[,c(1,2,6)]
realised_value_added$Date <- as.Date(realised_value_added$Date)

realised_regression_data <- realised_value_added %>%
  left_join(md_data, by = c("FundId", "Date"))

realised_regression_data <- realised_regression_data %>%
  left_join(IMR, by = c("FundId", "Date"))

realised_regression_data <- na.omit(realised_regression_data)

signed_log <- function(x) {
  sign(x) * log1p(abs(x))
}

realised_regression_data$Transformed_Value_Added <- signed_log(realised_regression_data$Value_Added)

realised_regression <- feols(Transformed_Value_Added ~ L1_team * (small_cap + mid_cap + large_cap)
                             + IMR + log(L1_Fund_Age) + L1_Expense_Ratio 
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
                             + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                             data = realised_regression_data, cluster = c("FundId", "Date"))

summary(realised_regression)
codeautopilot[bot] commented 2 months ago

If you're encountering collinearity issues with the large_cap dummy variable, one approach is to omit the large_cap dummy and use only the small_cap and mid_cap dummies in your regression model. By excluding large_cap, you avoid perfect multicollinearity since the omitted category (in this case, large-cap funds) will serve as the reference group against which the other categories are compared.

Here's how you can modify your regression model to proceed without the large_cap dummy:

# Adjust the md_data dataframe to include only small_cap and mid_cap dummies
md_data <- md_data %>%
  mutate(small_cap = as.integer(ntile(L1_Fund_Size, 3) == 1),
         mid_cap = as.integer(ntile(L1_Fund_Size, 3) == 2))
# Note: large_cap is omitted and will be the reference category

# Prepare the realised regression data as before
realised_value_added <- read.csv("/Users/michelangelo/Desktop/Master Thesis Code/realised_value_added.csv")
realised_value_added <- realised_value_added[, -1]          # drop the leading index column
realised_value_added <- realised_value_added[, c(1, 2, 6)]  # keep the same three columns as in your script
realised_value_added$Date <- as.Date(realised_value_added$Date)

realised_regression_data <- realised_value_added %>%
  left_join(md_data, by = c("FundId", "Date")) %>%
  left_join(IMR, by = c("FundId", "Date")) %>%
  na.omit()

# Define the signed log transformation
signed_log <- function(x) {
  sign(x) * log1p(abs(x))
}

realised_regression_data$Transformed_Value_Added <- signed_log(realised_regression_data$Value_Added)

# Run the regression without the large_cap dummy
realised_regression <- feols(Transformed_Value_Added ~ L1_team * (small_cap + mid_cap)
                             + IMR + log(L1_Fund_Age) + L1_Expense_Ratio 
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
                             + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                             data = realised_regression_data, cluster = c("FundId", "Date"))

# View the summary of the regression
summary(realised_regression)

In this revised model, the coefficient on L1_team is the effect of L1_team for large-cap funds (the reference group). The L1_team:small_cap and L1_team:mid_cap interaction terms show how that effect differs for small-cap and mid-cap funds, so the effect for small-cap funds is the sum of the L1_team and L1_team:small_cap coefficients. This approach allows you to investigate the effect of fund size on the coefficient of L1_team without running into collinearity issues.
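The reference-group reading can be verified on simulated data. This is a minimal base-R sketch using lm without fixed effects or clustering; the toy data, the assumed effect sizes, and the size column are hypothetical and only serve the illustration.

```r
# Toy data (hypothetical): team indicator and a size category
set.seed(7)
n <- 600
toy <- data.frame(
  L1_team = rbinom(n, 1, 0.5),
  size    = sample(c("small", "mid", "large"), n, replace = TRUE)
)
toy$small_cap <- as.integer(toy$size == "small")
toy$mid_cap   <- as.integer(toy$size == "mid")   # large is the omitted category

# Assumed true team effects: 0.5 (large), 1.0 (mid), 2.0 (small)
true_effect <- c(large = 0.5, mid = 1.0, small = 2.0)[as.character(toy$size)]
toy$y <- unname(true_effect) * toy$L1_team + rnorm(n)

fit <- lm(y ~ L1_team * (small_cap + mid_cap), data = toy)
b <- coef(fit)

# Implied effect of L1_team within each size group
effects <- c(
  large = unname(b["L1_team"]),
  mid   = unname(b["L1_team"] + b["L1_team:mid_cap"]),
  small = unname(b["L1_team"] + b["L1_team:small_cap"])
)
effects
```

Because this toy model has no other covariates, each implied effect equals the difference in mean y between team and non-team observations within that size group.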

LilMichelangelo2 commented 2 months ago

@autopilot implement a panel regression approach for the following regression. Keep the fixed effects and the clustering:


realised_regression <- feols(Transformed_Value_Added ~ L1_team 
                             + IMR + log(L1_Fund_Age) + L1_Expense_Ratio 
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
                             + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                             data = realised_regression_data, cluster = c("FundId", "Date"))
codeautopilot[bot] commented 2 months ago

Given the collinearity issue with the large_cap dummy variable, one way to investigate how fund size affects the coefficient of L1_team is to use a continuous measure of fund size instead of categorical dummies. Interacting L1_team directly with the continuous L1_Fund_Size variable lets you estimate how the marginal effect of L1_team varies with fund size, without any size category dummies.

Here's how you can implement this in your panel regression:

realised_regression <- feols(Transformed_Value_Added ~ L1_team * L1_Fund_Size
                             + IMR + log(L1_Fund_Age) + L1_Expense_Ratio 
                             + L1_Turnover_Ratio + L1_Raw_Return_Volatility + L1_Net_Fund_Flow 
                             + L1_Cash_Holdings + L1_Equity_Holdings + L1_Tenure + L1_Female
                             + log(L1_Family_Size) | Fund_Family + TIMExOBJECTIVE,
                             data = realised_regression_data, cluster = c("FundId", "Date"))

In this revised model, L1_team * L1_Fund_Size creates an interaction term between L1_team and the continuous variable L1_Fund_Size. The coefficient of this interaction term will indicate how the effect of L1_team changes with fund size.

Remember to interpret the interaction term carefully. The main effect of L1_team will represent the effect of L1_team when L1_Fund_Size is zero, which may not be meaningful if L1_Fund_Size does not naturally take on values near zero. The interaction term's coefficient will tell you how the effect of L1_team changes for each unit increase in L1_Fund_Size.
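One way to make the main effect easier to read is to center the size variable before interacting it. The sketch below is a base-R illustration on simulated data with lm (no fixed effects or clustering); the toy data frame, the assumed coefficients, and the column size_c are assumptions. Centering leaves the interaction coefficient unchanged and turns the L1_team coefficient into the effect at average fund size.

```r
# Toy data (hypothetical): team indicator and a positive, skewed fund size
set.seed(3)
n <- 400
toy <- data.frame(
  L1_team      = rbinom(n, 1, 0.5),
  L1_Fund_Size = rlnorm(n, meanlog = 5, sdlog = 1)
)
toy$y <- 0.2 * toy$L1_team + 0.001 * toy$L1_team * toy$L1_Fund_Size + rnorm(n)

# Raw specification: the L1_team coefficient is the effect at L1_Fund_Size = 0
raw <- coef(lm(y ~ L1_team * L1_Fund_Size, data = toy))

# Centered specification: the L1_team coefficient is the effect at mean size
toy$size_c <- toy$L1_Fund_Size - mean(toy$L1_Fund_Size)
cen <- coef(lm(y ~ L1_team * size_c, data = toy))

# Marginal effect of L1_team at a given fund size, computed from the raw fit
team_effect_at <- function(size) {
  unname(raw["L1_team"] + raw["L1_team:L1_Fund_Size"] * size)
}
team_effect_at(mean(toy$L1_Fund_Size))  # same value as cen["L1_team"]
```

The helper team_effect_at can also be evaluated at other sizes, for example at the quartiles of L1_Fund_Size, to see how the estimated effect moves across the size distribution.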

After running the regression, you can use the summary function to view the results and interpret the coefficients:

summary(realised_regression)

This approach avoids the exact collinearity that arises from including a dummy for every size category. The trade-off is that it assumes the effect of L1_team changes linearly with L1_Fund_Size.