Closed PattyTimmons closed 2 months ago
Hi @PattyTimmons, it appears that you will need to create your own RMD file for this lab. Best
Thanks @AntJam-Howell. Question, is there a link for the data file csv? Or am I not thinking of this error correctly?
Error in eval(mf, parent.frame()) : object 'dat' not found
Hi @PattyTimmons,
Strictly speaking, this lab does not require you to use an RMD file and R code to answer the questions. The questions are answerable by looking at the model results shown in the lab (e.g. model 1 and model 2, etc.).
If you want to replicate the model results for extra coding practice, though, that is also perfectly fine. To do that, please include the following code to access the data:
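library( dplyr )    # provides the %>% pipe
library( pander )   # provides pander() for formatted tables
URL <- "https://raw.githubusercontent.com/DS4PS/cpp-523-fall-2019/master/labs/data/engineer-salaries.csv"
dat <- read.csv( URL, stringsAsFactors=FALSE )
head( dat ) %>% pander()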
@AntJam-Howell Thank you for the link, I did want a bit more practice, but I have another question. In models 1 and 2, is the reference group white males?
@PattyTimmons Nice! And yes, for model 1 that is correct. In model 2 the reference group is different: just as the coefficients on the dummy variables change in model 2 compared to model 1, so too does the reference group.
@AntJam-Howell Professor Howell, I am confused on how to determine the m2 reference group as well.
How are we able to determine the reference group is minority female? I was only able to figure it out after answering Q4 in the lab which seems to give it away (what is the salary for female minorities).
Could you elaborate on how we would know that just by looking at the models?
Hi @swest235 It is difficult to know just by looking at the model results. You need to first understand how the variables are coded. Take the simplest case: a single binary (dummy) variable like gender, coded female = 1 and male = 0.
Defining the Reference Group
In this setup, the reference group is the category for which the dummy variable takes the value 0. In our case that is male, and all females are assigned a value of 1. The regression equation for this model can be written as:
Income = β0 + β1*Female + ε
In this equation:
- Income is the dependent variable.
- β0 (beta zero) is the intercept, representing the average income for males (since Female = 0 for males, making males the reference group).
- β1 (beta one) is the coefficient for the Female dummy variable, representing the difference in average income between females and the reference group (males).
- ε (epsilon) represents the error term.
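To make the roles of β0 and β1 concrete, here is a minimal sketch on made-up numbers. The data frame `toy` and its six incomes are hypothetical and are not taken from the lab data; the point is only that the fitted intercept matches the male mean and the slope matches the female-minus-male gap.

```r
# Hypothetical toy data: six made-up incomes, not from the lab data set
toy <- data.frame(
  income = c( 50, 54, 58, 40, 44, 48 ),
  female = c(  0,  0,  0,  1,  1,  1 )   # 1 = female, 0 = male (reference group)
)

m <- lm( income ~ female, data = toy )
b <- coef( m )

# Group means computed directly from the data
mean.male   <- mean( toy$income[ toy$female == 0 ] )
mean.female <- mean( toy$income[ toy$female == 1 ] )

b[ "(Intercept)" ]        # equals mean.male
b[ "female" ]             # equals mean.female - mean.male
```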
Changing the Reference Group
To change the reference group, one could redefine the dummy variable so that male = 1 and female = 0. This redefinition would make females the reference group. The constant in the regression equation would then represent the average income for females, and the coefficient of the dummy variable would represent the difference in average income for males compared to females.
Implications for the Constant
Changing the reference group alters the interpretation of both the constant and the coefficient of the dummy variable. The constant always represents the average outcome (income, in this case) for the reference group, and the coefficient of the dummy variable indicates the difference in the outcome between the non-reference group and the reference group.
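A quick way to see this is to fit the same toy model under both codings. This is only a sketch on hypothetical numbers (the `toy` data frame is made up for illustration): the intercept moves from the male mean to the female mean, the slope flips sign, and the fitted values do not change.

```r
# Hypothetical toy data, made up for illustration
toy <- data.frame(
  income = c( 50, 54, 58, 40, 44, 48 ),
  female = c(  0,  0,  0,  1,  1,  1 )
)
toy$male <- 1 - toy$female     # reverse the coding: male = 1, female = 0

m.f <- lm( income ~ female, data = toy )   # males are the reference group
m.m <- lm( income ~ male,   data = toy )   # females are the reference group

coef( m.f )   # intercept = male mean,   slope = female mean - male mean
coef( m.m )   # intercept = female mean, slope = male mean - female mean

# Same model, different baseline: the predictions are identical
same.fit <- isTRUE( all.equal( fitted( m.f ), fitted( m.m ) ) )
```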
Selection of the Reference Group
The choice of the reference group and the interpretation of the constant depend on the context of the study. For instance, in analyses examining racial disparities, it's common to select "white" as the reference group to assess how income differs for ethnic minorities (e.g., Black, Hispanic, Asian) compared to the population majority. This selection is typically guided by the research question, with the reference group often being the category against which other categories are compared to highlight disparities or differences.
@swest235 Now let's consider a multiple OLS equation with 3 dummy variables (instead of 2 as in the lab).
Expanding the simple OLS model to include two additional explanatory variables—a dummy for ethnicity (1=white, 0=minority) and a dummy for region (1=North, 0=South)—we now have a multiple regression model that examines the influence of gender, ethnicity, and region on income. The regression equation for this enhanced model can be represented as:
Income = β0 + β1*Female + β2*White + β3*North + ε
- Income is the dependent variable.
- β0 (beta zero) is the intercept, representing the average income for the reference group, which is minority males in the South, based on the coding of the dummy variables (Female = 0, White = 0, North = 0).
- β1 (beta one) is the coefficient for the Female dummy variable, indicating the average difference in income between females and the reference group, holding ethnicity and region constant.
- β2 (beta two) is the coefficient for the White dummy variable, showing the average difference in income between white individuals and minorities, holding gender and region constant.
- β3 (beta three) is the coefficient for the North dummy variable, reflecting the average difference in income between individuals living in the North and those in the South, holding gender and ethnicity constant.
- ε (epsilon) represents the error term, accounting for variations in income not explained by the model.
Interpretation of the Constant
In this enhanced model, the constant term (β0) represents the average income for the reference group defined by all three dummy variables being set to 0. Specifically, this means:
- For the gender dummy, Female = 0 implies the reference category is male.
- For the ethnicity dummy, White = 0 implies the reference category is minority.
- For the region dummy, North = 0 implies the reference category is South.
Therefore, β0 denotes the average income for minority males in the South, as these categories are omitted (reference groups) in the regression equation based on how we've coded our dummy variables.
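One way to check which group the constant describes is to ask the fitted model for its prediction when every dummy is 0. The sketch below uses hypothetical data (one made-up income per combination of the three dummies), so the numbers mean nothing; only the equality between the intercept and the prediction for the all-zero row matters.

```r
# Hypothetical toy data: one made-up income per combination of the three dummies
toy <- expand.grid( Female = 0:1, White = 0:1, North = 0:1 )
toy$Income <- c( 40, 36, 48, 45, 44, 41, 53, 47 )

m <- lm( Income ~ Female + White + North, data = toy )

# The reference group is the row where every dummy equals 0
ref <- data.frame( Female = 0, White = 0, North = 0 )

b0       <- coef( m )[ "(Intercept)" ]    # the constant
pred.ref <- predict( m, newdata = ref )   # model prediction for minority males in the South

b0
pred.ref    # same value as b0
```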
Changing the Reference Group
Just as with the single dummy variable model, changing the coding of any dummy variable would change the reference group and thus the interpretation of the constant and the coefficients. For example, coding the ethnicity variable as (0=white, 1=minority) would make white individuals the reference group for ethnicity, altering the baseline against which differences are measured.
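If the variable is stored as a factor rather than a 0/1 column, R picks the first factor level as the reference group, and `relevel()` is one way to switch it. This is a sketch on hypothetical data; the variable names and incomes are made up, and the lab itself may code ethnicity differently.

```r
# Hypothetical toy data, made up for illustration
toy <- data.frame(
  income    = c( 60, 64, 50, 54 ),
  ethnicity = factor( c( "white", "white", "minority", "minority" ) )
)

# Levels sort alphabetically, so "minority" is the default reference group
m1 <- lm( income ~ ethnicity, data = toy )

# Make "white" the reference group instead
toy$ethnicity2 <- relevel( toy$ethnicity, ref = "white" )
m2 <- lm( income ~ ethnicity2, data = toy )

coef( m1 )   # intercept = minority mean, slope = white - minority
coef( m2 )   # intercept = white mean,    slope = minority - white
```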
Conclusion
In multiple regression models with dummy variables, the constant term represents the average outcome (here, income) for the omitted reference group across all categorical variables included in the model. Each coefficient for the dummy variables then indicates the average difference in outcome between that category and its reference group, controlling for the other variables in the model. The selection of reference groups should be guided by the research questions and the context, aiming to provide meaningful comparisons and insights into the factors influencing the outcome of interest.
@AntJam-Howell So in the lab example, it is deduction that tells us the reference group in model 2 is minority female.
If male is 0, that means female; if white is 0, that means minority; and if male x white is (0)(0), that leaves us with a minority female.
That is more straightforward than I was making it out to be, thank you for the clarification.
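The deduction above can be checked on a small example. The sketch below assumes a model with `male`, `white`, and their interaction, fitted to hypothetical salaries (the numbers are made up and are not the lab's results). With all three terms at 0, the intercept is the minority female mean, and adding the coefficients back up recovers the white male mean.

```r
# Hypothetical toy data: two made-up salaries per group, not the lab data
toy <- data.frame(
  male   = c(  0,  0,  1,  1,  0,  0,  1,  1 ),
  white  = c(  0,  0,  0,  0,  1,  1,  1,  1 ),
  salary = c( 50, 54, 58, 62, 55, 59, 70, 74 )
)

m <- lm( salary ~ male + white + male:white, data = toy )
b <- coef( m )

# Mean salary for each male x white combination, computed directly
group.means <- aggregate( salary ~ male + white, data = toy, FUN = mean )
group.means

b[ "(Intercept)" ]   # mean for male = 0, white = 0 (minority females)
sum( b )             # mean for male = 1, white = 1 (white males)
```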
I did have an additional question from the hypothesis test pdf. Where did this 9, -9, 9 come from?
Those are the regression coefficients that would be needed to accurately replicate each group mean. In the top case you have a model with one dummy for each group (the 4 groups being a combination of two binary variables: 00, 01, 10, 11) and no intercept. In other words, a model where each beta (b1-b4) will capture a group mean. But it also has no reference group, and thus the p-value associated with each regression coefficient is testing the hypothesis that the corresponding group mean is different from zero.
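As an illustration of the no-intercept case, the sketch below fits one dummy per group on hypothetical data. The outcome `y`, the group labels, and the values are all made up; `y ~ 0 + g` is the R formula syntax for dropping the intercept.

```r
# Hypothetical toy data: four groups with two made-up observations each
toy <- data.frame(
  y = c( 10, 12, 20, 22, 30, 32, 40, 42 ),
  g = factor( rep( c( "A", "B", "C", "D" ), each = 2 ) )
)

# One dummy per group and no intercept, so there is no reference group
m0 <- lm( y ~ 0 + g, data = toy )

coef( m0 )                       # one coefficient per group
tapply( toy$y, toy$g, mean )     # the raw group means, for comparison

# Each p-value here tests "group mean = 0", not a difference between groups
summary( m0 )$coefficients
```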
If you want a more meaningful hypothesis test you need to create a reference group by deciding which cases are omitted. Once you create a reference group the coefficients become additive instead of summative and the p-value represents a meaningful difference when compared to the reference group.
If you have two binary variables you could run 4 different models that are mathematically identical (you would get the exact same group means in all cases) but each uses a different reference group. Your choice of reference group determines which hypotheses will be tested by the model.
y ~ sub + reg + sub.reg # D is the reference group
y ~ sub + tfa + sub.tfa # B is the reference group
y ~ urb + reg + urb.reg # C is the reference group
y ~ urb + tfa + urb.tfa # A is the reference group
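The four specifications can be compared side by side on a toy example. This sketch assumes that `urb` is the complement of `sub`, that `tfa` is the complement of `reg`, and that the dotted names are their products; the data values are hypothetical. All four models give the same fitted values, and only the intercept (the reference group mean) changes.

```r
# Hypothetical toy data: two made-up observations per group
toy <- data.frame(
  y   = c( 10, 12, 20, 22, 30, 32, 40, 42 ),
  sub = c(  0,  0,  0,  0,  1,  1,  1,  1 ),
  reg = c(  0,  0,  1,  1,  0,  0,  1,  1 )
)
toy$urb <- 1 - toy$sub     # assumption: urb is the complement of sub
toy$tfa <- 1 - toy$reg     # assumption: tfa is the complement of reg

# Interaction dummies, named as in the formulas above
toy$sub.reg <- toy$sub * toy$reg
toy$sub.tfa <- toy$sub * toy$tfa
toy$urb.reg <- toy$urb * toy$reg
toy$urb.tfa <- toy$urb * toy$tfa

m1 <- lm( y ~ sub + reg + sub.reg, data = toy )
m2 <- lm( y ~ sub + tfa + sub.tfa, data = toy )
m3 <- lm( y ~ urb + reg + urb.reg, data = toy )
m4 <- lm( y ~ urb + tfa + urb.tfa, data = toy )

# Identical predictions from all four models
fits <- cbind( fitted( m1 ), fitted( m2 ), fitted( m3 ), fitted( m4 ) )

# Different intercepts: each is the mean of that model's reference group
intercepts <- c( coef( m1 )[ 1 ], coef( m2 )[ 1 ], coef( m3 )[ 1 ], coef( m4 )[ 1 ] )
intercepts
```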
@AntJam-Howell Hello, just curious on Lab 5 if there is an rmd file to download or will we need to build the file completely.
Thanks