Dummy Variable Design Matrix

ebossert commented 2 years ago

I am having a really hard time deciphering how the dummy variable equations relate to the design matrix. In the top example I see how b0 = the suburban regular reference group because it is the omitted variable so it becomes the intercept. I don't understand how in the b0 + b2 equation the b2 is 0 when d.tfa in that row is a 1 and then how that equation identifies the suburban TFA. Also, where does the +/- 9 value come from? The only other place I found it was as the number of teachers in urban schools, but I would imagine if the omitted variable keeps changing for each example, then that number would change as well.
Hopefully this all made sense. Thanks!

asukajames commented 2 years ago

@ebossert If I am understanding this correctly, you can find each coefficient by using the design matrix, like what we did in math class. We know the intercept is 66, so b0 = 66. The first row (mean 75, d.sub = 1) gives b1: b0 + b1 = 75, so 66 + b1 = 75 and therefore b1 = 9.

To find b2: 57 = 66 + b2, so b2 = -9.

Lastly, 75 = 66 + 9(d.sub) - 9(d.reg) + b3(d.sub.reg), so b3 = 9.
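
The arithmetic above can also be checked as a small linear system. This is a minimal sketch in Python with NumPy, assuming the four group means 66, 75, 57, and 75 used in this comment and the column order intercept, d.sub, d.reg, d.sub*d.reg. The variable names are illustrative and not from the course materials.

```python
import numpy as np

# One row per group; columns: intercept, d.sub, d.reg, d.sub * d.reg.
X = np.array([
    [1, 0, 0, 0],  # both dummies 0 (reference group), mean 66
    [1, 1, 0, 0],  # d.sub = 1 only, mean 75
    [1, 0, 1, 0],  # d.reg = 1 only, mean 57
    [1, 1, 1, 1],  # both dummies 1, mean 75
], dtype=float)
group_means = np.array([66.0, 75.0, 57.0, 75.0])

# Solve X @ b = group_means for b0, b1, b2, b3.
b = np.linalg.solve(X, group_means)
print(b)
```

Solving the system gives b0 = 66, b1 = 9, b2 = -9, b3 = 9, which matches the hand calculation.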

This is what/how I figured out! I am happy to show you over Zoom if you prefer that way :)

BrettMFoster commented 2 years ago

I personally don't understand your question. Perhaps Dr. Schlinkert will.

I'm sure I'm the dummy here, but a dummy code is a yes/no question where 0 is the answer 'no' and 1 is the answer 'yes'. In the design matrix, the question is: is this teacher suburban? 1 = yes. The dummy coding is based on a characteristic; for example, all teachers with zip codes of x are equal to 1 where 1 equals 'suburban' in column "c". Then, all zip codes of y are equal to 1 where 1 equals 'regular' in column 'd'.

The dummy codes will probably be used later to determine and control for effects by grouping. So, what are suburban regulars like? What are urban regulars like? How does that affect the outcome? How do differing groups' effects blur out a singular group's relationship with the outcome that we wouldn't otherwise detect?

In a correlation, or regression correlation, the easiest way I have ever found for interpreting the data is: 'As the dummy code goes from 0 to 1, math scores do x' or, in other words, one could say 'as group membership goes from Non-Suburban to Suburban, the math scores go [up or down]'.
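
That reading can be checked on a toy example: with a single 0/1 predictor, the least-squares slope equals the difference between the two group means. A small sketch in Python with NumPy follows. The scores are made up purely for illustration and are not the course data.

```python
import numpy as np

# Hypothetical math scores, for illustration only.
suburban = np.array([0, 0, 0, 1, 1, 1], dtype=float)  # 0 = non-suburban, 1 = suburban
score = np.array([60.0, 64.0, 62.0, 70.0, 74.0, 72.0])

# Fit score = b0 + b1 * suburban by least squares.
# np.polyfit returns the highest-degree coefficient first: [slope, intercept].
b1, b0 = np.polyfit(suburban, score, deg=1)

mean_non = score[suburban == 0].mean()
mean_sub = score[suburban == 1].mean()
print(b0, b1, mean_sub - mean_non)
```

Here the intercept is the mean of the group coded 0 and the slope is how far the mean moves when the dummy goes from 0 to 1.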

lecy commented 2 years ago

@BrettMFoster

The dummy coding is based on a characteristic; for example, all teachers with zip codes of x are equal to 1 where 1 equals 'suburban' in column "c". Then, all zip codes of y are equal to 1 where 1 equals 'regular' in column 'd'.

There are two grouping variables:

suburban vs urban schools (geography)
regular vs teach for america teachers

The treatment here is whether the Teach for America program works, so regular teachers are the control group. Helpful background: regular teachers have degrees in education, whereas Teach for America teachers have college degrees in any field, then go through an education bootcamp and are placed in schools for a couple of years. The program was designed to recruit more people into education, especially those coming from STEM fields. Participants get loan forgiveness or other financial incentives and are certified in education through the process without having to go back to school for it (those details might not be 100% accurate - but something like that).

Because they lack degrees in education and have less experience before starting the job one might assume they will not perform as well. But many are either motivated by service to high poverty areas, or they are placed in schools that need teachers (which are more likely high poverty areas), so they help fill an important gap in talent as well.

This is fake data that is meant to demonstrate cases where one set of dummies represents the treatment/control group and one set of dummies represents an important control variable.

Specifically, how can you use combinations of dummies to create meaningful comparisons?

What's a meaningful comparison? Regular vs TFA instructors in urban schools is apples to apples. And Regular vs TFA instructors in suburban schools is apples to apples.

If you just dump the dummies in the regression and don't worry about which groups are the reference groups then you will end up with non-meaningful or misleading comparisons, like regular instructors in suburban schools compared to TFA instructors in urban schools.

That is an apples to oranges comparison. It's impossible to determine whether performance differences result from teacher quality or from environment.

If you are really careless you would just look at significance of the coefficients without thinking about which hypothesis test it represents.

With two dummies you can test six distinct hypotheses, but only three at a time in any given model.

Figure out which hypothesis you want first, then figure out which dummy variables you need to get that test.
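
The counting can be sketched in a few lines of Python. This takes the framing used in this thread, that the three tests in a given model are the contrasts involving its reference group. The group labels and the helper `contrasts_for` are illustrative names, not from the course materials.

```python
from itertools import combinations

groups = ["suburban regular", "suburban TFA", "urban regular", "urban TFA"]

# Every unordered pair of groups is one possible contrast.
all_pairs = list(combinations(groups, 2))
print(len(all_pairs), "possible pairwise comparisons")

def contrasts_for(reference):
    """Contrasts that involve the chosen reference (omitted) group."""
    return [pair for pair in all_pairs if reference in pair]

for ref in groups:
    print(ref, "->", contrasts_for(ref))
```

Four groups give six pairs, and each choice of reference group picks out three of them, so the choice of reference decides which comparisons a model reports.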

For example, [ A vs C ] and [ B vs D ] are apples to apples comparisons, but you can't test both in the same model.

[ A vs D ] and [ B vs C ] are apples to oranges comparisons.

[ A vs B ] and [ C vs D ] are interesting tests (do teacher types perform consistently across environment?), but they are not tests of the "treatment" (TFA program efficacy) per se.

lecy commented 2 years ago

@ebossert these are not omitted variable problems in this example - we can assume these two variables explain most of the variance. It is just showing that you can only add two of the four dummies (otherwise you get perfect collinearity and a variable drops out), but you get to choose which two.
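
The collinearity point can be seen by checking the rank of the design matrix. A rough sketch in Python with NumPy, using one row per group. The column names and the `design` helper are assumptions made for this illustration.

```python
import numpy as np

# One row per group, described by (suburban, regular) membership.
cells = [(1, 1), (1, 0), (0, 1), (0, 0)]

def design(cols):
    """Build a design matrix with the requested columns, one row per group."""
    lookup = {
        "intercept": lambda s, r: 1,
        "suburban": lambda s, r: s,
        "urban": lambda s, r: 1 - s,
        "regular": lambda s, r: r,
        "tfa": lambda s, r: 1 - r,
    }
    return np.array([[lookup[c](s, r) for c in cols] for s, r in cells], dtype=float)

all_four = design(["intercept", "suburban", "urban", "regular", "tfa"])
two_only = design(["intercept", "urban", "regular"])

print(np.linalg.matrix_rank(all_four), "independent columns out of", all_four.shape[1])
print(np.linalg.matrix_rank(two_only), "independent columns out of", two_only.shape[1])
```

Because suburban + urban and regular + tfa each add up to the intercept column, the matrix with all four dummies has fewer independent columns than columns, so the coefficients cannot all be estimated. Keeping one dummy from each pair removes the redundancy.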

The final results will be the same each time (you can always recover the group means by adding coefficients).

But the hypotheses that each model contains change with each choice of variables.

BrettMFoster commented 2 years ago

suburban vs urban schools (geography)

regular vs teach for america teachers

@ebossert but you get to choose which two.

I think I see now. You're saying, for example, 0 = Suburban, 1= Urban; 0=Regular, 1=TFA. You're looking at a comparison then of (Urban vs. Suburban) vs (Regular vs TFA). Is this a point-biserial correlation with math?

And thank you for the background and explanation. This was really helpful for understanding the graph and intent.

lecy commented 2 years ago

Yes, all contrasts (hypothesis tests) are relative to the reference group. The reference group is the omitted group.

So if we include urban_dummy and regular_dummy then suburban+tfa is the reference group.

Each coefficient b1 to b3 represents a comparison against the suburban+tfa group.

S+TFA = S+R
S+TFA = U+TFA
S+TFA = U+R

Where b0 = S+TFA

And all others would be: b0 + bi (OR b0 + b1 + b2 + b3 for the last case, where b3 is the coefficient on the interaction term urban_dummy*regular_dummy)

(I should have used & instead of + to not make it look like addition but you get the point)

lecy commented 2 years ago

You can recover the group means from any of the models with two dummies plus the interaction.

You can’t recover all of the hypothesis tests from any single model. Those are unique to the specification (which dummies you used to create the reference group).
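
A rough way to see both points at once is to refit the same four group means under each choice of reference group. The sketch below uses Python with NumPy and the means 66, 75, 57, and 75 from the arithmetic earlier in the thread. The assignment of means to groups follows the d.sub / d.reg coding used there, and the `fit` helper is a made-up name for this illustration.

```python
import numpy as np

# Group means keyed by (suburban, regular), read off the d.sub / d.reg arithmetic.
means = {(0, 0): 66.0, (1, 0): 75.0, (0, 1): 57.0, (1, 1): 75.0}
cells = list(means)
y = np.array([means[c] for c in cells])

def fit(ref):
    """Coefficients for two dummies plus interaction, with `ref` as the omitted group."""
    ref_s, ref_r = ref
    rows = []
    for s, r in cells:
        d1 = int(s != ref_s)  # 1 if geography differs from the reference group
        d2 = int(r != ref_r)  # 1 if teacher type differs from the reference group
        rows.append([1, d1, d2, d1 * d2])
    X = np.array(rows, dtype=float)
    b = np.linalg.solve(X, y)
    return b, X @ b

for ref in cells:
    b, fitted = fit(ref)
    print("reference", ref, "coefficients", b, "fitted means", fitted)
```

The fitted means are the same for every reference group, while the coefficients change. For example, with suburban regular as the reference the teacher-type coefficient is 0, because both suburban groups have a mean of 75.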

ebossert commented 2 years ago

Thank you everyone for the helpful comments! @asukajames after doing some run throughs of the math I can see how you arrived at the values in question using the matrix :) @lecy I greatly appreciate the supplemental notes on this material. It has cleared up for me how dummy variables work and how to use the matrix.

Watts-College / cpp-523-fall-2021

Dummy Variable Design Matrix #12