DS4PS / cpp-525-spr-2020

Course shell for CPP 525 Foundations of Program Evaluation III for Spring 2020.
http://ds4ps.org/cpp-525-spr-2020/

LAB-03 #5

Open sunaynagoel opened 4 years ago

sunaynagoel commented 4 years ago

Section 2.1 of the lecture chapter shows the statistical model for an OLS regression. When I read the interpretation of the results, the coefficient does not make sense to me. I am not sure if I am reading the output incorrectly.

[screenshots attached]

lecy commented 4 years ago

That's a typo! Should have been 0.92, not 1.036.

It has been updated.

sunaynagoel commented 4 years ago


Thank you. @lecy

sunaynagoel commented 4 years ago

Section 2.2 of the lecture chapter. I have a few questions:

a. When estimating the fixed effects models (OLS with dummies and panel FE), is there a reason that Company 10 appears right after Public R&D?

b. What is the relationship between the OLS-with-dummies coefficients and the intercepts of the panel FE model?

~nina

lecy commented 4 years ago

a.

It is the same reason that, when sorted as text, "2" > "100":

> x <- sample( 1:10, 100, replace=T )
> x.f <- factor(x)  # x is numeric
> levels( x.f )
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
> x <- as.character(x)
> x.f <- factor(x)  # x is character
> levels( x.f )
 [1] "1"  "10" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"

b.

The dummy variables and the intercept still operate the same way.


So in Section 2.1 the intercept for Company 1 is 56,569 (b0) and for Company 2 it is 56,569 + 35,794 (b0 + b1).

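For example, here is a minimal sketch with made-up data (not the lecture's firm data) that shows the same arithmetic:

set.seed( 123 )
x  <- runif( 200, 0, 10 )
d2 <- rep( 0:1, each=100 )               # dummy = 1 for "Company 2"
y  <- 50 + 30*d2 + 5*x + rnorm( 200 )

m <- lm( y ~ x + d2 )
coef( m )
# (Intercept) is the Company 1 intercept (b0);
# the Company 2 intercept is (Intercept) + d2 (b0 + b1)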

lecy commented 4 years ago

@sunaynagoel If you want the companies to sort correctly you would need to use leading zeros :-)
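
For example (a quick sketch, not the lab data), padding the IDs with sprintf() makes the character sort match the numeric order:

ids <- sprintf( "%02d", 1:10 )   # "01", "02", ..., "10"
levels( factor(ids) )
#  [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10"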

sunaynagoel commented 4 years ago

Thank you.

sunaynagoel commented 4 years ago

Is it correct that the only difference between a. lm( y ~ x + factor(group) ) and b. lm( y ~ x + factor(group) - 1 ) is the intercept? In model "a" the constant is also the intercept for group 1, and model "b" generates a coefficient for every group (including group 1). If that is the case, what happens to the dummy variables in model "b", and what is the relationship between the constant and each individual group's coefficient?

lecy commented 4 years ago

Yes, that is correct.

When you suppress the intercept (case b) each group gets its own intercept. There is no longer a reference group: each coefficient represents a distinct group mean, and they all share the same slopes, similar to the male/female wage example above (note there are no interactions between the dummies and the slopes).

It is fine to do this because we are not interpreting intercepts directly. They basically act as controls in this model.

You would not want to do this if the groups represented hypotheses of interest. In the wage example above the female dummy tests the difference between male and female wages in the first job. In the diff-in-diff each group dummy represents a different test: a test for pre-treatment differences between the treatment and control groups, a test for the trend (C2 - C1), and a test for whether the post-treatment mean differs from the counterfactual.

If you include distinct dummies for each group (d.treat.pre, d.treat.post, d.control.pre, & d.control.post) the model would report the group means but you would lose all of the tests of your hypotheses.
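
To make the relationship between the two parameterizations concrete, here is a small sketch with made-up data (not the lab data):

set.seed( 1234 )
group <- rep( c("A","B","C"), each=50 )
x     <- rnorm( 150 )
y     <- 2*x + c( A=10, B=12, C=15 )[ group ] + rnorm( 150 )

m.a <- lm( y ~ x + factor(group) )       # with intercept: group A is the reference
m.b <- lm( y ~ x + factor(group) - 1 )   # intercept suppressed: one intercept per group

coef( m.a )   # (Intercept) = group A's intercept; factor(group)B and factor(group)C = differences from A
coef( m.b )   # factor(group)A, factor(group)B, factor(group)C = each group's own intercept

# the two parameterizations are linked, e.g.
# coef(m.b)[ "factor(group)B" ] equals coef(m.a)[ "(Intercept)" ] + coef(m.a)[ "factor(group)B" ]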

Make sense?

lecy commented 4 years ago

[images attached]

sunaynagoel commented 4 years ago


Yes, it makes sense. Thank you @lecy

castower commented 4 years ago

@lecy This question is not directly related to CPP 525, but I was curious whether the de-meaning process in the OLS model is similar to what we're doing in CPP 528, where we compute z-scores and center the census data to make comparisons. Would that be another example of de-meaning?

-Courtney

lecy commented 4 years ago

@castower It's related to that, yes.

By centering the data in panel models we are shifting all of the distributions over to a common axis. In inferential terms this accounts for different initial conditions.

Standardizing the data takes it one step further. We also divide by the standard deviation of each variable, so that each variable now has a mean of zero and an SD of one. If we are creating an index we care about the variance because we want one unit of item A to be comparable to one unit of item B.


library( ggplot2 )
library( ggpubr )
library( dplyr )

data( iris )

# center and standardize petal length within each species
iris <-
  iris %>%
  group_by( Species ) %>%
  mutate( centered = Petal.Length - mean(Petal.Length),
          standardized = ( Petal.Length - mean(Petal.Length) ) / sd(Petal.Length) )

# raw petal length
p1 <- ggplot( iris ) +
      geom_density( aes(x = Petal.Length, fill = Species),
                    alpha = 0.6 )

# centered: each species' distribution shifts to a mean of zero
p2 <- ggplot( iris ) +
      geom_density( aes(x = centered, fill = Species),
                    alpha = 0.6 )

# standardized: mean of zero and SD of one within each species
p3 <- ggplot( iris ) +
      geom_density( aes(x = standardized, fill = Species),
                    alpha = 0.6 )

ggarrange( p1, p2, p3,
           labels = c("actual", "centered", "standardized"),
           ncol = 1, nrow = 3 )

castower commented 4 years ago

@lecy thank you!

lecy commented 4 years ago

These are some examples of linear transformations of variables, which are useful in regression when you get into more advanced models.

A linear transformation converts X to a new variable X2 by multiplying by a constant and adding a constant:

X2 = mX + b

centering: X2 = X - mean(X)

standardizing: X2 = ( X - mean(X) ) / sd(X)
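
In R, for example, these are the same values that scale() produces:

x <- c( 2, 4, 6, 8 )
x - mean(x)                              # centered
( x - mean(x) ) / sd(x)                  # standardized
scale( x, center=TRUE, scale=TRUE )      # same standardized values (returned as a matrix)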

These transformations affect means and variances, so they are important to pay attention to in a regression context: changing the scale of a measure can change your inferences.

Measurement error, for example, adds a new perturbation variable, similar to the residual term in a regression, that has a mean of zero. It increases the variance but does not shift the mean:

u = measurement error random variable

X2 = X + u
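
A quick simulated illustration (made-up numbers): adding noise with a mean of zero leaves the mean of X roughly unchanged but inflates its variance.

set.seed( 42 )
x  <- rnorm( 10000, mean=50, sd=10 )
u  <- rnorm( 10000, mean=0,  sd=5 )   # measurement error
x2 <- x + u

c( mean(x), mean(x2) )    # means are nearly identical
c( var(x),  var(x2) )     # var(x2) is roughly var(x) + var(u)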

There are some examples starting on slide 9 here:

https://github.com/DS4PS/cpp-523-spr-2020/raw/master/lectures/p-09-specification.pdf