Appendix: replicate m5 incorporating uncertainty

fsolt commented 4 years ago

Ingredients

[x] public opinion data from Model 5 (via the internal validation test of #2)
[x] article replication data from Dataverse (use dataverse if possible)
[x] purrr
[x] Rubin's rules

Up for this, Hu?

sammo3182 commented 4 years ago

The theta-hats are drawn directly from the distribution in data/claassen_m5_theta.rda, and 1987 is excluded to be consistent with the theta estimation. The code addressing the uncertainty (lines 237:240 in paper\dcpo_demsupport_appendix.Rmd) needs a double-check, @fsolt. If it is correct, the method can easily be applied to the DCPO results once the controls are ready.

fsolt commented 4 years ago

Your code looks great to me, Hu, but I think you are, as the newspaper reporters say, burying the lede:

Didn't you just show that C's finding doesn't hold up once the uncertainty in public opinion is taken into account?!? 👀 @Tyhcass

Sheesh—we could stop here and publish this. (I'd be sad if we did, of course, but still we totally could.)

Tyhcass commented 4 years ago

@sammo3182 Wow... Great job! Although I haven't fully understood the uncertainty code, still, WOW! Really?? @fsolt Hmm, for me, I am still looking forward to seeing the results from DCPO!

fsolt commented 4 years ago

Ah, I just thought of something, Hu: by taking 100 draws for each country, one year at a time, the current code disrupts each country's time series, doesn't it?

Let me dig into it a bit.

sammo3182 commented 4 years ago

Fred, I didn't get the disruption part. Our thetas already account for the time dependency, don't they? The only difference from Cls at this step is that he drew one value (the mean) but we are using 100. No?

I'm running the entire estimated distribution, i.e., all 1000 theta draws, which might narrow the confidence intervals a little bit. But...

The only "something" is the first lag of the DV as being red-squared. If 1000 still doesn't reduce the uncertainty very much, the problem should not be attributed to that...

fsolt commented 4 years ago

Okay, well, I wrote an alternate version of Hu's code that skips taking draws entirely—it just reformats all the claassen_m5_theta$theta draws and merges in all the control variables, then runs the model 1000 times and combines the results. When I saw how fast Hu's code ran, I thought we might as well, and it only takes 2-3 minutes. (And you were right, Hu, I was mistaken about the country names being in the theta file. Blech.)
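
A minimal sketch of the "run the model once per draw, then combine" step, for readers who want to see Rubin's rules spelled out. It is not the code from the repository: the data are simulated, the model is a plain lm(), and the names draws, fits, and rubin_pool are hypothetical. It uses base R only, where the thread's checklist mentions purrr. Under Rubin's rules the pooled estimate is the mean of the per-draw estimates, and the total variance is the mean within-draw variance plus (1 + 1/m) times the between-draw variance.

```r
# Sketch only: simulated data stand in for the real thetas and controls.
set.seed(1)
n_draws <- 200   # the thread uses all 1000 draws; fewer here to keep it quick
n_obs   <- 200

# Hypothetical stand-ins for the latent opinion score, a control, and the DV.
theta_true <- rnorm(n_obs)
control    <- rnorm(n_obs)
y          <- 0.5 * theta_true + 0.3 * control + rnorm(n_obs)

# draws[d, i]: draw d of theta for observation i (true value plus measurement noise).
draws <- t(replicate(n_draws, theta_true + rnorm(n_obs, sd = 0.5)))

# Fit the same model once per draw, keeping coefficients and their variances.
fits <- lapply(seq_len(n_draws), function(d) {
  dat <- data.frame(y = y, theta = draws[d, ], control = control)
  fit <- lm(y ~ theta + control, data = dat)
  list(est = coef(fit), var = diag(vcov(fit)))
})

# Combine the per-draw results with Rubin's rules.
rubin_pool <- function(fits) {
  m     <- length(fits)
  est   <- do.call(rbind, lapply(fits, `[[`, "est"))
  vars  <- do.call(rbind, lapply(fits, `[[`, "var"))
  q_bar <- colMeans(est)           # pooled point estimate
  w_bar <- colMeans(vars)          # average within-draw variance
  b     <- apply(est, 2, var)      # between-draw variance
  total <- w_bar + (1 + 1 / m) * b # total variance
  data.frame(term = names(q_bar), estimate = q_bar,
             std_error = sqrt(total), row.names = NULL)
}

pooled <- rubin_pool(fits)
print(pooled)
```

The pooled standard errors can never be smaller than the average within-draw ones, which is why accounting for uncertainty in the thetas can widen the intervals.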

My concern that the time-dependence of the thetas was being lost by extracting them one year at a time was misplaced, too. These results are nearly identical to Hu's:

So, right. Just incorporating the uncertainty in the thetas, as one simply has to do, reverses the main finding of the piece.

sammo3182 commented 4 years ago

> So, right. Just incorporating the uncertainty in the thetas, as one simply has to do, reverses the main finding of the piece.

And we have done it one thousand times (which is surprisingly fast, isn't it?), so this is the uncertainty of the estimation rather than white noise. On the other hand, I hope the DCPO part can bring more conclusive findings.

fsolt commented 4 years ago

> On the other hand, I hope the DCPO part can bring more conclusive findings.

We're totally fine no matter what we get. The old "logic of presentation != logic of discovery" thing applies. If our "better data, better measure, better method" results are positive, we present this null result first, and then rehabilitate the hypothesis with our better stuff. I agree with you, Hu, that would definitely be fun.

But if we get null results from our better stuff, I think we probably show only our better stuff in the text, push this null result to the appendix with some "just better data" and "just better measure" analyses that we'll need to do, and just refer to appendix in the text as we unpack why our result is different from C's. That will work just fine too.

Tyhcass commented 4 years ago

> We're totally fine no matter what we get.

Agree! But I'm still excited to see what DCPO could give us!
I've uploaded our whole set of control variables. I also made a small change to dcpo_input_cy.rda. @fsolt Maybe we could rerun DCPO? I found that the original dcpo_input_cy.rda has both North Macedonia and Macedonia (which changed its name to North Macedonia). There is no time overlap between North Macedonia and Macedonia (of course, they are the same country). However, if we use the two different country names, we will get separate thetas for the two from DCPO, unless it is the country code that is used when running DCPO. Anyway, I have now unified the name to North Macedonia in both control_variables.csv and dcpo_input_cy_update.rda. Some points about the control variables:

Now, we still have 40 observations with the DV missing. They are from two countries, Belize and Mozambique (1987-1993). V-Dem doesn't cover small countries like Belize, and it has no data for Mozambique between 1974 and 1993, which is confirmed by https://www.v-dem.net/en/analysis/CountryGraph/. I don't know where Cls got the V-Dem values for Mozambique from 1987 to 1993. We could simply remove Belize and Mozambique from our data. I also think we could remove Taiwan from our data.
I didn't follow all of Cls's approaches in dealing with control variables, since some approaches do NOT make sense to me. For example, Cls used Russia's GDP to fill in the USSR successors' missing GDP values. Given the big variation among the USSR successor states, I didn't do it that way. I just imputed GDP data from WDI and the Penn World Table.
Cls said he used GDP growth rate, but actually the variable in his data was just growth, not growth rate... My point is that we definitely could do better!! We already have; see the uncertainty results. Anyway, I think we are ready to see the DCPO results. Yeah!!! Btw, dealing with CP data is... interesting.

sammo3182 commented 4 years ago

> Now, we still have 40 observations with the DV missing. They are from two countries, Belize and Mozambique (1987-1993). V-Dem doesn't cover small countries like Belize, and it has no data for Mozambique between 1974 and 1993, which is confirmed by https://www.v-dem.net/en/analysis/CountryGraph/. I don't know where Cls got the V-Dem values for Mozambique from 1987 to 1993. We could simply remove Belize and Mozambique from our data. I also think we could remove Taiwan from our data.

Aye to these solutions. We are illustrating a general pattern about political institutions and public mood. Doing that does not require a complete survey of every single country in the world.

> I didn't follow all of Cls's approaches in dealing with control variables, since some approaches do NOT make sense to me. For example, Cls used Russia's GDP to fill in the USSR successors' missing GDP values. Given the big variation among the USSR successor states, I didn't do it that way. I just imputed GDP data from WDI and the Penn World Table.

Maybe he wanted to have some comparability with the USSR in the previous years? I support Cassandra's solution. After all, Russia is the real matter for that entry.

> Cls said he used GDP growth rate, but actually the variable in his data was just growth, not growth rate... My point is that we definitely could do better!! We already have; see the uncertainty results.

Can we do both? I can try to calculate the yearly growth rate. Just a ratio to the lag, right? Easy-peasy~
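
A rough sketch of that calculation, assuming a country-year data frame with no gaps in the years. The data frame gdp and its columns (country, year, gdppc) are made up for illustration and are not taken from control_variables.csv. The growth rate is the current value divided by the previous year's value within the same country, minus one.

```r
# Hypothetical toy panel; the real control variables are not reproduced here.
gdp <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  year    = c(2000, 2001, 2002, 2000, 2001, 2002),
  gdppc   = c(100, 110, 121, 200, 190, 209)
)

# Sort so that the lag within each country is the previous year.
gdp <- gdp[order(gdp$country, gdp$year), ]

# Lag GDP per capita within country; the first year of each country has no lag.
gdp$gdppc_lag <- ave(gdp$gdppc, gdp$country,
                     FUN = function(v) c(NA, head(v, -1)))

# Yearly growth rate: ratio to the lag, minus one.
gdp$growth_rate <- gdp$gdppc / gdp$gdppc_lag - 1
print(gdp)
```

If a country has missing years, the lag would need to be matched on year rather than on row position.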

> Btw, dealing with CP data is... interesting.

Lol~

Tyhcass commented 4 years ago

@fsolt I uploaded dcpo_input_update.rds, which changes Macedonia to North Macedonia for the reason mentioned above. However, I don't know why process.R in dcpo_demsupport_kfold failed to run it. @sammo3182 ran into the same problem. I guess maybe there is something wrong with dcpo_input_update.rds. So, could you please create an rda file with the changed country name and run dcpo and dcpo_kfold when you are available? Then we can merge the DCPO results and the control variables to do the analysis. Thanks so much!

fsolt commented 4 years ago

You mean the two unchecked boxes on my to-do list for #2? Or something else that I've overlooked? 🙀

fsolt commented 4 years ago

Re GDPpc, @Tyhcass, you should check out the New Maddison Project. It's not really new; it only goes up to 2016, but it may be more complete than other sources.

Tyhcass commented 4 years ago

> You mean the two unchecked boxes on my to-do list for #2? Or something else that I've overlooked? 🙀

@fsolt Hmm, not that one... You might have overlooked my comment from 4 days ago in this thread. I copied the main point here. The main point is that we have both North Macedonia and Macedonia in dcpo_input_cy.rda, but they are the SAME country. Right now, dcpo_input.rda treats them as two separate countries. I have already updated dcpo_input_cy.rda to fix this issue. I also created dcpo_input_update.rda by changing Macedonia to North Macedonia. However, when Hu and I ran dcpo_support_kfold, our jobs failed. I am not sure whether there is some problem with my dcpo_input_update.rda file. So, could you please run dcpo_demsupport and dcpo_demsupport_kfold at your end? Thanks.

> I've uploaded our whole set of control variables. I also made a small change to dcpo_input_cy.rda. @fsolt Maybe we could rerun DCPO? I found that the original dcpo_input_cy.rda has both North Macedonia and Macedonia (which changed its name to North Macedonia). There is no time overlap between North Macedonia and Macedonia (of course, they are the same country). However, if we use the two different country names, we will get separate thetas for the two from DCPO, unless it is the country code that is used when running DCPO. Anyway, I have now unified the name to North Macedonia in both control_variables.csv and dcpo_input_cy_update.rda.
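
For illustration, the renaming step might look something like the sketch below. The toy data frame dcpo_input and its columns are assumptions, not the actual contents of dcpo_input_cy.rda. The final check confirms that merging the two names leaves no duplicated country-years, which should hold when the two names never overlap in time.

```r
# Hypothetical toy input; the real file has more countries and columns.
dcpo_input <- data.frame(
  country = c("Macedonia", "Macedonia", "North Macedonia", "Albania"),
  year    = c(2005, 2010, 2019, 2010),
  stringsAsFactors = FALSE
)

# Unify the two names so the country is treated as one unit.
dcpo_input$country[dcpo_input$country == "Macedonia"] <- "North Macedonia"

# Sanity check: no country-year appears twice after the merge.
stopifnot(anyDuplicated(dcpo_input[, c("country", "year")]) == 0)

print(table(dcpo_input$country))
```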

Tyhcass commented 4 years ago

> Re GDPpc, @Tyhcass, you should check out the New Maddison Project. It's not really new; it only goes up to 2016, but it may be more complete than other sources.

Great! I will add one column for Maddison 2018 data.

Tyhcass commented 4 years ago

@fsolt @sammo3182 control_variables with data from Maddison project is uploaded~~ Now, waiting for our DCPO results.

sammo3182 commented 4 years ago

Ahmm... Fred @fsolt, with a close look at the cleanup code (esp. here), I think you are drawing the same fixed line from the theta dataset every time, right? x was not used in the function at all. I guess you meant

claassen_m5_theta$theta[x,,] %>% 
...

Correct me if I'm wrong.
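
To make the point concrete, here is a small sketch of the difference. It assumes the theta object is a three-dimensional array with draws on the first dimension, as the [x,,] indexing suggests. The array theta below is random toy data, and the "unused index" version is only a guess at the general shape of the problem, not the code that was actually pushed.

```r
set.seed(2)

# Hypothetical stand-in for claassen_m5_theta$theta: 4 draws x 3 countries x 2 years.
theta <- array(rnorm(4 * 3 * 2), dim = c(4, 3, 2))

# Problem pattern: x is never used, so every iteration returns the same slice.
fixed <- lapply(1:4, function(x) theta[1, , ])

# Intended pattern: x selects the draw, so each iteration gets its own slice.
by_draw <- lapply(1:4, function(x) theta[x, , ])

# All "fixed" slices are identical; the "by_draw" slices differ.
print(identical(fixed[[1]], fixed[[4]]))
print(identical(by_draw[[1]], by_draw[[4]]))
```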

fsolt commented 4 years ago

Exactly right--I pushed the test version; sorry about that!

fsolt commented 4 years ago

(Moved to text, of course)

fsolt / dcpo_dem_mood

Appendix: replicate m5 incorporating uncertainty #3