RohanAlexander / telling_stories

Telling Stories with Data
https://rohanalexander.github.io/telling_stories/

Luis Comments on Chapter 8 #41

Closed lacorreia65 closed 1 year ago

lacorreia65 commented 2 years ago

Chapter 8 - Farm Data

8.2 Measurement

8.3 Censuses and other Government Data

RohanAlexander commented 1 year ago
  • Comment on " … Black people in the US may limit the extent to which they describe their political and racial belief to White interviewers." • Recently we had a profusion of polls/surveys for the Brazilian elections, and several market research institutes/companies were unable to capture the real support for the far-right (don't know if this term is correct) candidate, mostly in "in-person" interviews. Some analyses have said those voters were shy and didn't want to explicitly reveal their voting intention when confronted. In the end, some institutes had indicated 31%-34% for Bolsonaro, while yesterday's 1st-round result put him at around 44%. On the other hand, some phone polls had indicated estimates of up to 42% (which is much more accurate).

Added that this can happen in politics also.

  • I loved your R-code examples, very neat and clear - btw I didn’t know the pipe operator could also be "|>" instead of the traditional "%>%" :)
  • "Regardless of how good our data acquisition process is, there will be missing data. When we talk of this it is important to remember that, in a sense, a variable has to be measured, or at least thought about and considered, in order to be missing." • At first I wasn't sure I understood this statement, but reading it again I think it refers to the 'measurement' limitations some data are subject to, right?

Have re-written.

  • "Non-response is a key issue, especially with non-probability samples, because there is usually good reason to consider that people who do not respond are systematically different to those who do." • This is veeeery good! I liked this very much and never thought about it in this perspective! :)
  • "… much of the changes in public opinion that are reported in the lead-up to an election are not people changing their mind, but differential non-response." • Not sure if I understood what "differential non-response" means. For example, in the Brazilian elections there will be a 2nd round in 3 weeks. Are we talking about people who voted for a candidate who didn't make it to the 2nd round, don't want to vote for either of the 2nd-round candidates, and are therefore willing to invalidate their votes?

Have re-written.
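
A minimal sketch of the idea behind the quoted sentence, assuming that "differential non-response" means supporters of different candidates answering polls at different rates from one poll to the next. The function name `observed_share` and all the rates are hypothetical, chosen only for illustration; they are not taken from the book or from any real poll.

```python
def observed_share(true_share_a, response_rate_a, response_rate_b):
    """Share of poll respondents backing candidate A.

    true_share_a: actual share of the population that supports A.
    response_rate_a / response_rate_b: probability that a supporter of
    A (or of B) answers the poll. All inputs are hypothetical.
    """
    responders_a = true_share_a * response_rate_a
    responders_b = (1 - true_share_a) * response_rate_b
    return responders_a / (responders_a + responders_b)


# True support never changes between the two polls.
true_share_a = 0.50

# Week 1: both groups respond at the same (assumed) rate.
week1 = observed_share(true_share_a, response_rate_a=0.10, response_rate_b=0.10)

# Week 2: A's supporters respond less and B's respond more (assumed rates).
week2 = observed_share(true_share_a, response_rate_a=0.08, response_rate_b=0.12)

print(f"Week 1 poll share for A: {week1:.2f}")
print(f"Week 2 poll share for A: {week2:.2f}")
```

In this toy setup the poll appears to swing away from candidate A even though nobody changed their mind; only the willingness to respond changed.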

  • Last paragraph in Section 8.2 is very thoughtful! I liked that

8.3 Censuses and other Government Data

  • "Census data are not unimpeachable, and common errors include under- and over-enumeration, as well as misreporting (Steckel 1991) and there are various measures and approaches used to assess quality (Statistics Canada 2017)." • Something came to mind when reading this paragraph. Much is being said about the Brazilian election polls and how to assess their quality, given some huge variations among institutes. Do you think some of the approaches suitable for censuses could also be applied to election polls to assess their quality? Which ones would they be?

Added a note about this.

  • "However, the term has become a little contentious because of how it has occurred in practice; the government is only providing data that it wants to provide, and will not make it look bad." • I think you are right. • When reading this phrase, it suggested to me that, to some extent, intentional biases might be introduced into government data in order to make it appear more palatable. And I wouldn't doubt it if we think of the electoral period in Brazil (I know…, I'm a bit disappointed about how politics has been treated in Brazil in the last 2 decades)

Have added some examples of this.

  • Another nice command I learned with your code - "slice_max()" :)
  • Btw, Canada databases are very cool!
  • Sub-section 8.3.1 - Maybe there is a typo in the phrase "And finally, list_census_vectors() provides the metadata about the variables that available." • " … about the variables that are available."?

Fixed!

8.4 Sampling essentials

  • "Wu and Thompson (2020, 3) describe statistics as 'the science of how to collect and analyze data and draw statements and conclusions about unknown populations.'" • This is a very nice description ;)

Yes, it's a lovely book!

8.4.2 Probabilistic sampling

  • "The most important aspect to be clear about with probability sampling is the role of uncertainty. This allows us to make claims about the population, based on our sample, with known amounts of error." • This excerpt makes me wonder about the questions the research institutes are facing about the Brazilian elections right after the vote of Oct 2nd. Several (if not all) institutes made wrong predictions about voters' intentions compared with what happened in the 1st round. Researchers are blaming the media outlets that presented the polls as a forecast of the election results, and they explained that this is a misuse of them. My own understanding is that those polls have two limitations: 1-they reflect the "instant intention" of voters, as if the election were held on the day of the interview (voters can change their minds when in front of the ballot box); and 2-I think they present the probabilistic error as if the sample had been taken by simple random sampling when they are really doing stratified or cluster sampling (or "combined", when they select a number of municipalities and then randomly select the participants), and the C.I.s are wider in the latter case, I guess. Besides, inside each cluster or stratum they fill quotas, which may lead to the interviewer-bias problem you mentioned earlier.
  • Another comment - the code illustrates the theory behind it very well :)
  • I loved the toddlers example for measuring parents' sleep hours :)

8.4.3 Non-Probabilistic sampling

  • A comment that came to mind when reading this chapter - in Brazil, research institutes use a hybrid sampling method: 1-cluster sampling to select municipalities within the country (not states); 2-then, at the 2nd level, stratified sampling to select people according to the last census percentages of (a) education, (b) gender (male/female), (c) age and (d) income; and finally 3-within those clusters/strata they select individuals by quota.
  • I hadn't heard of snowball sampling before - it is very interesting and well described. I wonder whether, depending on what we are trying to achieve with the survey, we are biasing it in some way at each subsequent stage, since the next k individuals might be correlated with those from the previous stage.
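
The point raised under 8.4.2, that an error margin computed as if for simple random sampling may understate the uncertainty of a clustered design, can be sketched with Kish's design-effect approximation, deff = 1 + (m - 1) * rho, where m is the average number of respondents per cluster and rho is the intra-cluster correlation. This is only an illustration under stated assumptions: the function names are hypothetical, and the sample size, cluster size, and correlation below are made up rather than taken from any Brazilian polling institute.

```python
import math


def srs_margin_of_error(p, n, z=1.96):
    """Margin of error for a proportion under simple random sampling."""
    return z * math.sqrt(p * (1 - p) / n)


def kish_design_effect(cluster_size, icc):
    """Kish's approximation: deff = 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc


def clustered_margin_of_error(p, n, cluster_size, icc, z=1.96):
    """SRS margin of error inflated by the square root of the design effect."""
    deff = kish_design_effect(cluster_size, icc)
    return srs_margin_of_error(p, n, z) * math.sqrt(deff)


# Hypothetical poll: 2,000 respondents, 20 per municipality,
# and an assumed intra-cluster correlation of 0.05.
p, n = 0.44, 2000
cluster_size, icc = 20, 0.05

moe_srs = srs_margin_of_error(p, n)
moe_cluster = clustered_margin_of_error(p, n, cluster_size, icc)

print(f"Margin of error assuming SRS:  +/- {100 * moe_srs:.1f} points")
print(f"Margin of error with clusters: +/- {100 * moe_cluster:.1f} points")
```

Under these assumed values the clustered margin is roughly 1.4 times the simple-random-sampling one. The sketch covers only the clustering component; stratification and quota selection, which are also part of the hybrid design described above, are not modelled here.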