Corrections to final project for use as sample

rpseely commented 4 months ago

Tracking Corrections to Final Project

As instructed by Professor Buzard, I will be making some edits to the final report write-up and the do-file for potential use as a sample project in future sections of ECN 310. I will be using the critique from the grading rubric used by Prof. Buzard to guide my editing of the project.

rpseely commented 4 months ago

Saturday, February 10, 2024

I started today's session by reviewing the comments made by Professor Buzard on the grading rubric.
I began editing the code of nepobabies.do by looking at the two-sample t-tests of the demographics and adding a line of code that would prevent the t-test from double counting the male nepobabies in both the nepobabies population and the sample population. I still need to run this code to see if it worked.
I then started making edits to the abstract and introduction on the NepobabiesFinalReport.tex file. They were made for clarity and grammar/conventions purposes. I will not explain each change made on this issue as it should be fairly obvious. When it comes to substantive changes in the writing, however, I will make sure to explain why the change is being made.

rpseely commented 4 months ago

Sunday, February 11, 2024

I started by attempting to write a clearer explanation of how we defined the month of hire in the data section. What I have so far is certainly more comprehensive, but it should be edited for clarity and readability.
I made minor edits to some phrasing, also in the data section.
I skipped the literature review section for now, as I want to examine that in more detail after reviewing the more comprehensive lit. review that we completed in the google doc.
My second chunk of work on the project today was looking through the results section. I found an error/mismatch in our analysis of the mid-low and mid-high unemployment rate groups. I am not totally sure where this error is coming from, so I will have to open up the dataset and do-file and see what is going on. At least part of the error is a mismatch in paragraph four of the results section: we begin by talking about a comparison between the mid-high and mid-low groups, but then talk about a comparison between the mid-low and low groups while writing about the same t-test. I believe the reference to mid-low and low is the part that is wrong, so it must be corrected. A two-sample t-test between mid-low and low should also be added so that we have further evidence when talking about the possibility of nepotistic relationships being formed during most economic conditions, except when times are really good.
On that note about the fourth paragraph of results, it could probably be split into at least two paragraphs. The sentence that starts with "The final two t-tests we ran provide evidence..." could be the start of another paragraph.
Additionally, I believe that the two frequency tables might be suitable to be combined into one, as they are somewhat redundant, and it would be more helpful to have all of that information in one place. As I was reviewing the first frequency table, I found myself scrolling back and forth to the second one to check the information it contained.

Note: These changes and notes on errors are all on my iPad and have not yet been edited into the .tex document.

rpseely commented 4 months ago

Monday, February 19, 2024

I made some edits to the data section for much better understanding/clarity.
@kbuzard In your grading rubric you suggested that we be clearer about why we chose the 30-year-old criterion. When we made that decision, I think we were trying to keep our sample size as large as possible while also keeping the sample to those who reasonably could've been entering the workforce, or making an early-career industry change, that would have been helped by their parents. That is, we did not look for or see any strong reasoning for 30 years old being a cutoff; rather, we just chose a reasonable cutoff. Should I explain that in the data section, and then maybe in the discussion explain that 30 years old is not a perfect cutoff?
I am still working on an improved definition of our estimated month of hire. I will put below what I have so far, but it is definitely way too wordy and will absolutely have to be edited down.

Hire month estimation explanation: Using the job length variable and the GSS interview date variable, we estimated the month and year that a respondent was hired. The estimated month of hire was created by taking the interview date and subtracting the number of years of employment at the respondent's current job times 12. This means that the estimated month of hire is always the same calendar month as the month in which a respondent was interviewed (unless a respondent had been employed for less than one year). This is because responses to the job length question were recorded only as one quarter of a year, three quarters of a year, or a whole number of years starting from one.
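
The arithmetic in this explanation can be sketched as follows. This is a minimal illustration in Python rather than the project's Stata code; the function name and the example dates are hypothetical, and it assumes job length is recorded as 0.25, 0.75, or a whole number of years, as described above.

```python
def estimated_hire_month(interview_year, interview_month, years_on_job):
    """Return (year, month) of the estimated hire date.

    years_on_job is assumed to be 0.25, 0.75, or a whole number of years.
    """
    # Work in months since year 0 so the subtraction crosses year boundaries cleanly.
    interview_index = interview_year * 12 + (interview_month - 1)
    hire_index = interview_index - round(years_on_job * 12)
    return hire_index // 12, hire_index % 12 + 1


# Whole years on the job: the hire month equals the interview month.
print(estimated_hire_month(2018, 4, 3))      # (2015, 4)
# Less than one year on the job: the hire month can differ.
print(estimated_hire_month(2018, 4, 0.25))   # (2018, 1)
print(estimated_hire_month(2018, 2, 0.75))   # (2017, 5)
```

The first call shows why the estimate is only as precise as the job length answer: anyone with a whole number of years lands on their interview month.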

rpseely commented 4 months ago

Monday, February 19, 2024 (continued)

I continued looking at the analysis section (the parts I had not already looked over) and made some minor edits where necessary for better readability. I do have some questions for @kbuzard , but I figure it will be easier/quicker to discuss in person, so maybe we can set up some time this week/next. The questions (from today) mostly pertain to best practice. I still need to look through the code for the earlier part of the results section where I found a discrepancy/have some confusion. These edits were made on my iPad, but I have been tracking when I make edits to the .tex file and marking it off on my PDF.
The discussion section largely looked good. A couple small fixes and it should be fine.
One typo in the conclusion, otherwise that section looked good.
I want to also go back and edit the bibliography. I wonder if there is a way to just format the references correctly (hanging indents, double spaced?) without having to make them like overleaf/LaTeX sources.

Update: all minor changes, i.e. spelling, conventions, word choice (non-substantive), have been updated from iPad PDF to .tex file.

kbuzard commented 4 months ago

Should I explain that in the data section, and then maybe in the discussion explain that 30 years old is not a perfect cutoff?

I think this is a good plan!

On the explanation of the "hire month" variable: what you have is a good start, but I agree it could be clearer. It doesn't necessarily have to be a lot shorter. In particular, you can integrate a discussion here of the windows of time you used to figure out when the hire month was during a bad labor market, and any robustness checks you ran to deal with this (or robustness checks you think should be run even if you're not going to include them in this report).

I want to also go back and edit the bibliography. I wonder if there is a way to just format the references correctly (hanging indents, double spaced?) without having to make them like overleaf/LaTeX sources.

You can do this for sure; google/chatgpt can help you out. I'm never sure whether it is more frustrating to make and implement a .bib file or to try to get LaTeX to format things the way I want them. It's probably six of one, half a dozen of the other. It will look more impressive if done with a .bib file, but I don't think it's necessary.

it will be easier/quicker to discuss in person, so maybe we can set up some time this week/next

Sure! Let me know what will work for you. I'm going to be on campus for at least part of each of the next three days, and I'm also happy to Zoom (we could even just stay on after the meeting tomorrow night).

rpseely commented 4 months ago

Wednesday, February 21, 2024

@kbuzard and I discussed edits to the analysis section after the team meeting today. Here is what we came up with:

1. Definition of month hire conversation:

Make sure that an error in the measurement of months is mentioned in the discussion section.
Look into a robustness check again. See if we can do this (without spending too much time) and then make a determination about whether to emphasize it in the discussion section or add a robustness check to the analysis.
It is also worth looking into the location-level data. Is it even available in the public GSS data? If so, look around and see what is going on there. But most likely, it should be mentioned in the discussion section with the caveat that looking around in the data may not even be very useful, as we do not know where people were hired.

2. Quartiles categorized by unemployment level and t-tests

Talk about how we chose the quartile levels. The quartiles were chosen based on the unemployment rates, not based on the number of people. See if we can re-classify the quartiles based on people.

3. Chi-square vs. t-test

When it comes to how we analyzed the association between unemployment levels and nepotism status, it is not worth including the t-tests when they do not offer anything that the chi-square analysis does not.
For the demographic/work-characteristic information, every t-test should be re-run as a chi-square analysis because that is the correct analysis for two categorical variables! Re-examine what the results are after and then update the analysis as needed.
For the chi-square analysis already there for the association of nepobaby status and unemployment level, briefly explain why it was used: examining association between two categorical variables.

4. Graphs/Tables

Get rid of the first table and graph entirely.
For the second table (now the only table!), add in a percentage row to show the proportion of nepobabies in each category.

5. Gender dynamics analysis

Definitely re-run the chi-square analysis as for the other characteristics and see what is going on there.
Can definitely cite a paper on gendered dynamics in the workforce and potentially the gendered dynamics of fathers/sons and mothers/daughters working in the same jobs.
Then revisit whether men are more likely to take advantage of these resources and/or more likely to offer these resources.
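
The distinction in item 2, between quartiles of the unemployment rate series and quartiles of the people in the sample, can be illustrated with a small sketch. It is written in Python for illustration only, not taken from the do-file. Both lists are hypothetical numbers, `group_sizes` is a made-up helper, and the "inclusive" quantile method is just one reasonable choice of cutpoint rule.

```python
from statistics import quantiles

# Hypothetical monthly unemployment series (one value per month).
monthly_rates = [4.0, 4.5, 5.0, 5.5, 6.0, 7.0, 8.0, 9.0, 10.0]
# Hypothetical unemployment rate at time of hire, one value per respondent.
rates_by_person = [4.0, 4.0, 4.0, 4.5, 4.5, 4.5, 5.0, 5.0, 5.5, 6.0, 7.0, 9.0]


def group_sizes(rates, cutpoints):
    """Count how many respondents fall into each of the four groups."""
    sizes = [0, 0, 0, 0]
    for rate in rates:
        # A respondent's group is the number of cutpoints strictly below their rate.
        group = sum(rate > cut for cut in cutpoints)
        sizes[group] += 1
    return sizes


# Cutpoints taken from the rate series vs. from the respondents themselves.
rate_cuts = quantiles(monthly_rates, n=4, method="inclusive")
people_cuts = quantiles(rates_by_person, n=4, method="inclusive")

sizes_rate_based = group_sizes(rates_by_person, rate_cuts)
sizes_people_based = group_sizes(rates_by_person, people_cuts)
print(sizes_rate_based)    # [8, 2, 1, 1]
print(sizes_people_based)  # [3, 3, 3, 3]
```

With these made-up numbers, cutpoints from the rate series leave most respondents in the lowest group, while cutpoints from the respondents give equal-sized groups. That is the trade-off to weigh when re-classifying.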

rpseely commented 4 months ago

Monday, February 26, 2024

I added in a chi-square test for the relationship between nepobaby gender and parent of nepobaby gender. I only had limited time so I didn't run it with all the other code, but at least now I know the few lines I wrote should work for the chi-square test when it is done with all the other code. I also understand better about how I will need to re-setup the variables for a chi-square analysis.

After speaking with Professor Buzard, we made the determination that the updates to the project should be done right around spring break!

rpseely commented 4 months ago

Tuesday, February 27, 2024

I ran the code for the first chi-square analysis about the relationship between nepobaby sex and parent sex. The result was significant, which is expected as we had a strong result for this anyway.
I want to try to figure out a bar graph where one bar is where nepoparentsex is female, and there are two smaller bars within that bar, stacked vertically, representing the proportion of male nepobabies with a nepotistic mother and female nepobabies with a nepotistic mother, and then the other bar would be the same but where nepoparentsex is male.
I added proper t-tests and chi-square tests for income, class, race, gender, and hours worked. Like the t-tests, these yielded no significant differences between the nepobaby sample population and the non-nepobaby sample population.
I also added a proper chi-square test for the job safety category, which yielded a significant result like the t-test.
I am currently looking through the literature to find some prior analysis showing that mothers are more likely to be in nepotistic relationships with daughters and fathers with sons. I came across something showing that fathers are more likely to be influential in creating nepotistic relationships, so I should add a chi-square to see if fathers are more likely to be the nepotistic parent. Link to said paper
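
For readers who want to see what a chi-square test of association between two categorical variables computes, here is a minimal sketch. It is in Python rather than Stata and is not the project's code. The counts are hypothetical, the function names are made up for illustration, and the statistic is the plain Pearson chi-square with no continuity correction. The p-value uses the closed-form upper tail for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        series = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * series
    series = sum(half ** (k - 0.5) / math.gamma(k + 0.5)
                 for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * series


def chi_square_test(table):
    """Pearson chi-square test of independence for a table of counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, chi2_sf(stat, df)


# Hypothetical counts: rows = nepoparent sex, columns = nepobaby sex.
example = [[10, 20],
           [20, 10]]
stat, df, p = chi_square_test(example)
print(round(stat, 3), df, round(p, 4))
```

Because both variables are categories, the test compares observed cell counts with the counts expected under independence, which is why it suits these variables better than a t-test on group means.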

rpseely commented 3 months ago

Tuesday, March 5, 2024

Editing frequency/proportion table

I started by changing the frequency table to include the proportions, as discussed. I would appreciate your feedback, @kbuzard, on how you think it looks as is. I also added in the code for the table into the github copy of the .tex file.

There was an error in the figure that showed the ratios of nepobaby to non-nepobaby based on hiring group. There was a different proportion shown on the graph than calculated by me today looking at the new table. The one calculated today was correct. I corrected this error, updated the code, and updated the figure in the .tex file.

Nepobaby sex vs. Nepoparent sex Figure

I then created a grouped bar chart for the proportion of nepobaby sex by nepoparent sex. This was laborious (because I accidentally deleted the code twice and could not get it back) but ultimately produced a really useful (really cool) graphic that I think I will put in the final report. I uploaded it to GitHub and it is titled "Nepoparent.sexproportions.pdf"

Sensitivity analysis

After some good back and forth with ChatGPT, it seems like trying to code it out and not load in different unemployment rate datasets will be very tricky. I tried to do this and each time I got over a thousand missing observations out of a sample of about 3,500. I believe the best course of action would be to make different copies of the FRED unemployment rate data. In each FRED dataset I would create a variable called ymhiredate_3m (for minus three months) or ymhiredate_6p (plus six months, and so on), except I would create a corresponding variable before merging with the same name, as I did for the original analysis. @kbuzard Let me know how you feel about this! This is something I could do over the weekend.

rpseely commented 3 months ago

Tasks as of March 5, 2024

I want to organize what I still have to do at this point...

Sensitivity analysis

[x] Create different datasets from FRED, merge based on new ymhiredate variables, re-run chi-square analysis.
[ ] Add sensitivity analysis into paper.
[ ] Make sure to add detailed documentation on how I did the sensitivity analysis.

Gender dynamics

[x] Add in the new grouped bar chart
[ ] Add a new piece of literature into the lit. review

T-Test vs. Chi-Square

[x] Delete the t-test portion of the analysis section
[x] Make sure that the demographic tests of the analysis section say chi-square and not t-test (started)
[x] Fix the chart as spoken of below between myself and Prof. Buzard

Fix up code

[ ] Add in comments where necessary from newer work
[ ] Get rid of all old work that is being replaced

Data Section

[x] Final draft of hire month definition

References

[x] Put the references in a .bib file

Final Product

[ ] Compare and correct differences between .tex file on github and the file on overleaf
[ ] Ensure that the references in the .bib file compile in overleaf.

kbuzard commented 3 months ago

I started by changing the frequency table to include the proportions, as discussed. I would appreciate your feedback, @kbuzard, on how you think it looks as is.

@rpseely I think it looks good! My only suggestions are to

replace "Sample" with something like "# Observations". This is just the more common usage.
think about changing the term "hire group". It's just not obvious what this means if someone glances at the chart. Because many people skim papers by looking at the figures, it's always a good idea to explain everything in the chart, even if it requires notes beneath or something similar.

I believe the best course of action would be to make different copies of the FRED unemployment rate data. In each FRED dataset I would create a variable called ymhiredate_3m (for minus three months) or ymhiredate_6p (plus six months, and so on), except I would create a corresponding variable before merging with the same name, as I did for the original analysis. @kbuzard Let me know how you feel about this! This is something I could do over the weekend.

Is what you're thinking about a single dataset with multiple variables, where each column is "offset" by 3 or 6 months? If so, I think that makes sense.

rpseely commented 3 months ago

think about changing the term "hire group". It's just not obvious what this means if someone glances at the chart. Because many people skim papers by looking at the figures, it's always a good idea to explain everything in the chart, even if it requires notes beneath or something similar.

Would putting "Unemployment Level at Time of Hiring" be good for the title? And then in the chart I am not sure what else would be short enough and offer enough explanation, so I think adding in a note beneath would be best.

Is what you're thinking about a single dataset with multiple variables, where each column is "offset" by 3 or 6 months? If so, I think that makes sense.

Hmmm. That could work, too. My thought was to essentially create multiple copies of the FRED unemployment rate dataset that we used to merge the unemployment rates in, and then each copy of the data would have the month offset and the corresponding ymhiredate_xx variable. I will see if I can whip one of these up before our meeting to show it.

I actually just understood your idea with the other approach! I think I could just go back to the original unemployment rate dataset and create multiple columns with the correct offset. I will try that, first.
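
The single-dataset idea discussed here, one unemployment rate table with an extra column per offset, can be sketched like this. It is a Python illustration, not the Stata or Excel work itself. The rates are hypothetical, the month index is an arbitrary running count of months, and the helper name and `_m3`/`_p6` suffixes are assumptions modeled on the variable names in this thread.

```python
# Hypothetical monthly unemployment rates, keyed by a running month index.
unrate = {600 + i: rate for i, rate in enumerate(
    [5.0, 5.1, 5.3, 5.2, 5.0, 4.9, 4.8, 4.7, 4.7, 4.6, 4.5, 4.4, 4.4])}

# Offsets in months: m = minus, p = plus.
OFFSETS = {"m6": -6, "m3": -3, "p3": 3, "p6": 6}


def add_offset_columns(unrate, offsets):
    """Return one row per month holding the rate at that month and at each offset."""
    table = {}
    for month, rate in unrate.items():
        row = {"unemployrate": rate}
        for suffix, shift in offsets.items():
            # None marks months whose shifted date falls outside the series.
            row["unemployrate_" + suffix] = unrate.get(month + shift)
        table[month] = row
    return table


table = add_offset_columns(unrate, OFFSETS)
print(table[606])
```

Each respondent then needs only one lookup on the original hire month, and every offset rate comes along as its own column.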

kbuzard commented 3 months ago

Would putting "Unemployment Level at Time of Hiring" be good for the title? And then in the chart I am not sure what else would be short enough and offer enough explanation, so I think adding in a note beneath would be best.

@rpseely This is a great title! You could potentially just not use the short descriptor if it's possible to leave that upper left box blank. If you have a good title and everything is otherwise well described, I think it will be clear that this is the only thing you're analyzing in the table.

rpseely commented 3 months ago

Saturday, March 9

Sensitivity Analysis

Tried it with manipulating excel sheet but hit a roadblock

I've updated the original FRED unemployment rate data set to contain the necessary variables for the sensitivity analysis and uploaded it to GitHub. I then converted that excel sheet into a .dta stata file.
Now, I am trying to actually perform the analysis, but I am having a hard time wrapping my head around how it will work, in terms of the code. I am taking a break from this and will come back to it. My current thought is to go back into the code and see where I tied the ymhiredate to each observation.

Explanation of why I don't think it can work using one excel sheet like so

Here is where the issue lies. I created the ymhiredate for each observation based on their interview year and month and the amount of time each respondent held their job. The code was:

gen ymhiredate = ymintdate - (yearsjob*12)

Then in the FRED unemployment rate dataset I created the ymhiredate variable manually. Then I merged the FRED and GSS datasets based on the ymhiredate variable. Here is the code:

merge m:m ymhiredate using "C:\Users\rpseely\OneDrive - Syracuse University\Documents\GitHub\exercises\course-project nepobabies\FRED_unrate_60to22_robust.dta"

So now, as you can see in the screenshot I shared above, each observation has the ymhiredate for each of the timeframes we want to check for the sensitivity analysis. Except, it only has the unemployment rate of the originally defined ymhiredate, and it has the other ymhiredate_xn variables, but that does not really do anything other than assign an additional, nonfunctional value (as of now).

Trying to code it out

At this point, I think the best way to code it would be something like this:

gen unemployrate_m3 = unemployrate of ymhiredate - 3

That is not the correct usage of the "of" command in Stata, but that is what I am looking for. It does not seem like ChatGPT can do it like this either. ChatGPT gave me this line of code, which bases unemployrate_m3 on the average of the unemployment rate of three months before the current observation (239 missing observations).

bysort ymhiredate: egen unemployrate_m3 = mean(unemployrate[_n-3]) if _n > 3

I guess that is still checking how robust our results are, but it seems a little messy/noisy?
I then asked ChatGPT to give me the same thing, but instead of an average of the last three months, just the value of month three before the current one. It gave me this line of code, which gave me 1,192 missing observations:

bysort ymhiredate: gen unemployrate_m3 = unemployrate[_n-3]

When I ask ChatGPT for a better explanation of what that line of code does, it sounds like it gives the unemployrate value for the observation that is three prior, not the unemployrate value for the ymhiredate that is three less than the current one. I am fighting it out with ChatGPT but it just does not seem to want to give me the code I am looking for. I will come back to this on Monday or Tuesday.
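
The difference described here, between the value from the observation three rows earlier and the rate for the hire month three months earlier, can be shown with a toy example. This is a Python sketch with hypothetical numbers, not a translation of the Stata lines above. It ignores the by-group sorting and only illustrates the row-versus-key distinction.

```python
# Hypothetical monthly unemployment rates, keyed by a running month index.
unrate_by_month = {598: 4.8, 599: 4.9, 600: 5.0, 601: 5.1, 602: 5.2, 603: 5.3}

# Hypothetical respondents, sorted by estimated hire month.
hire_months = [601, 601, 601, 602, 603]

# Rate at the estimated hire month, as in the original merge.
merged_rates = [unrate_by_month[m] for m in hire_months]

# Row-based lag: the rate attached to the observation three rows earlier.
row_lag = [merged_rates[i - 3] if i >= 3 else None
           for i in range(len(merged_rates))]

# Key-based lookup: the rate for the month three months before each hire month.
key_lookup = [unrate_by_month.get(m - 3) for m in hire_months]

print(row_lag)     # [None, None, None, 5.1, 5.1]
print(key_lookup)  # [4.8, 4.8, 4.8, 4.9, 5.0]
```

The row-based version depends on how many respondents share a hire month and leaves gaps at the top, while the key-based version depends only on the date. That is consistent with the missing observations reported above.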

Tentative success?

I gave it one last go using an excel sheet where I manually made the ymhiredate 3 months prior to the current one, as shown above. The difference is that it only had the ymhiredate_m3 and the unemployrate variable (labeled as unemployrate_m3 for clarity). At first it did not work because I had a line of code that negated this (gen ymhiredate_m3 = ymhiredate - 3). Then I removed this line of code and simply set ymhiredate = ymhiredate_m3. Then I re-ran the rest of the code, and I believe the code ran the robustness check I intended! Unless Professor Buzard has any qualms with this method, I will go forward with the same method for the sensitivity analysis for minus 6 months, plus 3 months, and plus 6 months. The p-value was 0.089, so the result is not significant at the 0.05 level (95% confidence), but there may still be something going on there. Definitely something to make note of, especially in light of the results of the next few parts of the robustness check.
NOTE: do not forget to remove observations for those who have held a job less than one year for the m6, p3, p6 checks.

kbuzard commented 3 months ago

I gave it one last go using an excel sheet where I manually made the ymhiredate 3 months prior to the current one, as shown above. The difference is that it only had the ymhiredate_m3 and the unemployrate variable (labeled as unemployrate_m3 for clarity). At first it did not work because I had a line of code that negated this (gen ymhiredate_m3 = ymhiredate - 3). Then I removed this line of code and simply set ymhiredate = ymhiredate_m3. Then I re-ran the rest of the code, and I believe the code ran the robustness check I intended! Unless Professor Buzard has any qualms with this method, I will go forward with the same method for the sensitivity analysis for minus 6 months, plus 3 months, and plus 6 months. The p-value was 0.089, so the result is not significant at the 0.05 level (95% confidence), but there may still be something going on there. Definitely something to make note of, especially in light of the results of the next few parts of the robustness check.

This sounds like what I was envisioning!

rpseely commented 3 months ago

Monday, March 11

Sensitivity Analysis

I successfully imported the necessary datasets for the sensitivity analysis and then converted them into .dta datasets. I then performed the analysis and will be writing up those results soon. Preview: At two levels there was a significant association between being a nepobaby and the unemployment level at the 0.10 significance level (90% confidence), at one level there was a significant association at the 0.05 significance level (95% confidence), and at one level there was no significant association.

References

I put all my references into a .bib file using Zotero. However, after I uploaded the .bib file into overleaf, when I recompiled the document with the .bib file, nothing appeared. Here is the code I used in overleaf:

\newpage
\section*{Bibliography}
\singlespacing
\setlength\bibsep{1pt}

\bibliographystyle{plain}
\bibliography{nepobabiesreferences}

\end{document}

@kbuzard Do you have any advice on this? Or any resources you have found helpful?

kbuzard commented 3 months ago

Do you have any advice on this? Or any resources you have found helpful?

Some questions to start:

Do you have \usepackage{natbib} in your preamble?
Do you get any error message?
Have you tried \bibliography{nepobabiesreferences.bib}? Some of my files have the .bib and some don't. I'm not sure if it matters.
Have you tried moving \bibliographystyle{plain} to the preamble?

kbuzard commented 3 months ago

@rpseely I forgot to tag you in above post. Not sure you'd get a notification, so here's one for sure!

rpseely commented 3 months ago

Tuesday, March 12

Editing Final Report

I added in the final draft of the hire month definition to the data section.
I corrected the language in the .tex file to reflect that correctly performed chi-square analyses were done, as opposed to the previously done t-tests.
I removed the two-sample t-test portion of the analysis from the first part of the results section where we analyze the association between labor market competition at time of hiring and nepotism status.

References

@kbuzard Thank you! I tried all of those suggestions (and combinations of them) and I have not been able to get it to work.

Do you get any error message?

Yes, I get one error message related to the bibliography. It states:

Package natbib Warning: Empty `thebibliography' environment on input line 3.

I asked ChatGPT what this means, but none of the possible causes it suggested looks like the problem to me.

kbuzard commented 3 months ago

@rpseely Okay, given that error, I think I might know what's going on. Check out this answer to this question for details.

The \bibliography command ONLY prints the references for papers that are cited in the paper. And by "cited," I mean programmatically with the \cite{} command or similar. My best guess is that you've hard coded the references in the body, but natbib doesn't see them as references.

rpseely commented 3 months ago

The \bibliography command ONLY prints the references for papers that are cited in the paper. And by "cited," I mean programmatically with the \cite{} command or similar. My best guess is that you've hard coded the references in the body, but natbib doesn't see them as references.

This was exactly right! I must go back into the overleaf file and put the \cite{} commands in the literature review, and then it should correctly compile the bibliography. I also want to change the citation style.

rpseely commented 3 months ago

Wednesday, March 12

Results Error

As I was going through the sensitivity analysis, I noticed that something seemed a little off with the results so I went back to check through a few things. I tabulated the nepobaby and yearintv variables and saw that there were 0 nepobabies that were interviewed in 2022, one of our five survey years. This was clearly troubling, so I was checking why that might be and found that the maind10 and paind10 variables were "not available in this release" for 2022. Therefore, there could not be any nepobabies from 2022 and the results are not valid.
I re-ran the chi-square tests without the 2022 sample and with quartiles altered for the available data, and there was no significant association between nepotism status and unemployment rate at time of hiring. This directly conflicts with what we said our findings were in our final report.
I then tried to see if any demographic of the sample might have a significant association between their nepotism status and unemployment rate at time of hiring. I created the appropriate variables and ran chi-square tests for the following demographics: white respondents, male respondents, female respondents, respondents who reported themselves as middle or upper class, respondents who reported themselves as lower or working class, respondents with a high school degree or more education, respondents with less than a high school education, respondents with a college degree or more education, respondents less than 26 years old, respondents greater than 25 years old, and respondents greater than 29 years old. For none of these demographics was there a significant association. There was a significant association for the age variable, but I don't think that really holds any meaning, as agehire is what I had meant to put there, anyway, and there was no significant association for agehire.
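
The tabulation that exposed the 2022 problem, counting nepobabies by interview year and looking for an empty cell, can be sketched in a few lines. This is a Python illustration with hypothetical records. The years other than 2022 are placeholders, not a claim about which survey years were used.

```python
from collections import Counter

# Hypothetical (interview year, is_nepobaby) records.
records = [
    (2014, True), (2014, False), (2014, False),
    (2016, True), (2016, False),
    (2018, True), (2018, False), (2018, False),
    (2022, False), (2022, False), (2022, False),
]

# Cross-tabulate: one count per (year, nepobaby status) cell.
cell_counts = Counter(records)
years = sorted({year for year, _ in records})

# Years with respondents but no nepobabies at all deserve a closer look.
years_without_nepobabies = [y for y in years if cell_counts[(y, True)] == 0]
print(years_without_nepobabies)  # [2022]
```

An empty cell like this does not show that nobody in that year was a nepobaby. As found above, it can mean the variables needed to classify them were not released.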

Going forward

This means that the report will have to be edited for each place where we claim there is a significant association. On top of that, I think it would be best to alter the story of our results. Obviously the story comes to a different conclusion, but I think other parts of it should be edited so it is not just "no significant association." I think the narrative should now just be that young adults rely on their parents at all times, not just during hard times, or something along those lines. I already ran a chi-square test confirming that more young adults are in the same industry as their parents than older adults (respondents are excluded if their mother OR father industry is missing). I would also want to include some more literature on young adults and nepotism.

kbuzard commented 3 months ago

@rpseely Well, now you've truly had the research experience! I don't mean to sound flip...this is just the kind of thing that happens all the time. It's really frustrating, but it's just life in this business.

The first thing I would suggest is kicking the tires a little bit. With no context, it would be a little surprising that dropping the 2022 data would overturn your results if that was just one fifth of the sample. I guess this could be because u-rates were really low in 2022, and so your data said that all these people weren't nepobabies. But I think it's worth digging into this a little bit to make sure you really believe the new result.

The next thing I'd try after that is to make a scatter plot with the following: unemployment rates on the x-axis, and the percent of nepobabies in each of those unemployment-rate groups on the y-axis (so what % of people are nepobabies when the unemployment rate is 5.0? What about 5.1? What about 5.2? etc.).

This new finding makes me think that this is something subtle going on with the composition of your sample -- that is, which ranges of unemployment rates you even see for the four years that you have GSS data. Understanding the underlying data better is probably the next step to deciding how to proceed.
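
The data behind the suggested scatter plot, the percent of nepobabies at each unemployment rate, could be assembled along these lines. This is a Python sketch with hypothetical records. The function name is made up, and the plotting itself is left to whatever tool is already in use.

```python
from collections import defaultdict


def nepobaby_share_by_rate(records):
    """Return (rate, percent nepobabies, group size) for each unemployment rate.

    records is an iterable of (unemployment_rate, is_nepobaby) pairs.
    """
    totals = defaultdict(int)
    nepobabies = defaultdict(int)
    for rate, is_nepobaby in records:
        totals[rate] += 1
        nepobabies[rate] += int(is_nepobaby)
    return [(rate, 100.0 * nepobabies[rate] / totals[rate], totals[rate])
            for rate in sorted(totals)]


# Hypothetical respondents: (unemployment rate at hire, nepobaby status).
sample = [(5.0, True), (5.0, False), (5.0, False), (5.0, False),
          (5.1, True), (5.1, True), (5.2, False)]
points = nepobaby_share_by_rate(sample)
print(points)  # [(5.0, 25.0, 4), (5.1, 100.0, 2), (5.2, 0.0, 1)]
```

Keeping the group size next to each percentage helps when reading the plot, since a rate observed for only one or two respondents can produce an extreme share.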

rpseely commented 3 months ago

Tuesday, March 19

Results Error Exploration

The observations from 2022 make up 698 of the 3,550 observations we originally analyzed (about 20%).
The unemployment rate data for 2022 interviewees are actually more right-skewed than the whole sample (adding non-nepobaby results to the higher unemployment rate groups). Below the average, however, the data are skewed much more towards the lower unemployment groups. For example, the value at the first quartile for the whole sample, not including 2022 data, is 4.6%. With just 2022 data, the first quartile unemployment rate is 3.8%. That appears to be why removing the 2022 data changes the result.
I created a couple graphs showing how the data differs with and without the 2022 data, showing the ratio between nepobabies and non-nepobabies over different unemployment rates. I then added best-fit lines, which is super helpful in showing why the results change with/without the 2022 data. These can be viewed below.
Then with the idea and some help from ChatGPT, I learned about beta regressions, which is a kind of regression where the dependent variable is a ratio (I first tried a regular regression, but then realized this did not make sense because the dependent variable is not continuous). Super cool! I ran a beta regression without the 2022 data and got a null result. I then ran a beta regression with the 2022 data and got a strong result/association.
Obviously this is fairly disappointing/annoying, but silver lining...learning new and cool ways to look at data and test associations!
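The quartile shift described above can be sketched as follows. The rates are invented for illustration (they are not the GSS/FRED values), and the "inclusive" quantile method is an assumption; Stata may use a different percentile definition, so cutoffs could differ slightly.

```python
import statistics

def first_quartile(rates):
    """First quartile (25th percentile) of a list of unemployment rates."""
    return statistics.quantiles(rates, n=4, method="inclusive")[0]

# Hypothetical unemployment rates at month of hire
rates_without_2022 = [4.4, 4.6, 4.6, 5.0, 5.4, 6.1, 7.2, 8.0]
rates_2022_only = [3.5, 3.6, 3.8, 3.9]

q1_without = first_quartile(rates_without_2022)
q1_all = first_quartile(rates_without_2022 + rates_2022_only)
print(f"Q1 without 2022: {q1_without:.2f}")
print(f"Q1 with 2022:    {q1_all:.2f}")
```

Because the quartile cutoffs define the four unemployment-rate groups, a lower first quartile moves observations between groups, which is one way that adding or dropping a survey year can change a chi-square result.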

rpseely commented 3 months ago

Tasks as of March 24, 2024

I want to organize what I still have to do at this point...

Sensitivity analysis

[x] Create different datasets from FRED, merge based on new ymhiredate variables, re-run chi-square analysis.
[x] Add sensitivity analysis into paper.
[x] Make sure to add detailed documentation on how I did the sensitivity analysis.

Gender dynamics

[x] Add in the new grouped bar chart
[x] Add a new piece of literature into the lit. review

T-Test vs. Chi-Square

[x] Delete the t-test portion of the analysis section
[x] Make sure that the demographic tests of the analysis section say chi-square and not t-test (started)
[x] Fix the chart as spoken of below between myself and Prof. Buzard

Fix up code

[x] Add in comments where necessary from newer work (and in correct results in the code)
[x] Get rid of all old work that is being replaced, i.e. incorrect t-tests
[x] Add code from the logs into the nepobabies.do file.

Data Section

[x] Final draft of hire month definition

References

[x] Put the references in a .bib file

Final Product

[x] Compare and correct differences between .tex file on github and the file on overleaf (especially in the data section)
[x] Ensure that the references in the .bib file compile in overleaf.

Correcting results error

[x] Change narrative beginning in introduction
[x] Look through literature on the economics of nepotism
[x] See if there is evidence for reliance on parents in general, and look for evidence for reliance on parents when times are tough.
[x] Add in those cool new statistical tests to show how awesome I am at finding fun new cool statistical tests
[x] Add in appropriate graphs.

I think going through the literature again is the most important part of telling the narrative, because then we could rely on some prior sources that make similar claims that we do (kids always rely on parents). And then of course making the results match up with the new analysis.
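For the sensitivity-analysis task in the list above, here is a rough sketch of merging respondents to a monthly unemployment series on a year-month key. It assumes the sensitivity analysis shifts the month assigned as the hire date; `ym_key`, `merge_rates`, the `shift` parameter, and all values are hypothetical, and only the variable names `ymhiredate` and `unemployrate` come from this thread.

```python
def ym_key(year, month, shift=0):
    """Year-month key such as 201901, optionally shifted by `shift` months."""
    index = year * 12 + (month - 1) + shift
    return (index // 12) * 100 + (index % 12) + 1

def merge_rates(respondents, series, shift=0):
    """Attach the unemployment rate for each respondent's (shifted) hire month."""
    merged = []
    for person in respondents:
        key = ym_key(person["year"], person["month"], shift)
        # series.get returns None when the month is missing from the series
        merged.append({**person, "ymhiredate": key, "unemployrate": series.get(key)})
    return merged

# Hypothetical monthly series and respondents, not real FRED or GSS data
series = {201811: 3.8, 201812: 3.9, 201901: 4.0}
respondents = [{"id": 1, "year": 2019, "month": 1},
               {"id": 2, "year": 2018, "month": 12}]

baseline = merge_rates(respondents, series)
one_month_earlier = merge_rates(respondents, series, shift=-1)
print(baseline)
print(one_month_earlier)
```

Re-running the same test on each merged variant then shows whether the conclusion depends on how the hire month is defined.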

rpseely commented 3 months ago

Sunday, March 24, 2024

I looked for additional articles that could be used to reframe our narrative for the project. I had little success but found one article, "Like father, like son: Occupational choice, intergenerational persistence and misallocation" by Salvatore Lo Bello & Iacopo Morchio, which offered some good material.
I am looking for more on specifically why children use their parents as resources and why they tend to mimic their parents.

rpseely commented 3 months ago

Monday & Tuesday, March 25 & 26, 2024

The Kramarz and Skans (2006) article that @kbuzard found on our zoom yesterday is incredibly helpful and just what I was looking for. I think there is definitely more I could find that would be helpful, but for now I am going to focus more so on making the changes. Add one to the win column!
@kbuzard Also, I assume I am good to cite nepotism literature from other fields, specifically social psychology? I believe I have a good article from a social psychologist that I would use to argue that desires to create nepotistic relationships (from the parent and from the child) are strong, such that differences in labor market competition are not strong enough to override that use of resources. I won't state it so strongly in the final draft, but that is the idea.

rpseely commented 3 months ago

Wednesday, March 27, 2024

Introduction Language

@kbuzard We make this statement in our introduction:

"This struggle for young professionals will incentivize them to use all of their resources to find a job. Therefore, we predict that there will be more graduates joining their parent’s occupations under these conditions."

My question is, is it appropriate to make this statement because it conflicts with our results? i.e. does it just add confusion, or is it notable that the results don't match our prediction.

Additionally, I updated the language in the introduction and abstract section to reflect that we can find no significant association.

Updating Data Section

Adding more explanation about why we chose to focus on those hired as young adults. I ran a logistic regression that showed with each year older someone was when hired, there was a 3.3 percent decrease in the likelihood that they would be a nepobaby. I added that in to the data section because I think it gives us a great rationale for why we focus on young adults because of our focus on nepotism.
Also made minor corrections to reflect the new narrative.
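A small worked example of how a per-year logistic regression effect compounds, assuming the 3.3 percent figure is an odds ratio of about 0.967 per additional year of age at hire (the thread does not say whether it is an odds ratio or a marginal effect, so this is an assumption).

```python
import math

# Assumed odds ratio per additional year of age at hire
odds_ratio_per_year = 1 - 0.033

# The logit coefficient is the natural log of the odds ratio
coefficient = math.log(odds_ratio_per_year)

def odds_multiplier(years):
    """Factor by which the odds of being a nepobaby change after `years` more years."""
    return math.exp(coefficient * years)

for years in (1, 5, 10):
    print(f"{years:>2} years older: odds multiplied by {odds_multiplier(years):.3f}")
```

Under this reading the effect is multiplicative, so ten additional years reduce the odds by roughly 28 percent rather than 33 percent.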

Results section

I updated the code in the nepobabies.do file to reflect the correct values of the quartiles without the 2022 data. I also added in the beta regression from a .log file where I explored the issues with the results.
I started updating the language in the results section to reflect a null result.

kbuzard commented 3 months ago

My question is, is it appropriate to make this statement because it conflicts with our results? i.e. does it just add confusion, or is it notable that the results don't match our prediction.

I think it's more honest to stick with your original hypothesis, and frankly, more interesting a story. I'd keep it!

kbuzard commented 3 months ago

@kbuzard Also, I assume I am good to cite nepotism literature from other fields, specifically social psychology? I believe I have a good article from a social psychologist that I would use to argue that desires to create nepotistic relationships (from the parent and from the child) are strong, such that differences in labor market competition are not strong enough to override that use of resources. I won't state it so strongly in the final draft, but that is the idea.

This is fine. These types of papers often have a few cites from related fields, just as many political economy papers have a few cites from political science journals.

rpseely commented 3 months ago

Sunday, March 31, 2024

Replicating beta regression

I fixed the way the beta regression was being performed and had some issues. I realized that the way I defined the nepobaby ratio was not really doing what I wanted it to, so I redefined it correctly (to be the ratio of nepobabies to non-nepobabies for each unemployment rate/hire month). I definitely want to come back to this, but for now I am just going to focus on re-writing the results section. I uploaded the code I used into the nepobabies.do file. The issue with replicating the beta regression, more specifically, was that I would get an error message saying there were no observations. This is obviously not true, as I checked it a bunch of times. I then ran a regular regression, but the sum of squares for the residual was much, much higher than that for the model (0.265 for the model, 61.307 for the residual). I am sure part of that was also using the regular regression and not the beta regression, and part of it is due to the fact that there is no significant association. Once again, I want to come back to this at some point.

I decided to come back to it and try again! I think it definitely worked, but with the newly, and I believe correctly, defined nepobaby_ratio variable, I got a significant result. However, the significant result is the opposite of what our prediction is (that with lower unemployment rates there are higher nepobaby ratios). Hmmm, definitely going to have to think about this more. Not sure what to do with this for now.

Here is the output:

@kbuzard I think that the beta regression is a cool idea, but I don't want to keep going with it if it's not really supporting me in my end goal. But I also don't want to ignore it if we think it's notable. I am also happy to meet if we want to go in depth. I really want to create a valid, replicable, accurate piece of research, but I don't want to spend time that might be more fruitful doing data entry. I'm also a little out of my depth with the regression analysis, but I do think that might give a more accurate result than a chi-square analysis that cuts the unemployment rates into only four groups (just from my understanding of regression being a more powerful tool for data analysis). Let me know what you think when you get the chance. I will keep on with the data entry until we can come up with a plan on how to get this project done well.

kbuzard commented 3 months ago

I think that the beta regression is a cool idea, but I don't want to keep going with it if it's not really supporting me in my end goal. But I also don't want to ignore it if we think it's notable. I am also happy to meet if we want to go in depth. I really want to create a valid, replicable, accurate piece of research, but I don't want to spend time that might be more fruitful doing data entry. I'm also a little out of my depth with the regression analysis, but I do think that might give a more accurate result than a chi-square analysis that cuts the unemployment rates into only four groups (just from my understanding of regression being a more powerful tool for data analysis). Let me know what you think when you get the chance. I will keep on with the data entry until we can come up with a plan on how to get this project done well.

My problem is that I don't know anything about beta regression. Maybe there's a simpler way: if you do a one-independent-variable regression (with no constant), the coefficient you get should be the correlation coefficient. So maybe just a pwcorr (with ,sig option) would do what you're hoping for?

I also find it a little hard to think about $\frac{no. \ nepobabies}{no. \ non-nepobabies}$; $\frac{no. \ nepobabies}{total \ workers}$ is more the way I'm used to seeing such ratios. It shouldn't affect the significance, but it has a more natural interpretation.
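To make the two definitions concrete, here is a short sketch comparing them for a single unemployment-rate cell. The function names and counts are hypothetical.

```python
def ratio_to_non_nepobabies(nepobabies, non_nepobabies):
    """nepobabies / non-nepobabies; undefined (None) when there are no non-nepobabies."""
    return nepobabies / non_nepobabies if non_nepobabies else None

def share_of_total(nepobabies, non_nepobabies):
    """nepobabies / total workers; always between 0 and 1 when the cell is non-empty."""
    total = nepobabies + non_nepobabies
    return nepobabies / total if total else None

# Hypothetical cells: (nepobabies, non-nepobabies) at a given unemployment rate
cells = [(1, 9), (3, 3), (2, 0)]
for nepo, non in cells:
    print(nepo, non, ratio_to_non_nepobabies(nepo, non), share_of_total(nepo, non))
```

The share of total stays bounded between 0 and 1 and is still defined for a cell with no non-nepobabies, whereas the ratio to non-nepobabies is unbounded and breaks in that case.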

rpseely commented 3 months ago

Tuesday, April 2, 2024

Data Analysis (Issues Resolved!)

Fixing the ratio did the trick! Beforehand, when I would tabulate the ratio there would be plenty of values that didn't make sense. Now, they seem more like naturally occurring ratios and there are fewer values that have no observation for one side of the ratio or were essentially undefined.
I ran the correlation, as suggested, and there is a statistically significant correlation that is very weak (coefficient about 0.092). I think this will work well with the narrative.
I re-ran the beta regression with the correct ratio and the significant result disappeared; now the p-value is about 0.16, which certainly makes more sense conceptually. Additionally, I spoke to my stats professor after class, who gave me two resources to look at on how to interpret the results and write about them. I will look through those and then determine whether or not I feel confident in putting those results in there.
I also uploaded a graph of the correlation with the correct variables, but it is not ready to be put into the paper.
I spent some time looking at how we could do the chi-square analysis with more naturally occurring groups instead of hard cutoff quartiles, but was not successful within 30 minutes. I decided it is more important to just get back into rewriting and getting the final report ready to be seen by others.
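A sketch of why a correlation can be statistically significant and still very weak. It computes Pearson's r by hand and an approximate two-sided p-value; the p-value uses a normal approximation to the t statistic, which is an assumption that is only reasonable for large samples (Stata's pwcorr with the sig option uses the t distribution). The sample sizes below are hypothetical, since the thread does not state n.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Approximate two-sided p-value for H0: no correlation (large-n normal approximation)."""
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return math.erfc(abs(t) / math.sqrt(2))

# Same weak correlation, two hypothetical sample sizes
for n in (100, 1000):
    print(n, round(approx_p_value(0.0979, n), 4))
```

With a correlation near 0.1, the squared correlation is about 0.01, so roughly 1 percent of the variation is shared even when the test rejects the null in a large sample.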

rpseely commented 2 months ago

Monday, April 8, 2024

Analysis Final Update (fingers crossed)

I did some looking into the beta regression and linear regression in general, and I think it would be best to leave those kinds of regression out of the results. Based on some statistical tests, run with the help of ChatGPT and Stack Exchange, the characteristics of the data seem to significantly violate multiple assumptions of a regression test. Obviously, that is a big red flag! So I will be dropping those tests from the results and the do-file.
The only regression I will be using is a logistic regression between the nepobaby variable and the unemployrate variable. This model has a p-value of 0.408, and I feel much more comfortable writing about it than I would the linear regression and especially the beta regression. Additionally, I checked the logistic regression model and it held up to the assumptions of the model. Wonderful! I had to rewrite this code because of an issue with the remote desktop service, so that took me a little extra time.
I also produced a graphic from the model that shows the relationship between nepobaby probability and unemployment rate and uploaded it to GitHub, and I definitely think it should go in the final report.
Now, I feel very confident in the results being in a great spot to just focus on writing.
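For readers unfamiliar with what a one-predictor logistic regression estimates, here is a minimal sketch that fits one by Newton-Raphson. It is not the code in nepobabies.do (which uses Stata); the function name and the data are made up purely to exercise the fit, and the made-up data deliberately contain an association, unlike the null result reported above.

```python
import math

def fit_logit(xs, ys, iterations=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton-Raphson; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            # Gradient of the log-likelihood
            g0 += y - p
            g1 += (y - p) * x
            # Negative Hessian (information matrix)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system with the explicit inverse
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: unemployment rate at hire and a 0/1 nepobaby flag
rates = [4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8]
flags = [1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0]
b0, b1 = fit_logit(rates, flags)
print(f"intercept={b0:.4f} slope={b1:.4f} odds ratio per point={math.exp(b1):.3f}")
```

The fitted slope is on the log-odds scale; exponentiating it gives the odds ratio per one-point change in the unemployment rate, which is usually the easier quantity to write about.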

rpseely commented 2 months ago

Tuesday, April 16, 2024

Do-File

The do-file should be all set in terms of all the code running correctly. It might need a few extra comments, but one should be able to run the do-file in one go (if they correct the file paths). Hooray!
I also uploaded a log of all the code into the code section on GitHub and it is titled nepobabies.log

rpseely commented 2 months ago

Wednesday, April 17, 2024

Results Additions

I added the results for the logistic regression and the correlation after doing some research on how to best talk about those models/tests as they were not something I was familiar with writing about.
The only writing left for the results section should be the sensitivity analysis. Then adding in the graphs and comments about the graphs, but that will be an overleaf thing.

rpseely commented 2 months ago

Tuesday, April 24, 2024

Writing Updates

I edited the conclusion to reflect the new result.
I edited the discussion to reflect the new result.
I uploaded new graphs, after working through some issues in Stata to produce them, that will be incorporated into the paper.
I edited the literature review to include the new literature we use to explain why there may not be great variation in rates of nepotism across the business cycle. I then worked with some issues in overleaf/the .bib file to include the new citations.
I made the above edits to the overleaf file, except for the bar graphs. The bar graph that pairs with the chi-square analysis does not match up exactly with the table I have created, so I need to double check that.

rpseely commented 1 month ago

Saturday-Monday, May 13

Writing Updates

I added the sensitivity analysis into the paper. I placed it after the original chi-square analysis and briefly explained the reason why a sensitivity analysis was appropriate, referencing the issues mentioned in the data section.
I added the corrected graph for the nepotism ratios, fixing the issue mentioned in my last post. The problem was that the ratios in the graph used this definition for the ratio: $\frac{no. \ nepobabies}{no. \ non-nepobabies}$, whereas the way I was calculating it was using this ratio: $\frac{no. \ nepobabies}{total \ workers}$. The ratios in the graph now reflect the results from the latter ratio, which is also the same way I created the nepobaby_ratio variable that is used in the correlation and the logistic regression.
I think this addresses the remaining issues with the paper so I am going to read through it a couple times now.

rpseely commented 1 month ago

Sunday May 19

Writing Updates

I went through the notes on my iPad and made some corrections, some additions, and some deletions to the .tex file.
There is only one more issue of substance (that I know of!) and that is a significant result from the chi-square analysis on education and nepotism status. I have to go through that and see what the results are because, just from the results (the table) I cannot determine what is significant about the association because there are 20 categories.

rpseely commented 1 month ago

Monday, May 20

Last substantive issue (hopefully) resolved!

The significant result from the chi-square analysis on the association between years of education and nepotism status, p = 0.099, left me unsure of where to go because it was not immediately obvious in what direction the association was significant, i.e. which group was more likely to be more highly educated, where they differed, etc.
Today I used some graphical representations to examine where the differences were and it likewise was unclear where there was any significant difference between the two groups. Specifically, their histograms looked largely the same.
I then created a variable for non-nepobaby education and nepobaby education (in years) where the other nepotism status had missing values and performed an unpaired two-sample t-test. The p-value was p = 0.16 for difference > 0, so I decided that it was not worth reporting any significant association in the final report. I feel this is appropriate because the chi-square analysis was only significant at the α = 0.10 level and I have not used a significance level that lenient anywhere else in the analysis.
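A sketch of the statistic behind an unpaired two-sample t-test, assuming the pooled (equal-variance) version. The function name and the education values are hypothetical. The p-value itself comes from Student's t distribution, which the Python standard library does not provide, so only the statistic and degrees of freedom are computed here.

```python
import math
import statistics

def pooled_t(a, b):
    """Equal-variance two-sample t statistic and degrees of freedom for mean(a) - mean(b)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled variance weights each sample variance by its degrees of freedom
    pooled_var = ((na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)) / df
    standard_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error, df

# Hypothetical years of education for the two groups
nepobaby_educ = [12, 14, 16]
non_nepobaby_educ = [11, 13, 15]
t_stat, df = pooled_t(nepobaby_educ, non_nepobaby_educ)
print(f"t = {t_stat:.4f}, df = {df}")
```

A one-sided p-value of 0.16 and a chi-square p-value of 0.099 both exceed α = 0.05, so neither rejects the null at the conventional level.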

kbuzard commented 1 month ago

@rpseely I think you can report this null result. I wouldn't make a big deal of it, but a very short paragraph that says something like, "One might expect that the reliance on ..... would differ across ....."

Here I would just state clearly what test you ran and say that this bears a more detailed analysis (maybe it should be cut into different bins, maybe a correlation coefficient, etc.).

rpseely commented 1 month ago

Friday, May 24

Update and Recompile Overleaf .tex File

I reconciled differences between the .tex file on Overleaf and the .tex file on GitHub so that the Overleaf file is completely up to date. There are just a few spots where the one on GitHub is a little behind, particularly with the graphs, and I will fix that once I go over the file for the last time.
I made edits to the graphs using Stata and removed the titles from the Stata graphics as they were redundant with the titles that followed "Figure X:" and uploaded the corrected graphs to the Overleaf file. I also corrected the axis titles and legend for the scatterplot graph.
Next steps: one last read through!

rpseely commented 1 month ago

Sunday, May 26

Final Edits (up until results)

Today I went through the paper up until the results section. I made some minor changes here and there, and added a sentence in a few places to tie things together more neatly, and removed a few lines from the lit. review that were really not relevant to the project.
Thank you to my mom for giving it a read through and finding some errors!

rpseely commented 1 month ago

Tuesday, May 28

Uploaded Final Draft for Review

@kbuzard I have uploaded the final draft! I finished going through the draft to make edits and I have uploaded both a .tex file and compiled the report into a PDF as well and uploaded that.

kbuzard commented 1 month ago

@rpseely Great! I'll see if I can carve out time to read it over tomorrow! Thanks so much!

kbuzard commented 1 month ago

@rpseely I've just read through the report, and I think it's very, very close to the finish line. I've made some minor comments throughout the PDF (I just uploaded a copy with my initials attached). It really needs a pointer to your reproducibility package, either in a data appendix or in the data or results section. And there are some inconsistencies in the story (in various places you say very different things about how much support you find for/against the null hypothesis). I think my comments on the draft can be addressed in an hour or less. Another hour or two shaping up the reproducibility package would also be great (see notes on draft).

rpseely commented 1 month ago

@kbuzard I am not sure what happened but for some reason I uploaded the updated .tex file and a .pdf file that was out of date, even though I thought I got them both from overleaf together? In the updated file, I have those inconsistencies addressed, as well as other minor changes but nothing else that is substantively different, so the rest of the comments are certainly applicable.

Also, I ran some quick tests on whether the low unemployment rate group was significantly different from the three higher groups and whether it was significantly different from just the higher group and there was not enough evidence to reject H₀ for either test (see no.diff_highlow.urate.log).
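A sketch of a 2x2 chi-square test of the kind described here (low unemployment group versus the rest, by nepobaby status). The counts are invented and are not the values in no.diff_highlow.urate.log. With one degree of freedom the p-value has a closed form through the complementary error function, so no statistics library is needed; no continuity correction is applied.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]]; returns (statistic, p_value), 1 df."""
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 df, P(chi2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: rows = low vs. higher unemployment, columns = nepobaby vs. not
stat, p = chi_square_2x2(20, 80, 10, 90)
print(f"chi2 = {stat:.3f}, p = {p:.4f}")
```

Reporting the exact p-value alongside the chosen significance level lets a reader see how close a non-significant result is to the cutoff.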

kbuzard commented 4 weeks ago

@kbuzard I am not sure what happened but for some reason I uploaded the updated .tex file and a .pdf file that was out of date, even though I thought I got them both from overleaf together? In the updated file, I have those inconsistencies addressed, as well as other minor changes but nothing else that is substantively different, so the rest of the comments are certainly applicable.

@rpseely Maybe we could schedule a Zoom once the project is wrapped and you could help me brainstorm about ways to make the workflow easier for this coming fall's version of 310? I'm considering paying for a premium Overleaf account so that people can use the integration with Github, but I've never tried it so I don't know if it will be easier or harder than what we did this fall.

Also, I ran some quick tests on whether the low unemployment rate group was significantly different from the three higher groups and whether it was significantly different from just the higher group and there was not enough evidence to reject H0 for either test (see no.diff_highlow.urate.log).

I suggest adding a sentence or two about these new tests, giving the p-values and saying that they are not significant at conventional levels but are close.

rpseely commented 4 weeks ago

Monday, June 3, 2024

Changes made after comments

I changed the way that I wrote about the nepobaby and parent gender relationships and created a new bar graph to reflect this language and more accurately reflect the content of the data I have.
I added in language about the two new chi-square tests that show no significant association between nepotism status and unemployment level for bottom 25% to top 75% and bottom 25% to top 25% (at conventional levels).
I fixed various parts where there were minor errors/changes in language. Importantly, I corrected the correlation coefficient from 0.979 to 0.0979.
Confusing phrasing in lit review...

From Annotated PDF:

Women, who are described as risk-averse by Hellerstein and Morrill (2011) may follow into their parents professions, specifically their father’s occupation to prevent the risk of being jobless after their graduation. While we do not present evidence on this trend, our finding that approximately ten percent of all young adult workers surveyed are defined as nepobabies demonstrates that a large portion of the workforce utilizes their familial relationships to establish careers and does provide evidence of this potentially risk-adverse behavior.

I have deleted the bolded and italicized portion. I was intending to reference the growing trend of daughters working in their fathers' professions (from Hellerstein and Morrill), but a) this does not document a theory and b) we do not analyze the changes in parent-child nepotism over time.

Changes made before comments (not part of annotated pdf)

The abstract was largely rewritten after the second sentence.
Paragraph 4 of results: There were a number of errors/confusing phrasing that were commented on that I had altered. This is also where I put the writing on the new chi-square tests.

Question

@kbuzard There is a portion of writing in the data section that addresses how we determined to subset the data in terms of age, specifically it includes notes on a logistic regression and chi-square analysis. My thought in putting that in the data section was to explain why I chose to only focus on the data from observations in which the respondent was younger than 30. Here is my idea: copy and paste the section you highlighted somewhere into the beginning of the results section. Then, add a sentence in the data section where that text was and say something like "we choose to focus on adults younger than 30 and you can read all about why in \ref{sec:result}" so that one can jump to the explanation of why we made that cutoff. Would that be more appropriate?

Next Steps

I believe all of the comments have been addressed so now I will focus on the reproducibility package.

rpseely commented 4 weeks ago

@rpseely Maybe we could schedule a Zoom once the project is wrapped and you could help me brainstorm about ways to make the workflow easier for this coming fall's version of 310? I'm considering paying for a premium Overleaf account so that people can use the integration with Github, but I've never tried it so I don't know if it will be easier or harder than what we did this fall.

@kbuzard I would be happy to!
