Closed iciarfernandez closed 4 years ago
@sciclic it does not look like you have a correlation between those things. Can you run a pairwise comparison for your dataset to see which variables may be correlated with each other? Something similar to this
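One rough way to do a pairwise comparison on scale-type answers is a rank correlation matrix. The sketch below uses made-up toy data (the column names and values are placeholders, not the real survey) and base R's `cor()` with Spearman's method, which is usually a safer choice than Pearson for ordinal 1-5 answers.

```r
# Toy stand-in for the survey: three 1-5 scale columns with made-up values.
set.seed(1)
n <- 200
toy <- data.frame(
  work_life_balance       = sample(1:5, n, replace = TRUE),
  supervisor_relationship = sample(1:5, n, replace = TRUE)
)
# Build a third column that loosely follows the first one, clamped to 1-5.
toy$satisfaction <- pmin(5, pmax(1, toy$work_life_balance + sample(-1:1, n, replace = TRUE)))

# Spearman rank correlation between every pair of columns.
cor_matrix <- cor(toy, method = "spearman", use = "pairwise.complete.obs")
round(cor_matrix, 2)
```

Values near 0 suggest little monotonic association between a pair of variables; values near 1 or -1 suggest a strong one.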
Okay let me give that a try and I'll get back to you. Thanks @yuliaUU!
I tried running the ggpairs function ONLY on non-categorical variables, i.e., those answered on a scale, or dummy variables that I created from yes/no answers, and I am still having a really hard time reading the output.
library(dplyr) # provides %>%, mutate() and select()

survey_data <- readr::read_csv(here::here("Data", "survey_data.csv"))
# Note: inside ifelse(), as.factor(x) is reduced to its integer level codes.
# With alphabetical levels, "No" becomes "1" and "Prefer not to say" becomes "2";
# "Yes" is set to "0" explicitly. This relies on those being the only answers.
survey_data <- survey_data %>%
  mutate(studying_in_your_home_country = ifelse(as.character(studying_in_your_home_country) == "Yes", "0", as.factor(studying_in_your_home_country))) %>%
  # 0 is yes and 1 is no
  mutate(have_you_sought_help_for_anxiety_or_depression_caused_by_your_PhD = ifelse(as.character(have_you_sought_help_for_anxiety_or_depression_caused_by_your_PhD) == "Yes", "0", as.factor(have_you_sought_help_for_anxiety_or_depression_caused_by_your_PhD))) %>%
  # 0 is yes, 1 is no and 2 is prefer not to say
  select(-X1) %>% # drop this column, prob was an error from cleaning
  mutate(experienced_discrimination_or_harrasment = ifelse(as.character(experienced_discrimination_or_harrasment) == "Yes", "0", as.factor(experienced_discrimination_or_harrasment))) %>%
  mutate(experienced_bullying_in_PhD = ifelse(as.character(experienced_bullying_in_PhD) == "Yes", "0", as.factor(experienced_bullying_in_PhD)))
data_to_plot <- survey_data %>%
  select(c(studying_in_your_home_country,
           level_of_satisfaction_with_decision_to_pursue_a_PhD,
           supervisor_relationship,
           work_life_balance,
           university_supports_work_life_balance,
           university_has_long_hours_culture,
           have_you_sought_help_for_anxiety_or_depression_caused_by_your_PhD,
           mental_health_and_wellbeing_services_at_my_uni_are_appropriate_to_PhD_students_needs,
           supervisor_awareness_of_mental_health_services,
           university_offers_adequate_one_to_one_mental_health_support,
           university_offers_a_variety_of_support_resources,
           experienced_discrimination_or_harrasment,
           experienced_bullying_in_PhD))
GGally::ggpairs(data_to_plot, aes(colour = studying_in_your_home_country, alpha = 0.4))
I don't know if I need to modify the code in any way to make it easier to read?
Okay, I tried using ggpairs with only 4 variables at a time and also adjusting the font size so that it is more easily readable and this is the result. I'm still not sure of how to interpret the plot; can I run linear regression on any of these variables?
test1 <- survey_data %>%
  select(c(studying_in_your_home_country,
           level_of_satisfaction_with_decision_to_pursue_a_PhD,
           supervisor_relationship,
           work_life_balance,
           university_has_long_hours_culture)) %>%
  # dplyr::rename() expects new_name = old_name
  dplyr::rename(studyinghomecountry = studying_in_your_home_country,
                PhDsatisfaction = level_of_satisfaction_with_decision_to_pursue_a_PhD,
                supervisorrelationship = supervisor_relationship,
                workvslife = work_life_balance,
                longhours = university_has_long_hours_culture)
# params = list(corSize = ...) is not an aesthetic; the correlation text size is set via wrap()
GGally::ggpairs(test1,
                mapping = aes(colour = studyinghomecountry, alpha = 0.05),
                upper = list(continuous = GGally::wrap("cor", size = 1))) # increase size for larger text
P.S. I was thinking that, in case I cannot run linear regression, would it be possible to plot my data in another way instead? For example, I could make two bar graphs: (1) one where the x axis is the long hours culture variable, the y axis is the number of people, and the bars are colored by the yes/no/prefer not to say answer to the "studying outside of your home country" question, and (2) one where the x axis is having sought help for anxiety and/or depression and the rest is the same.
I don't know. Please let me know what you think, as I realize one of the milestone requirements is running linear regression...
@yuliaUU @firasm - Sorry, I forgot to tag you! Not sure if you get email notifications otherwise.
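For reference, the first bar graph described above could be sketched roughly like this. The data frame here is made-up toy data, and the column names `longhours` and `home_country` are placeholders for the real survey columns; it assumes ggplot2 is installed.

```r
library(ggplot2)

# Made-up toy data standing in for the survey answers.
set.seed(42)
toy <- data.frame(
  longhours    = factor(sample(1:5, 300, replace = TRUE)),
  home_country = sample(c("Yes", "No", "Prefer not to say"), 300,
                        replace = TRUE, prob = c(0.6, 0.35, 0.05))
)

# geom_bar() counts respondents per x value; position = "dodge" puts the
# bars for each answer side by side instead of stacking them.
p <- ggplot(toy, aes(x = longhours, fill = home_country)) +
  geom_bar(position = "dodge") +
  labs(x = "University has a long hours culture (1 = strongly disagree, 5 = strongly agree)",
       y = "Number of respondents",
       fill = "Studying in home country")
p
```

Swapping `position = "dodge"` for `position = "fill"` would show proportions rather than counts, which can be easier to compare when the groups differ a lot in size.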
Hi @sciclic, what you would need to do for your analysis is a logistic regression. It's slightly more complex, but conceptually fairly easy to understand.
Here's a quick tutorial on how to do this. I tried to find one where your data closely matches.
Let me know if that helps!
And here is a slightly more detailed description of what exactly logistic regression is.
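As a minimal sketch of the idea, here is a logistic regression on made-up toy data where the outcome is a 0/1 variable and the predictor is a 1-5 scale. The variable names `sought_help` and `longhours`, and the simulated relationship between them, are assumptions for illustration only.

```r
# Simulate toy data: a 1-5 predictor and a binary (0/1) outcome.
set.seed(123)
n <- 500
longhours <- sample(1:5, n, replace = TRUE)
# Assumed relationship: the log-odds of the outcome rise with the predictor.
sought_help <- rbinom(n, size = 1, prob = plogis(-2 + 0.4 * longhours))
toy <- data.frame(longhours, sought_help)

# family = binomial makes glm() fit a logistic regression.
fit <- glm(sought_help ~ longhours, data = toy, family = binomial)
summary(fit)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios.
exp(coef(fit))
```

The key difference from `lm()` is that the response must be binary (0/1, TRUE/FALSE, or a two-level factor), and the fitted values are probabilities between 0 and 1.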
Thank you @firasm !!! I think I've seen logistic regression before and I understand the explanation. When I run the glm() function, the summary stats make sense (I think) but my plot is still looking the same with the 'glm' method. Does that just mean there is no correlation between the variables I am plotting?
survey_hypothesis <- glm(as.factor(university_has_long_hours_culture) ~ as.factor(level_of_satisfaction_with_decision_to_pursue_a_PhD), data = survey_data, family = binomial)
summary(survey_hypothesis)
ggplot(survey_data, aes(university_has_long_hours_culture, level_of_satisfaction_with_decision_to_pursue_a_PhD)) +
  geom_point() +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE)
Call:
glm(formula = as.factor(university_has_long_hours_culture) ~
    as.factor(level_of_satisfaction_with_decision_to_pursue_a_PhD),
    family = binomial, data = survey_data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3000   0.4279   0.4287   0.5346   0.5553

Coefficients:
                                                                Estimate Std. Error z value Pr(>|z|)
(Intercept)                                                      1.79176    0.14183  12.633  < 2e-16 ***
as.factor(level_of_satisfaction_with_decision_to_pursue_a_PhD)2  0.77958    0.20625   3.780 0.000157 ***
as.factor(level_of_satisfaction_with_decision_to_pursue_a_PhD)3  0.55325    0.19591   2.824 0.004742 **
as.factor(level_of_satisfaction_with_decision_to_pursue_a_PhD)4  0.54917    0.15842   3.466 0.000527 ***
as.factor(level_of_satisfaction_with_decision_to_pursue_a_PhD)5  0.08142    0.15345   0.531 0.595731
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4602.2  on 6796  degrees of freedom
Residual deviance: 4556.9  on 6792  degrees of freedom
  (15 observations deleted due to missingness)
AIC: 4566.9

Number of Fisher Scoring iterations: 5
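A note on reading this output: the estimates are log-odds, with satisfaction level 1 as the reference category. Also, when the response passed to a binomial `glm()` is a factor with more than two levels, R treats the first level as "failure" and every other level as "success", so this model is effectively predicting "long hours answer is not the lowest level". The sketch below simply copies the estimates printed above (the short names are labels added here) and converts them to odds ratios.

```r
# Estimates copied from the summary output above (log-odds scale).
estimates <- c(intercept = 1.79176,
               level2    = 0.77958,
               level3    = 0.55325,
               level4    = 0.54917,
               level5    = 0.08142)

# exp() turns log-odds into odds (intercept) and odds ratios (other terms).
odds_ratios <- exp(estimates)
round(odds_ratios, 2)

# plogis() turns the intercept's log-odds into a probability for the
# reference group (satisfaction level 1).
baseline_prob <- plogis(estimates[["intercept"]])
round(baseline_prob, 3)
```

The intercept corresponds to odds of roughly 6 to 1 for the reference group. An odds ratio above 1 for a level means higher odds of "success" than in the reference group.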
@firasm, I think I figured it out following the exercise at the bottom with the mental_health dataset. Thank you! So to be clear, is it okay to run this type of analysis in place of linear regression as the requirement for milestone 3?
Yes! The requirement is only to run an analysis (minimum complexity: linear regression) and "report" on the results of your analysis in your final document using inline like Hayley showed on Thursday. Logistic regression is perfectly fine as well.
Re: your plot, I think you have a case of over-plotting. A whole bunch of points are being plotted on top of each other. You should consider jitter (both in x and y) as well as an alpha transparency to reduce that visual effect. Also, please do update your x and y axis labels so they're a bit more informative and cleaner.
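A rough sketch of the jitter and transparency suggestion, using made-up toy data with two 1-5 scale columns (the column names and the jitter/alpha values are placeholders to tune, and ggplot2 is assumed to be installed):

```r
library(ggplot2)

# Made-up toy data: two discrete 1-5 scales, so many points share coordinates.
set.seed(7)
toy <- data.frame(
  longhours    = sample(1:5, 500, replace = TRUE),
  satisfaction = sample(1:5, 500, replace = TRUE)
)

# geom_jitter() nudges each point by a small random amount in x and y;
# alpha makes overlapping points show up as darker regions.
p <- ggplot(toy, aes(x = longhours, y = satisfaction)) +
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.3) +
  labs(x = "University has a long hours culture (1-5)",
       y = "Satisfaction with decision to pursue a PhD (1-5)")
p
```

Keeping `width` and `height` well below 0.5 stops points from drifting into the neighbouring category.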
Hi @yuliaUU @firasm,
I'm getting quite frustrated with finding variables that make sense to run linear regression on for our chosen dataset; I realize that this is a bit of an open question, but essentially the issue is that the majority of our variables are categorical. I have converted those that have yes/no answers into dummy variables (for example, "Do you study outside of your home country?", with answers Yes/No, became 0/1, where 0 = yes and 1 = no), and we have also changed some of those where answers were on a scale from "Strongly agree" to "Strongly disagree" into numbers from 1-5 or from 1-7, depending on the case.
However, when trying to run linear regression on any of these, my plots look awful and I'm not sure if there is an alternative to linear regression that we can use to analyse these variables, or whether I am doing something wrong. I was initially interested in investigating the relationship between a university having a long hours culture (scale answers from 1 - strongly disagree to 5 - strongly agree) and students having sought help for anxiety and depression caused by their PhD (dummy variable changed from yes/no/prefer not to say to 0/1/2), and coloring the plot by whether students are studying in their home country or not (yes/no). This all makes sense in my head, but I literally have no clue how to go about it. However, this is my code for that & I have attached what the plot looks like:
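For illustration, the kind of recoding described above can be written with an explicit lookup table, so the codes do not depend on how R happens to order the answers. The answer strings and the scale wording below are assumptions; the real survey labels may differ.

```r
# Made-up example answers, including a missing value.
answers <- c("Yes", "No", "Prefer not to say", "No", NA)

# Explicit lookup: 0 = yes, 1 = no, 2 = prefer not to say.
codes <- c("Yes" = 0L, "No" = 1L, "Prefer not to say" = 2L)
recoded <- unname(codes[answers])
recoded

# The same pattern works for an agreement scale (wording assumed here).
scale_levels <- c("Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree")
scale_answers <- c("Agree", "Strongly disagree", "Neutral")
scale_scores <- match(scale_answers, scale_levels)
scale_scores
```

Missing or unexpected answers come out as `NA` rather than being silently given a code.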
I thought it may look like that because of the dummy variable, but when I try plotting 2 numerical variables instead (2 variables that have a 1-5/1-7 scale), I still get this:
Apologies for the really long issue, I just feel really stuck with this and don't know how to move forward. Thank you so much for your help!!