Issue with dataset - Cannot find variables to run linear regression

iciarfernandez commented 4 years ago

Hi @yuliaUU @firasm,

I'm getting quite frustrated trying to find variables that make sense to run linear regression on in our chosen dataset. I realize that this is a bit of an open question, but essentially the issue is that the majority of our variables are categorical. I have converted those that have yes/no answers into dummy variables (for example, "Do you study outside of your home country?", with answers Yes/No, becomes 0/1, where 0 = yes and 1 = no), and we have also changed some of those where answers were on a scale of "Strongly agree" to "Strongly disagree" into numbers from 1-5 or from 1-7, depending on the case.
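For reference, the recoding described above could be sketched as follows. This is a minimal sketch on made-up data: the data frame `toy`, its column names, and the exact wording of the scale labels are assumptions for illustration, not the real survey columns.

```r
# Hypothetical toy data; names and labels are illustrative only.
toy <- data.frame(
  home_country = c("Yes", "No", "Yes", NA),
  long_hours   = c("Strongly agree", "Neither agree nor disagree",
                   "Strongly disagree", "Agree"),
  stringsAsFactors = FALSE
)

# Yes/No -> 0/1 with the coding used in this issue (0 = yes, 1 = no).
# Comparing strings directly does not depend on factor level order,
# and the result is numeric rather than character. NA stays NA.
toy$home_country_num <- ifelse(toy$home_country == "Yes", 0, 1)

# Scale answers -> integers 1-5 via a factor with explicitly ordered levels.
likert_levels <- c("Strongly disagree", "Disagree",
                   "Neither agree nor disagree", "Agree", "Strongly agree")
toy$long_hours_num <- as.integer(factor(toy$long_hours, levels = likert_levels))

print(toy)
```

Spelling out the level order keeps the 1-5 numbering independent of the alphabetical order R would otherwise use for factor levels.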

However, when I try to run linear regression on any of these, my plots look awful, and I'm not sure whether there is an alternative to linear regression that we can use to analyse these variables, or whether I am doing something wrong. I was initially interested in investigating the relationship between a university having a long hours culture (scale answers from 1 - strongly disagree to 5 - strongly agree) and students having sought help for anxiety and depression caused by their PhD (dummy variable changed from yes/no/prefer not to say to 0/1/2), and coloring the plot by whether students are studying in their home country or not (yes/no). This all makes sense in my head, but I have no clue how to go about it. Here is my code for that, and I have attached what the plot looks like:

  ##### Create Dummy Variables
  survey_data = data %>%
    mutate(studying_in_your_home_country = ifelse(as.character(studying_in_your_home_country) == "Yes", "0", as.factor(studying_in_your_home_country))) %>%
    # 0 is yes and 1 is no
    mutate(have_you_sought_help_for_anxiety_or_depression_caused_by_your_PhD = ifelse(as.character(have_you_sought_help_for_anxiety_or_depression_caused_by_your_PhD) == "Yes", "0", as.factor(have_you_sought_help_for_anxiety_or_depression_caused_by_your_PhD))) %>%
    # 0 is yes, 1 is no and 2 is prefer not to say
    select(-X1) # drop this column, prob was an error from cleaning

  ##### Model
  model <- lm(university_has_long_hours_culture ~ have_you_sought_help_for_anxiety_or_depression_caused_by_your_PhD, data = survey_data)
  saveRDS(model, here::here(glue::glue(path, "model.rds")))

  ##### Plot
  plot = survey_data %>%
    ggplot(aes(x = university_has_long_hours_culture,
               y = have_you_sought_help_for_anxiety_or_depression_caused_by_your_PhD)) +
    geom_smooth(method = "lm") +
    theme_bw() +
    geom_point() +
    xlab("University has long hours culture") +
    ylab("Have you sought help for anxiety or depression caused by your PhD?")

ggplot-1

I thought it may look like that because of the dummy variable, but when I try plotting 2 numerical variables instead (2 variables that have a 1-5/1-7 scale), I still get this:

survey_data %>%
  ggplot(aes(x = university_has_long_hours_culture,
             y = have_you_sought_help_for_anxiety_or_depression_caused_by_your_PhD,
             color = studying_in_your_home_country)) +
  geom_smooth(method = "lm") +
  theme_bw() +
  geom_point() +
  xlab("University has long hours culture") +
  ylab("Have you sought help for anxiety or depression caused by your PhD?")

ggplot-2

Apologies for the really long issue, I just feel really stuck with this and don't know how to move forward. Thank you so much for your help!!

yuliaUU commented 4 years ago

@sciclic it does not look like you have a correlation between those variables. Can you run a pairwise comparison on your dataset to see which variables may be correlated with each other? Something similar to this
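As a rough illustration of a pairwise comparison on scale-type answers, here is a minimal sketch using base R's `cor()` with Spearman's rank correlation. The data are simulated and the column names are hypothetical stand-ins for the survey items. Spearman is swapped in here as one reasonable choice for ordinal scales; it is not necessarily what the linked example uses.

```r
set.seed(1)
n <- 200

# Simulated 1-5 answers; hypothetical columns, not the real survey data.
long_hours   <- sample(1:5, n, replace = TRUE)
# work_life is constructed to fall as long_hours rises, plus some noise.
work_life    <- pmin(5, pmax(1, 6 - long_hours + sample(-1:1, n, replace = TRUE)))
satisfaction <- sample(1:5, n, replace = TRUE)
toy <- data.frame(long_hours, work_life, satisfaction)

# Spearman's rank correlation only uses the ordering of the answers.
# pairwise.complete.obs drops missing values pair by pair.
rho <- cor(toy, method = "spearman", use = "pairwise.complete.obs")
print(round(rho, 2))
```

Pairs with a correlation far from zero are the ones worth plotting and modelling first.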

iciarfernandez commented 4 years ago

Okay let me give that a try and I'll get back to you. Thanks @yuliaUU!

iciarfernandez commented 4 years ago

I tried running the ggpairs function ONLY on non-categorical variables, i.e. those answered either on a scale or with the dummy variables that I created from yes/no answers, and I am still having a really hard time reading the output.

survey_data <- readr::read_csv(here::here("Data", "survey_data.csv"))

survey_data <- survey_data %>%
  mutate(studying_in_your_home_country = ifelse(as.character(studying_in_your_home_country) == "Yes", "0", as.factor(studying_in_your_home_country))) %>%
  # 0 is yes and 1 is no
  mutate(have_you_sought_help_for_anxiety_or_depression_caused_by_your_PhD = ifelse(as.character(have_you_sought_help_for_anxiety_or_depression_caused_by_your_PhD) == "Yes", "0", as.factor(have_you_sought_help_for_anxiety_or_depression_caused_by_your_PhD))) %>%
  # 0 is yes, 1 is no and 2 is prefer not to say
  select(-X1) %>% # drop this column, prob was an error from cleaning
  mutate(experienced_discrimination_or_harrasment = ifelse(as.character(experienced_discrimination_or_harrasment) == "Yes", "0", as.factor(experienced_discrimination_or_harrasment))) %>%
  mutate(experienced_bullying_in_PhD = ifelse(as.character(experienced_bullying_in_PhD) == "Yes", "0", as.factor(experienced_bullying_in_PhD)))

data_to_plot <- survey_data %>% select(c(studying_in_your_home_country, 
                         level_of_satisfaction_with_decision_to_pursue_a_PhD, 
                         supervisor_relationship,
                         work_life_balance,
                         university_supports_work_life_balance,
                         university_has_long_hours_culture,
                         have_you_sought_help_for_anxiety_or_depression_caused_by_your_PhD,
                         mental_health_and_wellbeing_services_at_my_uni_are_appropriate_to_PhD_students_needs,
                         supervisor_awareness_of_mental_health_services,
                         university_offers_adequate_one_to_one_mental_health_support,
                         university_offers_a_variety_of_support_resources,
                         experienced_discrimination_or_harrasment,
                         experienced_bullying_in_PhD))

GGally::ggpairs(data_to_plot, aes(colour = studying_in_your_home_country, alpha = 0.4))

I don't know if I need to modify the code in any way to make it easier to read?

ggpairs

iciarfernandez commented 4 years ago

Okay, I tried using ggpairs with only 4 variables at a time and also adjusting the font size so that it is more easily readable, and this is the result. I'm still not sure how to interpret the plot; can I run linear regression on any of these variables?

test1 <- survey_data %>% select(c(studying_in_your_home_country, 
                                  level_of_satisfaction_with_decision_to_pursue_a_PhD, 
                                  supervisor_relationship,
                                  work_life_balance,
                                  university_has_long_hours_culture)) %>% rename(c("studying_in_your_home_country" = "studyinghomecountry",
                                                                                 "level_of_satisfaction_with_decision_to_pursue_a_PhD" = "PhDsatisfaction",
                                                                                 "supervisor_relationship" = "supervisorrelationship",
                                                                                 "work_life_balance" = "workvslife",
                                                                                 "university_has_long_hours_culture" = "longhours"))

GGally::ggpairs(test1, aes(colour = studyinghomecountry, alpha = 0.05, params=list(corSize=1)))

ggpairs2

iciarfernandez commented 4 years ago

P.S. I was thinking that, in case I cannot run linear regression, would it be possible to plot my data in another way instead? For example, I could make two bar graphs: (1) one where the x axis is the long hours culture variable, the y axis is the number of people, and the bars are colored by the yes/no/prefer not to say answers to the "studying outside of your home country" question, and (2) one where the x axis is having sought help for anxiety and/or depression and the rest is the same.

I don't know. Please let me know what you think, as I realize one of the milestone requirements is running linear regression...
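The first bar graph described above could be sketched roughly like this. The data are simulated and the column names `long_hours` and `home_country` are hypothetical. The second graph would only swap the x variable.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for the survey; column names are hypothetical.
toy <- data.frame(
  long_hours   = factor(sample(1:5, 300, replace = TRUE)),
  home_country = sample(c("Yes", "No"), 300, replace = TRUE)
)

# Counts behind the plot: one row per scale value, one column per group.
counts <- table(toy$long_hours, toy$home_country)
print(counts)

# geom_bar() counts the rows itself; "dodge" puts the groups side by side.
p <- ggplot(toy, aes(x = long_hours, fill = home_country)) +
  geom_bar(position = "dodge") +
  labs(x = "University has a long hours culture (1-5)",
       y = "Number of respondents",
       fill = "Studying in home country")
print(p)
```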

iciarfernandez commented 4 years ago

@yuliaUU @firasm - Sorry, I forgot to tag you! Not sure if you get email notifications otherwise.

firasm commented 4 years ago

Hi @sciclic, what you would need to do for your analysis is a logistic regression. It's slightly more complex, but conceptually fairly easy to understand.

Here's a quick tutorial on how to do this. I tried to find one where your data closely matches.

Let me know if that helps!

firasm commented 4 years ago

And here is a slightly more detailed description of what exactly logistic regression is.
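For a quick feel of what this looks like in R, here is a minimal sketch of a logistic regression with a yes/no outcome and a 1-5 scale predictor. The data are simulated and the names `sought_help` and `long_hours` are hypothetical. The sketch assumes the outcome is coded 1 = yes and 0 = no, and that any "prefer not to say" answers have been dropped beforehand.

```r
set.seed(42)
n <- 500

# Simulated data; hypothetical columns, not the real survey.
long_hours <- sample(1:5, n, replace = TRUE)
# Assumed true model: each extra scale point raises the log-odds by 0.5.
p_help <- plogis(-2 + 0.5 * long_hours)
sought_help <- rbinom(n, size = 1, prob = p_help)  # 1 = yes, 0 = no
toy <- data.frame(long_hours, sought_help)

# family = binomial makes glm() fit a logistic regression.
fit <- glm(sought_help ~ long_hours, data = toy, family = binomial)
print(summary(fit)$coefficients)

# exp(coefficient) is the odds ratio per one-point increase on the scale.
print(exp(coef(fit)))

# Predicted probability of having sought help at each scale value.
new_data <- data.frame(long_hours = 1:5)
new_data$prob <- predict(fit, newdata = new_data, type = "response")
print(new_data)
```

The predicted probabilities are usually easier to report than the raw coefficients, which are on the log-odds scale.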

iciarfernandez commented 4 years ago

Thank you @firasm !!! I think I've seen logistic regression before and I understand the explanation. When I run the glm() function, the summary stats make sense (I think) but my plot is still looking the same with the 'glm' method. Does that just mean there is no correlation between the variables I am plotting?

My code

survey_hypothesis <- glm(as.factor(university_has_long_hours_culture) ~ as.factor(level_of_satisfaction_with_decision_to_pursue_a_PhD), data = survey_data, family = binomial)
summary(survey_hypothesis)

ggplot(survey_data, aes(university_has_long_hours_culture, level_of_satisfaction_with_decision_to_pursue_a_PhD)) +
  geom_point() +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE)

Output summary stats

Call:
glm(formula = as.factor(university_has_long_hours_culture) ~ 
    as.factor(level_of_satisfaction_with_decision_to_pursue_a_PhD), 
    family = binomial, data = survey_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3000   0.4279   0.4287   0.5346   0.5553  

Coefficients:
                                                                Estimate Std. Error z value Pr(>|z|)
(Intercept)                                                      1.79176    0.14183  12.633  < 2e-16 ***
as.factor(level_of_satisfaction_with_decision_to_pursue_a_PhD)2  0.77958    0.20625   3.780 0.000157 ***
as.factor(level_of_satisfaction_with_decision_to_pursue_a_PhD)3  0.55325    0.19591   2.824 0.004742 **
as.factor(level_of_satisfaction_with_decision_to_pursue_a_PhD)4  0.54917    0.15842   3.466 0.000527 ***
as.factor(level_of_satisfaction_with_decision_to_pursue_a_PhD)5  0.08142    0.15345   0.531 0.595731
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4602.2  on 6796  degrees of freedom
Residual deviance: 4556.9  on 6792  degrees of freedom
  (15 observations deleted due to missingness)
AIC: 4566.9

Number of Fisher Scoring iterations: 5

Output plot

glm-plot
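One thing worth knowing when reading the summary above: with `family = binomial`, a factor response is treated as "first level" versus "any other level". A five-level scale used as the response is therefore collapsed to two groups. The sketch below checks this on simulated data. The variable names are hypothetical stand-ins, and the numbers have nothing to do with the real survey.

```r
set.seed(7)
n <- 400

# Simulated 1-5 answers; hypothetical stand-ins for the two survey columns.
long_hours   <- sample(1:5, n, replace = TRUE)
satisfaction <- sample(1:5, n, replace = TRUE)

# Factor response with five levels, as in the call above.
fit_factor <- glm(as.factor(long_hours) ~ as.factor(satisfaction),
                  family = binomial)

# Explicit 0/1 response: 0 = first level (answer 1), 1 = any other answer.
not_first  <- as.integer(long_hours != 1)
fit_binary <- glm(not_first ~ as.factor(satisfaction), family = binomial)

# Compare the coefficients of the two fits.
same_fit <- all.equal(coef(fit_factor), coef(fit_binary))
print(same_fit)
```

If the two fits agree, the model is answering "did the respondent pick anything other than 1?", which may or may not be the intended question.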

iciarfernandez commented 4 years ago

@firasm, I think I figured it out following the exercise at the bottom with the mental_health dataset. Thank you! So to be clear, is it okay to run this type of analysis in place of linear regression as the requirement for milestone 3?

firasm commented 4 years ago

Yes! The requirement is only to run an analysis (minimum complexity: linear regression) and "report" on the results of your analysis in your final document using inline like Hayley showed on Thursday. Logistic regression is perfectly fine as well.

Re: your plot, I think you have a case of over-plotting. A whole bunch of points are being plotted on top of each other. You should consider jitter (both in x and y) as well as an alpha transparency to reduce that visual effect. Also, please do update your x and y axis labels so they're a bit more informative and cleaner.
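A minimal sketch of the jitter and alpha suggestion, on simulated data with hypothetical column names. The jitter widths and the alpha value are arbitrary starting points to tune by eye.

```r
library(ggplot2)

set.seed(3)
# Simulated 1-5 answers; hypothetical column names.
toy <- data.frame(
  long_hours   = sample(1:5, 1000, replace = TRUE),
  satisfaction = sample(1:5, 1000, replace = TRUE)
)

# 1000 rows but at most 25 distinct (x, y) positions, so points overlap.
n_positions <- nrow(unique(toy))
print(n_positions)

p <- ggplot(toy, aes(x = long_hours, y = satisfaction)) +
  # Jitter in both x and y; alpha makes dense areas appear darker.
  geom_jitter(width = 0.25, height = 0.25, alpha = 0.3) +
  labs(x = "University has a long hours culture (1 = strongly disagree, 5 = strongly agree)",
       y = "Satisfaction with decision to pursue a PhD (1-5)") +
  theme_bw()
print(p)
```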

STAT547-UBC-2019-20 / group05