ahurford / pandemic-COVID-zero

Script Files To Download and Model Travel-Related Cases To Newfoundland From Across Canada Provinces
0 stars 0 forks source link

To do #1

Open ahurford opened 2 years ago

ahurford commented 2 years ago

Hi @francisanokye - thanks for this. I cleaned up the data cleaning for the travel-related cases. There are some instructions in the middle of that "data cleaning" file - are you able to do something similar for active cases from the PHAC data file? Sorry, about my messy, messy coding - yikes! (getting less terrible, I think...) thank you!

ahurford commented 2 years ago

Hi @francisanokye - I am going to work on this today - NO NEED TO WORRY ABOUT THE COMMENT ABOVE (I'll do it) - I'll let you know what to do next after I get done my work today. Looks good what you've done.

francisanokye commented 2 years ago

@ahurford sorry I just saw the messages. I will be watching this space then. Thanks!

ahurford commented 2 years ago

@francisanokye no, no, worries - I'm just trying to figure out what to do when. The data cleaning file, I am mostly done with, but I noticed that one of the data files (the NL, NLCHI one) had some problems with it. All the cleaned data except for NL.travel.2 should be okay.

Do you want to try and see if you can get the model selection part done - if the variable and file names are now completely different - sorry! I feel like this is a lot cleaner now, though...

francisanokye commented 2 years ago

@ahurford ... sure! i just noticed the changes in the data cleaning file.

ahurford commented 2 years ago

@francisanokye the problem with your fits to the travel-related data is that you need to exponentiate: exp(on x ON+ab x AB+ns x NS)

When you fit glm() with poisson errors it takes the logarithm of the response variable, so if you want to plot the response variable you need to exponentiate the other side of the equation. I hope this makes sense!

francisanokye commented 2 years ago

@ahurford this makes sense, thank you for the explanation. My other challenge is that the active data set has 48 observations, but surprisingly it's appearing as 122 when loaded causing a mismatch with the travel data.

ahurford commented 2 years ago

@francisanokye the file travel.csv has both the travel-related cases and the active cases. The CCODWG data for travel-related cases ends on May 30, 2021, so for the file travel.csv, which is useful to do the glm(), I filtered the data so there was nothing after May 30, 2021 - but NS.active, ON.active, etc are columns in travel.csv, that you can use in the glm.

The file active.csv runs up to June 3, 2022 or so, which is useful if we want to make predictions with the fitted model.

Sorry - I should have explained that!

ahurford commented 2 years ago

Hi @francisanokye I'm doing a bit more work on the data cleaning now - should be finished in ~ 1 hour

francisanokye commented 2 years ago

@ahurford please the following are the columns in the travel.csv file: "NL_travel" , "NL_contacts", "NL_new_cases", "NS_travel" etc. NS.active, ON.active, etc are rather variables in the active.csv file. what I am doing now is to subset the dates from '2020-07-05' up to '2021-05-30' in the active.csv file so I can have access to the active cases to be of same number of weeks.

ahurford commented 2 years ago

@francisanokye that's already been done. Those columns already exist in travel.csv

ahurford commented 2 years ago

@francisanokye I am done with the data cleaning file now. If you look at the very bottom where the commands write.csv() are you will see a description of what the files are for.

What I remembered is that for the glm() it makes more sense to use 7-day rolling average of travel-related cases than it does to used weekly travel-related cases. In the file travel.csv you will will NL_av_7. That should be the response variable for the glm()'s and the explanatory variables are ON_active, QC_active, etc, which are found in that file too.

francisanokye commented 2 years ago

@ahurford thank you, I'm checking it out now.

ahurford commented 2 years ago

@francisanokye for Figure 1 the top panels: graphs of close contact, travel-related, all new cases, with time on the x-axis, for Atlantic Canada & the territories use travel.wk.csv because I think this will look better is aggregated by week.

For the lower panels in that figure (i.e. boxplots of % travel-related cases each day) you will need to make another dataset that is similar to travel.wk.csv (ie. including the territories and close contacts) but is aggregated by report_date not report_week. We can talk when you get up to this though.

ahurford commented 2 years ago

@francisanokye regarding the validation for the NL data - I think I have essentially done this. I thought what I was doing was working on Figure 3 - but it turns out, I think this will be good as the figure for validation for NL.

francisanokye commented 2 years ago

@ahurford sorry for getting back late, I am stuck on the models using the NL_av_7 as the response variable and have been trying to rectify it, but to no avail. The AIC from the models are Infinite values because the Poisson assumes that the response values are non-negative whole numbers, which are not so in the travel.csv data. The attached images are my output sfor the Figure down panel G and I, but the others do not seem to conform to what is show in the paper

Screen Shot 2022-06-16 at 10 19 54 AM Screen Shot 2022-06-16 at 10 20 21 AM

.

ahurford commented 2 years ago

Thanks for spotting that problem @francisanokye - you are absolutely right about the Poisson distribution and non-integer values. You should have the response variable as the number of travel-related cases, which is also in that same file travel.csv. I think it works like this, ie.

CCODWG = read.csv('~/Desktop/Work/Research/Research_Projects/2022/reopening/pandemic-COVID-zero/data/travel.csv')[,-1]

NL_travel = CCODWG$NL_travel ON2 = CCODWG$ON_active AB2 = CCODWG$AB_active NS2 = CCODWG$NS_active

mod = glm(NL_travel ~ 0+ON2+AB2+NS2, family = "poisson")

ahurford commented 2 years ago

The boxplot looks nice. One thing is that I have been trying to keep the color-to-province map consistent. Your colours in the contact per travel case are nice, but I do think it helps to always have each province the same color (even though this is painful to code).

If the others are not consistent - is it a problem with the original code? One of the post interesting panels I thought was total number of travel-related cases reported vs. median % (travel-related + contacts)/all cases.

ahurford commented 2 years ago

Some of the differences could be because I switched the data source for "new cases". It used to be CCODWG but I changed it to PHAC - this could mean the number of travel related cases exceeds all cases - if the data sources are not consistent

francisanokye commented 2 years ago

@ahurford alright, I will try them out with this feedback. Thank you.

francisanokye commented 2 years ago

@ahurford have updated the model selection file.

francisanokye commented 2 years ago

Hi Amy, please, my plots are struggling to conform to the ones depicted in the reopening paper. Here is a shot of what I am getting even after trying fits with different provinces as predictors

Screen Shot 2022-06-23 at 11 42 09 PM

.

ahurford commented 2 years ago

Hi @francisanokye - your work looks good.

  1. Why are the first few values for the fit for all the graphs 0? This seems strange.
  2. There are 5 explanatory variables, BC, ON, QC, AC, and PP, where PP is the prairie province (SK, MB, or AB) that fits best (you did this part correctly), and AC is NS, except for NS.travel, and then AC is NB. Therefore, for NL you would not have NB as an option.
  3. Look at the significance for the coefficients (something like summary(mod) or coef(mod)). If, for even the best model, none of the coefficients are significant at alpha = 0.05, then conclude that there is no relation and do not plot a best fit line. I think for PEI this is the case.
  4. I think we should continue building models and do 4 and 5 variable combinations. What you have done looks nice! Thanks for this.
ahurford commented 2 years ago

@francisanokye also, if there are any files that are no longer relevant, can we try to delete these? Thanks.

francisanokye commented 2 years ago
Screen Shot 2022-06-27 at 11 08 27 AM

Hi @ahurford this is the new plot I am getting after using the best combinations from the permutations. The Model_Permutaions.R uses a python script (permute_variables.py: this file must be in the same directory as the R file and then import the Reticulate library in R to be able to run the python script in R) to generate all the permutations and then fit the models.

The significant level of 0.05 is used to output only the variables that are considered significant as advised.

francisanokye commented 2 years ago
Screen Shot 2022-06-27 at 11 41 24 AM

Using the patchwork library isn't working for me either when it comes to putting all the plots on the same figure. I have tried plot_grid as well but the layout isn't good enough.

francisanokye commented 2 years ago
Screen Shot 2022-06-27 at 11 44 56 AM

Regarding this image (Bottom right: A figure of validation of the CCOWDG data against the Bilal data). I don't know if this plot answers the image request.

ahurford commented 2 years ago

That looks good @francisanokye thank you! Since patchwork works for me, I will put that in the code to get it working. Goodness me, that NB travel data is periodic with period one month! If you are able to work on Figure 1 now, that would be super - but if you are tired please feel free to take a break!

ahurford commented 2 years ago

Hi @francisanokye -

I don't know if maybe my updates aren't running correctly...

I ran Model_Permutations.R, but I thought I get this:

Screen Shot 2022-06-27 at 9 31 49 PM

I thought we were just doing the best Prairie province, but this model has MB, AB, and SK? I thought we were just doing whichever of those was the best predictor for the single variable model. Also, I think I forgot to mention this, but the estimated coefficients should be positive - does this approach work: https://www.rdocumentation.org/packages/zetadiv/versions/1.2.0/topics/glm.cons

ahurford commented 2 years ago

Also when run for NL it considers both NS and NB? I though we were just doing NS for all province except NS for which we were having NB as the explanatory predictor variable?

ahurford commented 2 years ago

This is what I thought we were doing:

Screen Shot 2022-06-27 at 10 07 42 PM

Sorry if it is confusing...

francisanokye commented 2 years ago

@ahurford sorry I included the other provinces after the permutation. I have reverted the code to our initial methodology and it works fine with the permutations. Trying out the constraints on the estimates and will update soon.

ahurford commented 2 years ago

@francisanokye I made a bad commit - I tried to revert it, but I'm not sure if I have a bad model_permutations.R file still there - whatever you are doing now (getting the coefficients to be positive) - just ignore the version of model_permutations.R on the main branch and commit yours over top. Sorry!

francisanokye commented 2 years ago

@ahurford I am working on getting the coefficients to be positive. However, it appears all the other provinces will have zero (0) significant estimates except for NL which has NS significant and positive estimate. In that case, all the model fits will just be a constant line. On the file, I renamed it to model_permutations.R with all the permutation models to capture the exclusions as advised. I will upload again shortly with the constraints on the estimates as well.

ahurford commented 2 years ago

@francisanokye sounds good - can you just select the best model with AIC, and print out the p-values on the coefficients for the best model (we are still doing all the 1-variable, 2-variable, all the way up to 5-variable models, right - and choosing the best (out of all those models) with AIC)? Thanks for you work - it is super!

francisanokye commented 2 years ago

@ahurford after using the glm.cons model throughout the entire modelling task, the significant variables with respect to each Atlantic province is indicated as follows (Atlantic province in bold and significant variable as row name for coefficients): NL: Estimate Std. Error z value Pr(>|z|) NS 0.1048925 0.006975344 15.03761 4.163401e-51

NS: Estimate Std. Error z value Pr(>|z|) ON 0.01109972 0.003998763 2.775789 0.0055068

NB: Estimate Std. Error z value Pr(>|z|) NS 0.04238019 0.01086383 3.901034 9.57825e-05

PEI: Estimate Std. Error z value Pr(>|z|) NIL 0.0 0.0 0.0 0.0

The challenge with this is that the model fit isn't as great as what we had earlier on with the glm ie without constraints on the estimates.

I have uploaded the code as well to the repo.

ahurford commented 2 years ago

Thanks @francisanokye - I think I can take it from here for Figure 2 - probably what makes sense is to only fit NL - I need to rethink how we're doing this analysis.

Are you able to work on Figure 1 now? I think I made a mistake with the way I cleaned that file: travel_wk.csv - I should have left the cases as daily not aggregated by week. The file travel.csv is okay (daily aggregation) but it has the additional variables for active cases that are not needed, and it is missing the close contacts. If you can make the figures, I can do the aggregation with patchwork. Thanks for all this.

francisanokye commented 2 years ago

@ahurford no problem at all. I started working on Figure 1, trying to get it right, and I will revert.

ahurford commented 2 years ago

@francisanokye if you can rewrite the data cleaning script so it makes the data for Figure 1 (replacing travel_wk.csv) thank you!

francisanokye commented 2 years ago

@ahurford the upper part of the Figure 1 seem to conform to the image obtained below, however, the last three images G, H and I are not coming out as required. Still working it out.

Screen Shot 2022-06-29 at 6 37 04 PM
francisanokye commented 2 years ago

@ahurford sorry I'm being so slow on this. I'm trying to have a good understanding of the data, so I can rectify it. Your further insights and directions are highly appreciated.

ahurford commented 2 years ago

Hi @francisanokye - I just need to get this finished now, so I will take over now with finishing this up - thanks for your help! (I expect I probably made an error somewhere...)

francisanokye commented 2 years ago

@ahurford I think I have seen what I was doing wrong after going through your code for the plot. Thanks for this opportunity to learn.