Hi Ada,

Looks really good already :) As discussed, here are just a few suggestions and thoughts to step up your DS game :)
The thing about data science is that it's not solely about the code; it's also a lot about ways of working & reproducibility. Here are some suggestions along these lines:
save your helper functions in an R script (this keeps notebooks a lot cleaner, so that the notebooks are just used to call functions)
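A minimal sketch of what that could look like — the file name `helpers.R` and the function are just placeholders, not something from your repo:

```r
# helpers.R -- collect reusable functions here instead of defining them inline
summarise_new_cases <- function(df) {
  # placeholder helper: median of the New_cases column, ignoring missing values
  median(df$New_cases, na.rm = TRUE)
}
```

In the notebook you then only need `source("helpers.R")` at the top and can call `summarise_new_cases(WHO_COVID19_UK)` directly.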
dependency management is quite important for someone else to replicate your project. You can include a requirements.txt file or use packrat to indicate which libraries are required for someone else to run and replicate your code.
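If you go the packrat route, the workflow could look roughly like this (a sketch, not a full setup guide):

```r
install.packages("packrat")
packrat::init()      # set up a project-private library for the current project
# ...install the packages your notebooks need as usual...
packrat::snapshot()  # record the exact package versions in packrat/packrat.lock
```

A collaborator who checks out the repo can then run `packrat::restore()` to recreate the same library.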
generally it is also good to keep the directory structure in mind; for data science there are plenty of examples out there, for example this one (at the bottom there are also more links)
Notebooks are quite commonly used, but they should not be the only tool. Still, as a first step, here are some suggestions to organise notebooks:
really nice that you already mention some conclusions/takeaways in comments. Another approach would be to use markdown to format your text comments, which makes them easier and prettier to read; the same applies to headers, e.g. Exploratory analysis
You have a lot of really nice plots! Mention in markdown what can be seen in them and what you can conclude from the plots
while functions like head, nrow or is.na are super useful for checking that your code runs as expected, their output doesn't have to be saved in the notebook once it is committed to the repository, as it creates a lot of noise for someone looking at it for the first time.
don't comment out unused functions, just remove them! :) Commented-out code doesn't contribute to the readability of the notebook, and version control keeps the history if you ever need it back
If you want to print out several values to discuss them, use print() and print all the values in a single cell. The glue package might also be interesting; it lets you add some more information while printing, e.g.
library(glue)
print(glue("Median of cumulative cases: {median(WHO_COVID19_UK$Cumulative_cases)}"))
print(glue("Median of new cases: {median(WHO_COVID19_UK$New_cases)}"))
print(glue("Median of cumulative deaths: {median(WHO_COVID19_UK$Cumulative_deaths)}"))
And here are some more data-science-related suggestions:
great that you have a train/test split; typically it would be good to sample it randomly from the data. With the current approach you make it dependent on the index, which is a nice idea, but could cause issues if your data happens to be sorted, for instance by time or country. You can use R's built-in sample() function for that. Here is an example of the implementation.
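A random split could look roughly like this — a sketch reusing the data frame name from your glue examples, with an assumed 80/20 ratio you can of course adjust:

```r
set.seed(42)                                    # fix the seed for reproducibility
n <- nrow(WHO_COVID19_UK)
train_idx <- sample(n, size = floor(0.8 * n))   # random 80% of row indices
train <- WHO_COVID19_UK[train_idx, ]
test  <- WHO_COVID19_UK[-train_idx, ]           # the remaining 20%
```

Setting the seed matters here: without it, every re-run of the notebook produces a different split and your results won't be reproducible.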
to investigate the performance of your model even further, it is typical to also look at residual plots
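For lm() models, base R already generates the standard diagnostic plots; `fit` below is just a placeholder for one of your fitted models:

```r
# assuming `fit` is one of your lm() models
plot(fit, which = 1)   # residuals vs fitted values: look for patterns/curvature
plot(fit, which = 2)   # normal Q-Q plot: check whether residuals look Gaussian
```

A clear pattern in the first plot (rather than a random cloud around zero) usually hints that a linear fit is missing some structure in the data.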
Linear regression models are sensitive to features on different scales, so scaling/centering is typically recommended. Let's say feature 1 ranges from 0 to 1 and feature 2 from 0 to 1000; given the different scales, interpretation gets quite difficult. Here is a more extensive explanation of the issue if you are interested :)
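In R this is a one-liner with scale(); the column selection below is just an illustration using names from your data set:

```r
# scale() centers each column (subtracts the mean) and scales it
# (divides by the standard deviation), so all features end up comparable
scaled_feats <- scale(WHO_COVID19_UK[, c("New_cases", "Cumulative_cases")])
```

One thing to watch out for: compute the means/SDs on the train set only and reuse them for the test set, otherwise information leaks from test into training.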
Train set predictions are typically always better than test set predictions (if not, something is really off haha). That's because predictions on the train set are more likely to overfit, and evaluating on the test set tells you how well your model generalizes. Therefore, be careful with drawing conclusions from a model when you have only evaluated its train set performance.
Given the constructed prediction models, which one would you choose to use? A next step might be to look into the feature importances of those models before you make a decision ;)
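For linear models, one simple proxy for importance is the size and significance of the coefficients, ideally on scaled/centered features so they are comparable; `fit` is again a placeholder for one of your models:

```r
# assuming `fit` is an lm() model trained on scaled features,
# the coefficient table gives a rough sense of each feature's contribution
summary(fit)$coefficients  # estimates, std. errors, t-values, p-values
```

Features with large absolute (standardized) estimates and small p-values are the ones the model leans on most; that context helps when deciding between models.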
Please really see the above just as suggestions :) If anything is unclear, do let me know, I would love to clarify it further!
Cheers, Janine