alan-turing-institute / rds-course

Materials for Turing's Research Data Science course
https://alan-turing-institute.github.io/rds-course/

Module 2 #53

Closed jack89roberts closed 2 years ago

jack89roberts commented 3 years ago

Adds content for Module 2.

radka-j commented 2 years ago

Looking great, it's really comprehensive! Well done everyone! 🎉

I pushed a few small wording/grammar fixes in the first 4 sections. Will pick up the rest next week.

Some other suggestions below (of course feel free to ignore):

Legality and ethics

In the bias section, there are a couple of points where you immediately point to a reference for more info. It might be nice to add a sentence outlining an example or giving more context:

  1. In "Beyond data" you mention that bias doesn't just come from data. Do you have a single-sentence concrete example you could add here to help illustrate other sources of bias?
  2. In "Should a variable be used" you mention Yudell and the debate around the use of race as a variable. It feels that here too it might be nice to have a sentence that develops the idea a little more. A rough suggestion might be something along the lines of: "It is good to ask yourself: why are you using a certain variable, what information does it capture, and could other identifiers/variables be better suited to your analysis and the question you want to answer?"

Feature engineering

This is very minor, but in the binning section maybe explain the interval notation, i.e. that "(min,max]" in this case means min < x <= max.
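
As a rough illustration of that notation, here is a minimal sketch using `pandas.cut`. The ages and bin edges are made up for the example and are not taken from the course notebooks. With the default `right=True`, each bin excludes its lower edge and includes its upper edge:

```python
import pandas as pd

# Hypothetical values and bin edges, chosen only to show the edge behaviour.
ages = pd.Series([18, 25, 30, 31, 65])

# right=True is the default: each bin is (lower, upper], i.e. lower < x <= upper.
binned = pd.cut(ages, bins=[18, 30, 65])

first_bin = binned.iloc[2]   # the bin that the value 30 falls into
print(first_bin.left, first_bin.right, first_bin.closed)

# 18 sits exactly on the lowest edge, which is excluded, so it gets no bin (NaN).
print(binned.isna().tolist())
```

Passing `include_lowest=True` to `pd.cut` would put the value 18 in the first bin instead of leaving it as NaN.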

Image data

You mention different interpolation methods. Maybe add links that explain how they work (or at least link to documentation on the methods, as you do elsewhere).

Sections left to read:

radka-j commented 2 years ago

The biggest change I made (still very minor) is in the Database section of the Data sources and formats section. I moved the comparison of pros/cons of flat and relational databases to after the examples that give an idea of what these are.

lannelin commented 2 years ago

Thanks for the review so far and the great suggestions @radka-j! I've pushed some updates to "2.1.2 Legality and Ethics" and to "2.2.4.4 Image Data" in line (I hope) with the suggestions.

radka-j commented 2 years ago

Again, really well done everyone on all the hard work that has gone into this! I find the scope really impressive, and I also like that you manage to keep the details simple and to the point, focusing on just what is needed for the practicals. I really feel that it gives a good flavour of the knowledge that is generally required without being (too) overwhelming 😊

radka-j commented 2 years ago

Some further comments on the remaining sections:

Privacy and anonymisation:

In the differential privacy section, again it might be nice to have a single-sentence example that gives some intuition for how it is possible to learn something about an individual from a model, for those who have not come across this idea before.
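
One possible intuition, sketched below with plain aggregate queries rather than a trained model (that substitution is my simplification, and the names and salaries are invented): if an attacker can ask for a total with and without one person included, the difference reveals that person's value exactly.

```python
# Hypothetical records; nothing here comes from the course data.
salaries = {"alice": 30_000, "bob": 32_000, "carol": 31_000, "dave": 95_000}


def total_salary(names):
    """An 'aggregate-only' query: returns a sum, never an individual row."""
    return sum(salaries[name] for name in names)


# Two harmless-looking aggregate queries...
everyone = total_salary(salaries)
everyone_but_dave = total_salary(name for name in salaries if name != "dave")

# ...whose difference is exactly one individual's salary.
leaked = everyone - everyone_but_dave
print(leaked)
```

Differential privacy aims to limit this kind of leak by adding calibrated noise, so that results change very little whether or not any one individual is in the data.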

Linking datasets:

This is super minor, but do you have any links to reading on probabilistic matching? If not to hand it's fine, but it might be nice to have.

Missing data section:

I wonder whether it is worth mentioning some more complex approaches like MICE (Multivariate Imputation by Chained Equations) for dealing with missingness - there seems to be a Python package called fancyimpute that does this (btw, the answer might be that this is unnecessarily complicated to worry about here. For example, I have never used it in practice, I just know it exists).
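
For a sense of what this could look like, here is a minimal sketch. It uses scikit-learn's `IterativeImputer` (a MICE-inspired imputer that is still marked experimental) rather than fancyimpute, and the toy array is made up for illustration:

```python
import numpy as np
# IterativeImputer is experimental, so this enabling import is required first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data: the second column is roughly twice the first.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [np.nan, 8.0],
    [5.0, 10.0],
])

# Each feature with missing values is modelled from the other features,
# cycling round-robin for up to max_iter rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

This returns a single completed dataset. Classic MICE draws several imputed datasets and pools the analyses, which this sketch does not attempt.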

I also think that @ots22 and @pwochner have been working on a project that deals with missing data. I wonder if it would be worth having their input on this section (e.g., maybe there is something you'd add in terms of how to think about missingness and why/how it occurs and what might be worth checking for in the data - the section perhaps doesn't need expanding but maybe there are some interesting blogs/resources on the topic to link to here).

radka-j commented 2 years ago

Data consistency:

I might have missed this, so apologies if so, but do you have a link to the iris dataset documentation? You refer to it in the Text and Categorical Columns section when you say that you only expect 3 types of species in the data. It would be good to have a link to the docs somewhere at the start, and at that point reiterate that checking the docs is an important part of checking the data.
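
The kind of check meant here could look roughly like the sketch below. The small data frame and the `species` column name are assumptions for illustration, not the notebook's actual code. The three expected names are the species in the standard iris dataset:

```python
import pandas as pd

# The three species documented for the standard iris dataset.
expected_species = {"setosa", "versicolor", "virginica"}

# Hypothetical messy column with a capitalisation variant and a typo.
df = pd.DataFrame(
    {"species": ["setosa", "versicolor", "virginica", "Setosa", "virginca"]}
)

# Anything present in the data but absent from the docs deserves a closer look.
unexpected = set(df["species"].dropna().unique()) - expected_species
print(sorted(unexpected))
```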

radka-j commented 2 years ago

I have now read all the sections. I have not yet run through the hands-on exercise (just read over it) and I will do that later this week (probably Wednesday). Also let me know when the overview section is ready :)

ots22 commented 2 years ago

I also think that @ots22 and @pwochner have been working on a project that deals with missing data. I wonder if it would be worth having their input on this section

Definitely! Comments inbound soon...

jack89roberts commented 2 years ago

Thanks @radka-j! I've made a few edits based on your suggestions, and also finished off the Feature Engineering section today. Still have some TODOs in the first "where to find data" section, and me & James still need to look at the overview. But I think we're getting there!