[WIP] Using wastewater data to monitor SARS-CoV-2 in England

AoifeHughes commented 2 years ago

Summary

Created the story as outlined in issue: https://github.com/alan-turing-institute/TuringDataStories/issues/193

List of changes proposed in this PR (pull-request)

Adds a Python Jupyter notebook story
Several data files used in the story
Instructions on how to prepare data for analysis

What should a reviewer concentrate their feedback on?

[ ] Amount of background information
[ ] Visuals are clear
[ ] Techniques are explained clearly
[ ] Reasoning is made clear
[ ] General suggestions to increase quality of notebook
[ ] Everything looks ok?

Acknowledging contributors

[ ] All contributors to this pull request are already named in the table of contributors in the README file.
[x] The following people should be added to the table of contributors in the README file: AoifeHughes

crangelsmith commented 2 years ago

Thank you @AoifeHughes! We prefer not to store the data here in the TDS repo. Could you put the data somewhere citable, like Zenodo, and then download and load it in the notebook?

AoifeHughes commented 2 years ago

Yes! Will move the data out and update soon!

AoifeHughes commented 2 years ago

Commit #9692c20 has added a link and download code for the data

review-notebook-app[bot] commented 2 years ago

View / edit / reply to this conversation on ReviewNB

callummole commented on 2022-10-10T13:55:07Z ----------------------------------------------------------------

I suggest a proofread with a focus on general grammar and flow.

Some initial examples:

In London, around 1854 Jon Snow -> In London, around 1854, Jon Snow

...testing mechanisms -> ...testing mechanisms:

for reducing other bias' such as socio-economic as everyone... -> for reducing other biases, such as socio-economic, as everyone...

AoifeHughes commented on 2022-10-11T07:41:05Z ----------------------------------------------------------------

✅

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:08Z ----------------------------------------------------------------

Again, the content is good but some proofreading is needed.

E.g.

Several countries have been activity using wastewater as a data collection method including: -> Several countries have been actively collecting data using wastewater, including:

I suggest that we are more specific in the "key questions" sentence, since this will help the reader follow the notebook later on when you are narrowing focus.

...key questions are whether wastewater can be monitored for COVID-19 particle concentration and if this is even useful data? ->

...key questions are: 1) can COVID-19 particle concentration be measured in wastewater? 2) Can the measurements be used to track COVID-19 prevalence?

I also suggest highlighting the stages of the notebook at the end of this notebook. Something like:

"The research question of interest is whether COVID-19 concentration, as measured from wastewater, tracks COVID-19 case numbers. This notebook first explores the reliability of the COVID-19 concentration measurements across both space (different UK regions) and time, then models how well the wastewater measurements predict case numbers."

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:09Z ----------------------------------------------------------------

typo: organsing

It is unclear what the running and installation file is. Can we link to it?

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:10Z ----------------------------------------------------------------

It is unclear what is meant by "entries of non-tabular data surrounding the elements we are interested in". I think it is clearer to say that there are header rows that we are not interested in.

It would be clearer and more concise, I think, to say:

For example, the data is provided by the English government in an imperfect format for data processing. There are header rows we are not interested in, and there are multiple "sheets" of data. Using pandas we can easily skip the first few rows and read a specific sheet when we load the data.
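
A minimal sketch of the loading step, in case it helps the wording. The file contents, column names and row count below are made up, and the example parses an in-memory CSV so that it runs without the real workbook. `pd.read_excel` accepts the same `skiprows` argument and additionally takes `sheet_name` to pick one sheet.

```python
import io

import pandas as pd

# Hypothetical stand-in for the published file: two header rows we do not
# want, followed by the real column names and the data.
raw = io.StringIO(
    "Wastewater monitoring programme\n"
    "Published by a public body\n"
    "region,site,gc_per_l\n"
    "North East,Site A,1200\n"
    "London,Site B,3400\n"
)

# skiprows=2 drops the two unwanted header rows before parsing.
df = pd.read_csv(raw, skiprows=2)
print(df)

# For the real workbook the equivalent call would be along the lines of
#   pd.read_excel(path, sheet_name=<sheet>, skiprows=<n>)
# where the sheet and the number of rows to skip depend on the file.
```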

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:11Z ----------------------------------------------------------------

So really there is only 2 identifiers -> So really there are only 2 identifiers.

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:12Z ----------------------------------------------------------------

it is much smaller a region -> it is a much smaller region

As a reader I'm not sure how the total number of sites and regions will help direct the analysis? Can you explain why it helps?

e.g. We want to assess whether you can track COVID-19 with wastewater. You could do this at different levels of granularity: for each site, for each region, or for the entire country.

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:12Z ----------------------------------------------------------------

This is the first time 'case numbers' is introduced. It would be helpful if case numbers were introduced in the introduction so it is clearer where the notebook is headed.

Where each region... -> Each region...

AoifeHughes commented on 2022-11-09T10:23:46Z ----------------------------------------------------------------

https://github.com/alan-turing-institute/TuringDataStories/pull/206/commits/4e57d5be93fec1f889dfdc82fbed2cd93f1a3366 - solves this and a few prior points.

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:13Z ----------------------------------------------------------------

It is not clear at this stage why you need to take the sum of the population. I would include something simple here, for example that disease transmits more readily in densely populated areas, so population could usefully explain differences in COVID-19 concentration across regions.

It would also be clearer if you stated that you are averaging across sites.
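
To illustrate the two aggregations side by side, here is a small sketch. The table and its column names (`region`, `site`, `population`, `gc_per_l`) are hypothetical and not taken from the notebook.

```python
import pandas as pd

# Hypothetical site-level table; the values are illustrative only.
sites = pd.DataFrame({
    "region": ["London", "London", "North East"],
    "site": ["A", "B", "C"],
    "population": [500_000, 300_000, 200_000],
    "gc_per_l": [1000.0, 3000.0, 500.0],
})

regional = sites.groupby("region").agg(
    population=("population", "sum"),     # total people covered in the region
    mean_gc_per_l=("gc_per_l", "mean"),   # concentration averaged across sites
)
print(regional)
```

Naming the output columns this way makes it explicit in the code that population is summed while concentration is averaged.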

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:14Z ----------------------------------------------------------------

It might be useful to state why percentage difference from the mean is useful here. E.g. generally, variation around an average increases as the magnitude (size) of the average increases, so we would expect a 20k difference between 20k and 40k to be more notable than the same difference between 80k and 100k. Comparing percentage difference from the mean could therefore highlight differences more usefully than comparing the absolute values.
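
To make the 20k example concrete, here is a small sketch. The helper name `pct_diff_from_mean` is hypothetical, and for illustration the lower figure of each pair is treated as the mean.

```python
def pct_diff_from_mean(value, mean):
    """Signed percentage difference of `value` relative to `mean`."""
    return 100.0 * (value - mean) / mean


# The same absolute gap of 20k at two different magnitudes.
low = pct_diff_from_mean(40_000, 20_000)
high = pct_diff_from_mean(100_000, 80_000)
print(low, high)  # 100.0 25.0
```

The absolute gap is identical, but relative to the mean it is four times larger in the first case.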

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:15Z ----------------------------------------------------------------

I would add % to the y-label so that a reader flicking through the figures and not reading the text does not get confused.

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:16Z ----------------------------------------------------------------

Is there a reason why this is absolute rather than signed? I think it may be more instructive to be signed, since the reader knows that the mean would lead to over/under estimation.

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:17Z ----------------------------------------------------------------

Can you give an indication here of what magnitude of values should be considered notable?

Also, is it possible to be clearer what you mean by "true insights"? I think you mean that the between-region variation itself varies substantially over time, so much so that the national mean would not be a useful signal to track the case numbers of a region at a specific time window? Maybe pick out an example to illustrate this point.

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:17Z ----------------------------------------------------------------

This is nice, but the exponential curve should be given as $y = me^{bx}$. You do not want $m^{bx}$.

To help intuition, maybe state that $b$ controls the rate of growth, and $m$ is the value when time (here, $x$) is 0, since $e^0 = 1$.
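
A quick numerical check of that intuition, as a sketch. The function mirrors the `m*np.exp(b*x)` form used in the notebook code; the parameter values are illustrative and not fitted to any data.

```python
import numpy as np


def exp_f(x, m, b):
    # y = m * e^(b x): m is the value at x = 0, b is the growth rate.
    return m * np.exp(b * x)


m, b = 50.0, 0.1  # illustrative values only
x = np.array([0.0, 1.0, 2.0])

y = exp_f(x, m, b)
wrong = m ** (b * x)  # the m^(bx) form, shown only for contrast

print(y[0])      # equals m, because e^0 = 1
print(wrong[0])  # equals 1 whatever m is, so m no longer sets the start value
```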

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:18Z ----------------------------------------------------------------

Line #3.        return m*np.exp(b*x)

See comment in the previous cell wrt exponential formula. The code version is correct but does not match with the formula in the text (which is wrong).

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:19Z ----------------------------------------------------------------

Note that, given the wastewater (WW) pattern, the fit will vary massively depending on the training values picked.
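
A sketch of that sensitivity on a synthetic, non-monotonic signal (made up, not the wastewater data). For simplicity it fits the exponential by least squares on $\log y$ with `np.polyfit`, which is a stand-in for the notebook's `curve_fit` call rather than the same procedure.

```python
import numpy as np


def fit_exponential(x, y):
    """Fit y = m * exp(b * x) by least squares on log(y); requires y > 0."""
    b, log_m = np.polyfit(x, np.log(y), 1)
    return np.exp(log_m), b


# Synthetic signal that rises and then falls (illustrative only).
x = np.arange(60, dtype=float)
y = 100.0 + 50.0 * np.sin(x / 10.0)

m_rise, b_rise = fit_exponential(x[:15], y[:15])      # window on the rising part
m_fall, b_fall = fit_exponential(x[20:45], y[20:45])  # window on the falling part
print(b_rise, b_fall)
```

The two windows give growth rates of opposite sign, so any extrapolation depends entirely on which window was used for training.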

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:20Z ----------------------------------------------------------------

It may be clearer here if you write out the exponential formula again, indicating wastewater and cases.

I found it hard to follow why you were predicting cases from wastewater given the relatively flat relationship. I think the logic of the text + figure is that wastewater does not follow an exponential relationship and therefore would not be a good predictor of COVID-19 cases (implicitly assuming that you run wastewater through an exponential function).

But that's a bit confusing. If both wastewater and cases could be modelled exponentially, then I would expect the relationship with each other to be linear rather than exponential (as you say, with some proportionality). If wastewater was a monotonically increasing signal, like date, then you would expect a nice exponential relationship when predicting cases.

Neither of those is the case. In the above plot you show that the concentration training values decrease over time, on average. Maybe point out that this would make it difficult to capture the exponential growth in cases, and the best you could hope for is a flat line (I think?).
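
To spell out the "linear rather than exponential" point, here is a short worked derivation. The symbols $c$ (cases), $w$ (wastewater concentration), $m_c$, $m_w$, $b_c$ and $b_w$ are my own notation, assuming both series are exponential in time $t$:

```latex
\begin{align*}
c(t) &= m_c e^{b_c t}, \qquad w(t) = m_w e^{b_w t} \\
t &= \frac{1}{b_w} \ln\frac{w}{m_w} \\
c &= m_c \left(\frac{w}{m_w}\right)^{b_c / b_w}
\end{align*}
```

If the two growth rates are equal ($b_c = b_w$, i.e. wastewater is proportional to cases), this reduces to $c = (m_c / m_w)\,w$, a straight line. If they differ, the relationship is a power law in $w$, which is still not exponential.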

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:21Z ----------------------------------------------------------------

I keep having to flick back and forth to remind myself what y1, y2, y3 etc are. Perhaps consider giving them more informative names? Or in the text clearly write the formula with the names specified.

AoifeHughes commented on 2022-11-09T13:13:19Z ----------------------------------------------------------------

Completely agree, the variable namings are horrible here - will take a minute to actually sort these out, though! Will come back to this one

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:22Z ----------------------------------------------------------------

Line #4.    popt, pcov = curve_fit(exp_f, x1, y1_est, p0=[start,start], maxfev=50000)

I was confused by the two-stage approach, until I realised that stage was simply mapping the y1_est onto the wider date range, with no loss of info because it already has an exponential form. Maybe add a comment about this?

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:23Z ----------------------------------------------------------------

The formula is different to the code.

I think what is wanted is:

$y = me^{bx_1} \cdot cx_2$.

Maybe also explain that date is going to be exponentially related, but that this is scaled by wastewater. (Also, why scaled instead of additive? Is it because we are still assuming that WW is proportional to cases?)
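
A sketch of this form as a function in the shape `curve_fit` expects, assuming $x_1$ is a date index and $x_2$ is wastewater gc/l. The function name and the input values are hypothetical.

```python
import numpy as np


def date_wastewater_model(X, m, b, c):
    """y = m * exp(b * x1) * c * x2, with x1 = date index and x2 = wastewater gc/l."""
    x1, x2 = X
    return m * np.exp(b * x1) * c * x2


X = np.array([
    [0.0, 1.0, 2.0],     # x1: days since start (illustrative)
    [10.0, 20.0, 30.0],  # x2: wastewater gc/l (illustrative)
])

y_a = date_wastewater_model(X, m=2.0, b=0.1, c=3.0)
y_b = date_wastewater_model(X, m=6.0, b=0.1, c=1.0)
print(np.allclose(y_a, y_b))  # True: only the product m * c matters
```

Because $m$ and $c$ only enter as a product, a fit cannot separate them, so it may be simpler to keep a single scale parameter in both the formula and the code.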

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:24Z ----------------------------------------------------------------

From eyeballing the graphs it is confusing why this model is better than the date-only model. The date-only model looks like a better fit. Do you have an idea why?

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:24Z ----------------------------------------------------------------

The last entry is misleading.

I read case # as "case numbers", and also the gc/l with exponential function is misleading because you are not exponentiating gc/l anymore, I believe.

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:26Z ----------------------------------------------------------------

It might be worth talking through how you reach the conclusion that wastewater data is affected by the rain. From the above plot I don't intuit much of a relationship.

The formula specification is much nicer here (though still, you do not exponentiate $a$, you do $ae^{bx_d}$), and it allows for easier comparison with the code.

AoifeHughes commented on 2022-11-09T13:26:43Z ----------------------------------------------------------------

Mentioned in the "# weather data" section; have added clarity there also.

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:27Z ----------------------------------------------------------------

Line #2.        return m*np.exp(d*X[0]) * (X[1]*X[2]*w)

can we use the same values that you used in the formula?
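
One possible shape for this, as a sketch only. It keeps the arithmetic of the quoted line unchanged and just unpacks `X` once into named rows. Which row holds wastewater and which holds rainfall is a guess on my part, as are the names `rain_model`, `x_date`, `x_wastewater` and `x_rain`.

```python
import numpy as np


def rain_model(X, m, d, w):
    # Unpack once so each term can be read against the formula.
    # The row order (date, wastewater, rain) is an assumption.
    x_date, x_wastewater, x_rain = X
    return m * np.exp(d * x_date) * (x_wastewater * x_rain * w)


# Illustrative inputs: one row per variable, one column per observation.
X = np.array([
    [0.0, 1.0],
    [2.0, 3.0],
    [4.0, 5.0],
])
print(rain_model(X, 1.5, 0.2, 0.5))
```

The signature is still `f(xdata, *params)`, so it can be passed to `curve_fit` in the same way as before.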

review-notebook-app[bot] commented 2 years ago

callummole commented on 2022-10-10T13:55:28Z ----------------------------------------------------------------

Unless I've missed something (definitely possible), I think a couple of the conclusions aren't supported by the analysis:

the inclusion of rainwater giving some increase in accuracy -> The conclusion isn't supported. We show that date + wastewater is better ($R^2$ = .82) than date + rain*wastewater ($R^2$ = .36). We do not compare rain*wastewater with wastewater alone; this version has date in.

Wastewater gc/l could be used to monitor disease trend -> I'm unconvinced by this claim based on your analysis. It seems to me that your analysis shows that wastewater does not add much if you can already fit date to some case numbers (i.e. wastewater cannot replace case numbers). Based on the regional plots I was expecting to see a simple linear regression showing that it fits well through winter 2021, but not for July to September 2021, where you have the discrepancy between WW and case numbers throughout most/all regions.

I think the combination of an exponential model and specific training values means that wastewater shows up to be a very poor predictor, mainly because of this divergence of trends in summer 2021. The inclusion of modulating gc/l with rainfall doesn't improve things (in fact, $R^2$ decreases).
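
For a like-for-like comparison, each model variant (wastewater alone, date alone, date + wastewater, date + rain*wastewater) could be scored on the same set of points with one shared helper. A minimal sketch of such a helper follows; I do not know how the notebook currently computes $R^2$, so the function below is an assumption, not a description of it.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y = [1.0, 2.0, 3.0]
print(r_squared(y, y))                # perfect fit
print(r_squared(y, [2.0, 2.0, 2.0]))  # always predicting the mean
```

Computed this way, a model that does worse than predicting the mean gives a negative value, which is worth flagging if it appears for any of the variants.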

AoifeHughes commented 2 years ago

[ ]

AoifeHughes commented 2 years ago

Super important point raised during review that needs to be done after solving other issues (so as not to have to repeat, or diverge again)

[ ] Refactor naming of modelling variables to make more sense
[ ] Create consistency of variable names
[ ] Add clarity in array values and match with names in equations
