Chicago / predicting-e-coli-concentrations

This repository is part of the working draft for an upcoming academic paper describing the methods and results of the City of Chicago Clear Water project.

Define other attributes were included in the models? #47

Closed tomschenkjr closed 6 years ago

tomschenkjr commented 6 years ago

Line 148: What other attributes were included in the models? Did the models include all of the meteorological data that is collected at the beaches, as well as wave characteristics and similar measurements? This is unclear to the reader. Please include a list of all attributes included in the model. Wave intensity can affect water quality because it can resuspend E. coli and other organisms from the beach sediment into the water. These additional attributes may improve the performance of the model (if not already included).

tomschenkjr commented 6 years ago

Need to clarify in text that the only attributes were qPCR levels at the 5 beaches and the name of the beach that is being predicted.

The "implicit" question in the comment is why we are not including predictors that have been used in other models. In response, let's include other predictors to show the decline in performance. Let's limit the model with other predictors to only those with the full 10 years of covariates (excluding turbidity and other time-limited sensor-based data).

For now, do not include this "3rd model" in the paper, but prepare the results for discussion in a later meeting.

nicklucius commented 6 years ago

For the "3rd Model", here are the attributes I'll include.

Variables cited in our paper from prior literature

Precipitation

Sunlight

Wind

Tidal levels

Lake levels

Density of humans and animals

Variables NOT cited in our paper from prior literature

Lock openings

nicklucius commented 6 years ago

Adding the meteorological and hydrometeorological predictors ("other attributes") listed above resulted in a slight decrease in model performance. Ultimately, these "other attributes" were not used in our final model, for two reasons: including them brought no benefit, and our best model using only "other attributes" never proved to be any better than the USGS model.

| Model Predictors | Time Period | AUC |
|---|---|---|
| E. coli Levels* | 2006 - 2016 | 0.832 |
| E. coli Levels* plus "other attributes" listed above | 2006 - 2016 | 0.829 |
| "Other attributes" only | 2006 - 2016 | 0.627 |

*Same-day culture E. coli test results for Foster, North Ave, 31st, Leone, and South Shore
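For readers unfamiliar with the metric, here is a minimal sketch of how an AUC like the ones in the table can be computed from predicted scores and observed exceedances. It is not the project's code (the repository's analysis is run through knitr). The function name `auc` and the toy inputs are made up for illustration, and the rank-based (Mann-Whitney) formulation is just one standard way to get the number.

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    labels: sequence of 0/1 values (1 = exceedance observed).
    scores: predicted risk for the same beach-days, higher = riskier.
    Tied scores share the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current group of tied scores.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")

    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Hypothetical toy data: four beach-days, two of them exceedances.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(toy_labels, toy_scores)
```

An AUC of 0.5 means the scores rank exceedance days no better than chance, which is the baseline the 0.627 row should be read against.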

tomschenkjr commented 6 years ago

For the list of variables, precipitation, sunlight and wind were collected via the Dark Sky / Forecast.io API, right? And where did the tidal and lake levels come from?

nicklucius commented 6 years ago

Correct about the API. Moon phase was used as a proxy for tides, which also came from Dark Sky. Lake levels are from NOAA, and they measure at Calumet Harbor.
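To make the moon-phase proxy concrete: Dark Sky's daily data reported moon phase as a fraction of the lunation (0 for a new moon, 0.5 for a full moon). The sketch below approximates the same kind of fraction from a date alone, without calling any API. It assumes a constant mean synodic month and a commonly used reference new moon (6 January 2000, 18:14 UTC), so it can drift from true phase by several hours. The function and constant names are hypothetical and not taken from the repository.

```python
from datetime import datetime, timedelta, timezone

# Assumptions for this sketch: mean lunation length and a reference new moon.
SYNODIC_MONTH_DAYS = 29.530588853
REFERENCE_NEW_MOON = datetime(2000, 1, 6, 18, 14, tzinfo=timezone.utc)


def moon_phase_fraction(when):
    """Approximate lunation fraction in [0, 1) for a timezone-aware datetime.

    0 is roughly a new moon and 0.5 roughly a full moon.
    """
    days_since_reference = (when - REFERENCE_NEW_MOON) / timedelta(days=1)
    return (days_since_reference / SYNODIC_MONTH_DAYS) % 1.0


# Example: approximate phase on the date of this comment thread.
sample_phase = moon_phase_fraction(datetime(2018, 6, 12, tzinfo=timezone.utc))
```

Because the fraction wraps from just under 1 back to 0 at each new moon, a model that treats it as an ordinary number will see two nearly identical days as far apart.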

tomschenkjr commented 6 years ago

Thanks. Re-reading #55, the reviewer mentioned that additional models can be discussed in the supplementary appendix. I will be moving the discussion about this model to the appendix.

tomschenkjr commented 6 years ago

@nicklucius - I've pushed a draft to the issue47 branch. I've included the table of variables, please insert the table showing results.

This is in the appendix so we can include in the supplement, but not edit the main paper.

nicklucius commented 6 years ago

@tomschenkjr - I've made my additions for this issue and pushed to issue47. Here are the commits where I forgot to add the #47 tag.

https://github.com/Chicago/predicting-e-coli-concentrations/commit/2deb4845b9b4e17015f15dc8f1e5c82b9d5aed22
https://github.com/Chicago/predicting-e-coli-concentrations/commit/f393f473cf04fb2a9618b40c4d0537785352e7c5
https://github.com/Chicago/predicting-e-coli-concentrations/commit/663d4d106ef51f00a4258e208436650d5f5c7035
https://github.com/Chicago/predicting-e-coli-concentrations/commit/6fb073041754477205e8cfc703f575c1a9210b13
https://github.com/Chicago/predicting-e-coli-concentrations/commit/79301f5debc3668d6ed827d2dc5052396c62760f
https://github.com/Chicago/predicting-e-coli-concentrations/commit/12b8182d41c971847ab9b1dd506a507b728ed0cd

tomschenkjr commented 6 years ago

@nicklucius - can you make a pull request when you're ready to consider it for inclusion?

tomschenkjr commented 6 years ago

Reviewing the language, it seems the multivariate model's AUC is slightly higher than the interbeach correlation model's (multivariate: 0.846 v. interbeach-only: 0.833). Is this correct?

nicklucius commented 6 years ago

Yes that's correct. Originally it was reversed (interbeach-only was slightly higher) but once I fixed the wind bearing coding in the model, the multivariate bumped up a bit.
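The thread does not say what the wind bearing coding fix was. For illustration only, one common pitfall with a bearing in degrees is treating it as an ordinary number, which puts 359° and 1° at opposite ends of the scale even though they point almost the same way. A minimal sketch of one usual remedy, assuming the bearing is a compass angle in degrees, is to split it into north and east components. The function name `encode_bearing` is hypothetical.

```python
import math


def encode_bearing(bearing_degrees):
    """Map a compass bearing in degrees to (north, east) unit-circle components.

    0 degrees (north) maps to (1, 0) and 90 degrees (east) maps to about (0, 1).
    """
    angle = math.radians(bearing_degrees % 360.0)
    return math.cos(angle), math.sin(angle)


# Bearings on either side of due north end up close together.
just_west_of_north = encode_bearing(359.0)
just_east_of_north = encode_bearing(1.0)
gap = math.dist(just_west_of_north, just_east_of_north)
```

With this encoding the two components can be fed to a model as separate numeric predictors, and the distance between nearby bearings stays small across the 0°/360° boundary.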

tomschenkjr commented 6 years ago

Based on the performance of the multivariate model, we should include the multivariate model in the main body. I know it's not a big performance gain, but it'll be harder to argue why it's not included in the main portion.

Fortunately, most of the hard work has already been done. I think we just need to do the following (with associated people responsible):

tomschenkjr commented 6 years ago

@nicklucius - I just noticed a big difference in the AUC listed in Table 2 for the Hybrid model compared to Table A.3 (which compares to multivariate models). Table 2 shows an AUC of 0.728 and Table A.3 shows it as 0.837.

Which one is right?

nicklucius commented 6 years ago

Yes, there’s an explanation for that. Table 2 is showing the AUC of the Hybrid 2017 Pilot model that was operationalized, which trained on qPCR and therefore was limited to 2 years of data and did not have the k-means beach selection advantage. Table A.3 is showing a version of the Hybrid Model that trains on culture tests going back to 2006 and contains the k-means beach selections.

tomschenkjr commented 6 years ago

@nicklucius - we may need to harmonize the methodology with its inclusion into the main body. It will simplify it for the reader.

I've pushed changes to issue47 with some of the tasks done. Take a look at the "Predictors and Covariates" section as well as Section 3. See how I'm shaping the narrative and see if that makes sense.

tomschenkjr commented 6 years ago

Per my previous comment, is it feasible to use the same training data set that we used in Section 3 to look at the performance of the multivariate model?

nicklucius commented 6 years ago

@tomschenkjr - I see how you've arranged it and I think it makes sense. So in Section 3 we'll present the 2017 pilot results and then provide direct comparisons to the Chicago prior-day model and our own multivariate model for the same training set. If I understand correctly, the idea is to show how the multivariate model compares to the Hybrid model under the same conditions (same training beach/days and same validation set).

I should be able to modify the multivariate model to essentially the same training/validation set as used for the 2017 pilot. I'll look at the code and work on it.

nicklucius commented 6 years ago

@tomschenkjr - I did my parts and pushed up to the issue47 branch. The only thing I'm not sure about is the reference above to Table 2 changing to Table 3. Let me know if something is still needed.

tomschenkjr commented 6 years ago

And this one is done!

@nicklucius - can you generate this paper? On my computer, I noticed that Table 2 should be the list of covariates and should be in Section 2.5. However, my computer is still placing it in Table A.2 in the appendix. I think it's just a problem on my end. Can you verify?

tomschenkjr commented 6 years ago

I've also merged and pushed this to dev.

nicklucius commented 6 years ago

Awesome!

@tomschenkjr - I think I'm getting the same thing as you. The "Beach Correlation" heat map is being placed in the appendix as Table A.2 rather than Table 2.

tomschenkjr commented 6 years ago

Do you get any error or warning messages when generating the PDF?

nicklucius commented 6 years ago

None related to the table, just warning messages about 2 non-issues:

Warning messages:
1: Removed 304 rows containing missing values (geom_path). 
2: In in_dir(input_dir(), evaluate(code, envir = env, new_device = FALSE,  :
  You changed the working directory to /home/267226/predicting-e-coli-concentrations/clear-water (probably via setwd()). It will be restored to /home/267226/predicting-e-coli-concentrations. See the Note section in ?knitr::knit

tomschenkjr commented 6 years ago

Got the same errors. I'll fuss with the tables since it's not an issue unique to me.
