Chicago / west-nile-virus-predictions

Algorithm to predict repeated positive results for West Nile Virus for mosquitoes captured in traps across Chicago.
MIT License
14 stars 1 forks source link

Fix trap locations #8

Closed geneorama closed 7 years ago

geneorama commented 8 years ago

Currently we have several missing trap locations, and trap locations.

We have coordinates for all of the locations in our private database, but they may contain personal information such as "XYZ Household" with an exact address. To avoid releasing private data, we round the locations to a block level, and geocode the rounded location. However, this is problematic when the rounded location falls outside the city boundary (and the geocoder fails) and there may be other problems.

The current proposal is to create a table of locations that is manually derived as such:

Use the location coordinates when available from the public data When the public data is missing I'll add coordinates manually that correspond to a rounded address that I will manually geocode A better long term solution would be to include a rounded location in the private database which can be directly used for each trap.

tomschenkjr commented 8 years ago

Per our discussion earlier, let's keep track of this but fix the data issues through @levyj. Flag any issues here and be sure to @ mention him. Right now, just use the raw ZDT data and we'll delete the history of the repo before making it public again.

geneorama commented 8 years ago

Turns out that it's actually tricky to use the raw ZDT data.

There's one trap (maybe two traps) that they moved mid season so joining on TRAP_NAME doesn't work. Adding SEASON_YEAR fixes most cases, but the only way that I know to do it exactly right is to use the foverlaps join in data.table, and I had a hard time getting that to work because there are so many missing dates in the raw trap data.

geneorama commented 8 years ago

It's possible to filter the trap data to get a clean merge. I added a function to do this.

##------------------------------------------------------------------------------
## Create a clean table of Oracle XY coordinates that can be merged by
## trap / season_year.
## The resulting columns will be called OX and OY for "ORACLE X" and "ORACLE Y"
##------------------------------------------------------------------------------

filter_oracle_trap_xy <- function(traps){

    ## Subset, aggregate, and rename
    oracle_xy <- traps[ , list(.N),
                        keyby = list(trap = TRAP_NAME,
                                     season_year = SEASON_YEAR,
                                     OX = TRAP_COORD_X,
                                     OY = TRAP_COORD_Y)]
    oracle_xy <- oracle_xy[ , .SD, .SDcols=-"N"]

    ################################################################################
    ## There are two locations in 2010 for trap T035, but there are no observations
    ## in the first location. Remove the first location from the oracle reference
    ## in order to avoid duplicates in the merge.
    ################################################################################
    # traps[TRAP_NAME=="T035"& SEASON_YEAR==2010 ]
    # dat[trap=="T035" & season_year==2010]
    oracle_xy <- oracle_xy[!(trap=="T035" & OX==1162883 & season_year==2010)]

    ################################################################################
    ## There are two locations in 2010 for this T076, but the trap only moved a few
    ## feet... so whoop dee doo
    ## Remove the first location from the oracle reference to avoid duplicates in
    ## the merge.  Also, note that it's tricky to remove because the location has a
    ## fractional amount that doesn't print by default, hence the "trunc" below.
    ################################################################################
    # traps[TRAP_NAME=="T076"& SEASON_YEAR==2010 ]
    # oracle_xy[(trap=="T076" & season_year==2010)]
    # oracle_xy[(trap=="T076" & season_year==2010), OX][1] - 1179913
    oracle_xy <- oracle_xy[!(trap=="T076" & season_year==2010 & trunc(OX)==1179913)]

    return(oracle_xy)
}

This solves the problem so long as the repo is private, but we don't want the highly accurate / precise xy coordinates online.

geneorama commented 7 years ago

The issue of multiple disguised locations appearing on the portal for a single location has been fixed.

Many locations are missing on the portal, and this is still an issue. The missing locations occur when the city's geocoder fails. This happens when the location is outside the city limits, which happens often with these traps because they are often located on the city's perimeter. In some cases they are actually outside the city limit.

It is possible that there are missing locations for other unidentified reasons.

geneorama commented 7 years ago

We want to improve the model's information by taking into account the trap locations in the model. The main goals is to inform the model of bias on a per trap basis. This may require some form of dimension reduction as there are a large number of traps.

Currently there are 189 traps on the data portal. Of those, 30 are not geocoded correctly for various reasons, usually because they are at the boundaries of the city and may fall outside the technical city limits (and our geocoder only works for City of Chicago addresses).

Using Oracle data would clear this problem up, and provide more complete data.

Comments issue based on analytics meeting on Nov 7, 2016

geneorama commented 7 years ago

This issue continued to be complicated... but now there's an object that has a complete list of locations that are deduplicated.

Surprisingly, there are several places where there are multiple traps, but they have very different performance. There is also one case of a trap that was apparently reclassified as a GRAVID trap from CDC.

For reference

The locations of all traps in XY coordinates (from Oracle) image

Near O'Hare there is one location (rounded to nearest foot) with 7 traps: "T917" "T918" "T920" "T921" "T923" "T924" "T925". I thought this might be a mistake, but they all seem to have different performance, so they're probably different.

          date trap tot samples samples_with_wnv
 1: 2009-06-01 T917   4       1                0
 2: 2009-07-01 T918 341      11                0
 3: 2009-08-01 T918  46       3                0
 4: 2009-08-01 T920  27       3                0
 5: 2009-09-01 T918  32       4                0
 6: 2009-09-01 T920  30       5                0
 7: 2009-10-01 T918   6       2                0
 8: 2010-06-01 T918   8       3                0
 9: 2010-07-01 T918 435      12                0
10: 2010-07-01 T920   2       2                0
11: 2010-08-01 T921  25       5                0
12: 2010-09-01 T920   2       1                0
13: 2010-09-01 T921  20       6                0
14: 2010-10-01 T918  11       2                0
15: 2011-06-01 T918   1       1                0
16: 2011-07-01 T918 290       7                0
17: 2011-09-01 T918   7       3                0
18: 2012-06-01 T918   8       1                0
19: 2012-07-01 T918  67       7                0
20: 2012-08-01 T918  80       6                6
21: 2012-09-01 T918  30       3                0
22: 2013-06-01 T921  18       2                0
23: 2013-06-01 T923  56       4                0
24: 2013-06-01 T924  61       4                0
25: 2013-06-01 T925   6       3                0
26: 2013-07-01 T921  14       4                0
27: 2013-07-01 T923  64       5                0
28: 2013-07-01 T924  64       5                0
29: 2013-07-01 T925  16       3                0
30: 2013-08-01 T921  53       2                0
31: 2013-08-01 T923  97       7                0
32: 2013-08-01 T924 137       6                0
33: 2013-08-01 T925  14       5                0
34: 2013-09-01 T920  38       3                0
35: 2013-09-01 T923  91       4                0
36: 2013-09-01 T924  21       3                0
37: 2013-09-01 T925  12       2                0
38: 2014-06-01 T923  88       4                0
39: 2014-06-01 T924  52       3                0
40: 2014-07-01 T923  92       7                0
41: 2014-07-01 T924  48       6                0
42: 2014-07-01 T925 128       7                0
43: 2014-08-01 T923 107       7                0
44: 2014-08-01 T924  35       7                0
45: 2014-08-01 T925  26       6                0
46: 2014-09-01 T923  46       9                0
47: 2014-09-01 T924  17       8                0
48: 2014-09-01 T925  12       6                0
49: 2015-06-01 T923 114       4                0
50: 2015-06-01 T924 182       7                0
51: 2015-06-01 T925  14       3                0
52: 2015-07-01 T923 170       8                0
53: 2015-07-01 T924  14       4                0
54: 2015-07-01 T925 376      14                0
55: 2015-08-01 T923 334       8                0
56: 2015-08-01 T924  23       6                0
57: 2015-08-01 T925 530      15               15
58: 2015-09-01 T923 176      12                0
59: 2015-09-01 T924  22       5                0
60: 2015-09-01 T925  84       7                7
61: 2016-06-01 T918  55       3                0
62: 2016-07-01 T918 139       8                0
63: 2016-08-01 T917 319       7                7
64: 2016-08-01 T918  77       2                2
          date trap tot samples samples_with_wnv

There remain significant differences between the locations shown online and the locations in oracle

image

For reference, this is the plot for all the differences between all the traps in Chicago: image