Closed geneorama closed 7 years ago
Per our discussion earlier, let's keep track of this but fix the data issues through @levyj. Flag any issues here and be sure to @ mention him. Right now, just use the raw ZDT data and we'll delete the history of the repo before making it public again.
Turns out that it's actually tricky to use the raw ZDT data.
There's one trap (maybe two traps) that they moved mid season so joining on TRAP_NAME doesn't work. Adding SEASON_YEAR fixes most cases, but the only way that I know to do it exactly right is to use the foverlaps
join in data.table
, and I had a hard time getting that to work because there are so many missing dates in the raw trap data.
It's possible to filter the trap data to get a clean merge. I added a function to do this.
##------------------------------------------------------------------------------
## Create a clean table of Oracle XY coordinates that can be merged by
## trap / season_year.
## The resulting columns will be called OX and OY for "ORACLE X" and "ORACLE Y"
##------------------------------------------------------------------------------
filter_oracle_trap_xy <- function(traps){
## Subset, aggregate, and rename
oracle_xy <- traps[ , list(.N),
keyby = list(trap = TRAP_NAME,
season_year = SEASON_YEAR,
OX = TRAP_COORD_X,
OY = TRAP_COORD_Y)]
oracle_xy <- oracle_xy[ , .SD, .SDcols=-"N"]
################################################################################
## There are two locations in 2010 for trap T035, but there are no observations
## in the first location. Remove the first location from the oracle reference
## in order to avoid duplicates in the merge.
################################################################################
# traps[TRAP_NAME=="T035"& SEASON_YEAR==2010 ]
# dat[trap=="T035" & season_year==2010]
oracle_xy <- oracle_xy[!(trap=="T035" & OX==1162883 & season_year==2010)]
################################################################################
## There are two locations in 2010 for this T076, but the trap only moved a few
## feet... so whoop dee doo
## Remove the first location from the oracle reference to avoid duplicates in
## the merge. Also, note that it's tricky to remove because the location has a
## fractional amount that doesn't print by default, hence the "trunc" below.
################################################################################
# traps[TRAP_NAME=="T076"& SEASON_YEAR==2010 ]
# oracle_xy[(trap=="T076" & season_year==2010)]
# oracle_xy[(trap=="T076" & season_year==2010), OX][1] - 1179913
oracle_xy <- oracle_xy[!(trap=="T076" & season_year==2010 & trunc(OX)==1179913)]
return(oracle_xy)
}
This solves the problem so long as the repo is private, but we don't want the highly accurate / precise xy coordinates online.
The issue of multiple disguised locations appearing on the portal for a single location has been fixed.
Many locations are missing on the portal, and this is still an issue. The missing locations occur when the city's geocoder fails. This happens when the location is outside the city limits, which happens often with these traps because they are often located on the city's perimeter. In some cases they are actually outside the city limit.
It is possible that there are missing locations for other unidentified reasons.
We want to improve the model's information by taking into account the trap locations in the model. The main goals is to inform the model of bias on a per trap basis. This may require some form of dimension reduction as there are a large number of traps.
Currently there are 189 traps on the data portal. Of those, 30 are not geocoded correctly for various reasons, usually because they are at the boundaries of the city and may fall outside the technical city limits (and our geocoder only works for City of Chicago addresses).
Using Oracle data would clear this problem up, and provide more complete data.
Comments issue based on analytics meeting on Nov 7, 2016
This issue continued to be complicated... but now there's an object that has a complete list of locations that are deduplicated.
Surprisingly, there are several places where there are multiple traps, but they have very different performance. There is also one case of a trap that was apparently reclassified as a GRAVID trap from CDC.
For reference
The locations of all traps in XY coordinates (from Oracle)
Near O'Hare there is one location (rounded to nearest foot) with 7 traps: "T917" "T918" "T920" "T921" "T923" "T924" "T925". I thought this might be a mistake, but they all seem to have different performance, so they're probably different.
date trap tot samples samples_with_wnv
1: 2009-06-01 T917 4 1 0
2: 2009-07-01 T918 341 11 0
3: 2009-08-01 T918 46 3 0
4: 2009-08-01 T920 27 3 0
5: 2009-09-01 T918 32 4 0
6: 2009-09-01 T920 30 5 0
7: 2009-10-01 T918 6 2 0
8: 2010-06-01 T918 8 3 0
9: 2010-07-01 T918 435 12 0
10: 2010-07-01 T920 2 2 0
11: 2010-08-01 T921 25 5 0
12: 2010-09-01 T920 2 1 0
13: 2010-09-01 T921 20 6 0
14: 2010-10-01 T918 11 2 0
15: 2011-06-01 T918 1 1 0
16: 2011-07-01 T918 290 7 0
17: 2011-09-01 T918 7 3 0
18: 2012-06-01 T918 8 1 0
19: 2012-07-01 T918 67 7 0
20: 2012-08-01 T918 80 6 6
21: 2012-09-01 T918 30 3 0
22: 2013-06-01 T921 18 2 0
23: 2013-06-01 T923 56 4 0
24: 2013-06-01 T924 61 4 0
25: 2013-06-01 T925 6 3 0
26: 2013-07-01 T921 14 4 0
27: 2013-07-01 T923 64 5 0
28: 2013-07-01 T924 64 5 0
29: 2013-07-01 T925 16 3 0
30: 2013-08-01 T921 53 2 0
31: 2013-08-01 T923 97 7 0
32: 2013-08-01 T924 137 6 0
33: 2013-08-01 T925 14 5 0
34: 2013-09-01 T920 38 3 0
35: 2013-09-01 T923 91 4 0
36: 2013-09-01 T924 21 3 0
37: 2013-09-01 T925 12 2 0
38: 2014-06-01 T923 88 4 0
39: 2014-06-01 T924 52 3 0
40: 2014-07-01 T923 92 7 0
41: 2014-07-01 T924 48 6 0
42: 2014-07-01 T925 128 7 0
43: 2014-08-01 T923 107 7 0
44: 2014-08-01 T924 35 7 0
45: 2014-08-01 T925 26 6 0
46: 2014-09-01 T923 46 9 0
47: 2014-09-01 T924 17 8 0
48: 2014-09-01 T925 12 6 0
49: 2015-06-01 T923 114 4 0
50: 2015-06-01 T924 182 7 0
51: 2015-06-01 T925 14 3 0
52: 2015-07-01 T923 170 8 0
53: 2015-07-01 T924 14 4 0
54: 2015-07-01 T925 376 14 0
55: 2015-08-01 T923 334 8 0
56: 2015-08-01 T924 23 6 0
57: 2015-08-01 T925 530 15 15
58: 2015-09-01 T923 176 12 0
59: 2015-09-01 T924 22 5 0
60: 2015-09-01 T925 84 7 7
61: 2016-06-01 T918 55 3 0
62: 2016-07-01 T918 139 8 0
63: 2016-08-01 T917 319 7 7
64: 2016-08-01 T918 77 2 2
date trap tot samples samples_with_wnv
There remain significant differences between the locations shown online and the locations in oracle
For reference, this is the plot for all the differences between all the traps in Chicago:
Currently we have several missing trap locations, and trap locations.
We have coordinates for all of the locations in our private database, but they may contain personal information such as "XYZ Household" with an exact address. To avoid releasing private data, we round the locations to a block level, and geocode the rounded location. However, this is problematic when the rounded location falls outside the city boundary (and the geocoder fails) and there may be other problems.
The current proposal is to create a table of locations that is manually derived as such:
Use the location coordinates when available from the public data When the public data is missing I'll add coordinates manually that correspond to a rounded address that I will manually geocode A better long term solution would be to include a rounded location in the private database which can be directly used for each trap.