josiahseaman closed this issue 10 years ago.
The IterationByUnit isn't output right now, but I will create a module to write it as a supplemental output.
Zones can be derived by knowing the radius and the events that can trigger zones. Currently the detection and tracing processes can trigger zones (see the zones section in the model spec). If it's more straightforward, we could also just write a CSV file to supplemental outputs containing the run, day, lat, and long of every zone circle created.
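If we go the CSV route, here's a minimal sketch, assuming a hypothetical `zone_circles` iterable of (run, day, lat, long, radius) tuples and an existing supplemental-outputs folder:
<pre>
import csv
import os

def write_zone_circles_csv(zone_circles, output_dir):
    """Write one row per zone circle created during the simulation.

    `zone_circles` is assumed to be an iterable of
    (run, day, latitude, longitude, radius_km) tuples; where that data
    actually comes from is left open here.
    """
    path = os.path.join(output_dir, 'zone_circles.csv')
    with open(path, 'wb') as f:  # 'wb' because this project is on Python 2.7
        writer = csv.writer(f)
        writer.writerow(['run', 'day', 'latitude', 'longitude', 'radius_km'])
        for row in zone_circles:
            writer.writerow(row)
    return path
</pre>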
@josiahseaman I have assumed the Results_unitstats table will be populated with the units. Let me know if that's not a safe assumption and I can switch to "update or insert" statements instead of "update" statements.
Oh derp. Yes, you're correct, that's what should have been done. I'll implement that now.
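For reference, a minimal sketch of the update-or-insert change, using `get_or_create` plus `F()` expressions; the import path, the `unit` variable, and the daily increment variables are assumptions based on the field names shown below:
<pre>
from django.db.models import F
from Results.models import UnitStats  # assumed import path

# Create the row on first touch (counters are assumed to default to 0),
# then increment atomically so concurrent writers don't clobber each other.
stats, created = UnitStats.objects.get_or_create(unit=unit)
UnitStats.objects.filter(pk=stats.pk).update(
    cumulative_infected=F('cumulative_infected') + newly_infected,
    cumulative_vaccinated=F('cumulative_vaccinated') + newly_vaccinated,
)
</pre>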
The good news is that the stats are hitting the database:
<pre>
>>> model_to_dict(UnitStats.objects.filter(cumulative_vaccinated__gte=1)[11])
{'cumulative_vaccinated': 2, 'cumulative_zone_focus': 20, 'cumulative_infected': 20, 'cumulative_destroyed': 0, u'id': 299, 'unit': 300}
</pre>
Bad news is we're definitely getting write lock contention. I proposed a solution in #150.
I commented on #150, but can comment here also. I think we can easily split the database if needed.
Amy D and I were chatting. We want to make sure we can access the image that is created in a high-resolution format, so that it doesn't have to be recreated in Arc or something.
Also, we need legends and a scale so we can understand what we are seeing.
And she really likes it!
Also, will the background have some basic identifiers (state boundary, county boundary) so we understand where in the universe we are?
To your most recent question, the short answer for now is "no", because that information is not in the Population file the user uploads. We can put line-drawing overlays in the "Future Features" milestone. I still maintain we'd want to do that kind of geospatial markup in an existing API like Google Maps.
I was just going to start on the drawing portion of this feature. I think this is more than matplotlib will handle gracefully. I'm considering drawing it from scratch. For my first iteration, I'd prefer not to learn a set of GIS tools right off the bat.
And thanks for the compliment on the map idea. I watched a talk on how most people just rush to the first solution and was inspired to take a while to reflect on what we actually need to know about a simulation. I'm still not sure how I'm going to overlay 3 different stats (# infected, # destroyed, # vaccinated) all on one unit. I'll start with just infections and see what I can manage.
Doing stress tests for feasibility with mpld3 interactive graphing. Using the Tooltip demo because it has circles of varying size.
10,000 data points returns within 5 seconds:
100,000 data points took 10 seconds to respond, a full minute for the browser to render. And it's got about a 1 second lag on mouse tips. It takes about 14 seconds to re-render (pan, zoom). Not the best, but it's within an order of magnitude of what we'll want. I'll see if I can work with it:
This is going well. I've got lat, long, and name working. Colored by Production Type I can finally see how many pigs vs cattle there are in this scenario:
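For reference, the stress test is roughly the following mpld3 sketch; the attribute names on the Unit objects (latitude, longitude, production_type, unit_id) are placeholders for whatever the model actually exposes:
<pre>
import matplotlib.pyplot as plt
import mpld3
from mpld3 import plugins

def population_scatter(units):
    """Interactive lat/long scatter, colored by production type,
    with a name tooltip on each point."""
    fig, ax = plt.subplots(figsize=(10, 10))
    prod_types = sorted({u.production_type for u in units})
    color_index = {pt: i for i, pt in enumerate(prod_types)}
    points = ax.scatter([u.longitude for u in units],
                        [u.latitude for u in units],
                        c=[color_index[u.production_type] for u in units],
                        s=20, alpha=0.6, cmap='Set1')
    labels = ['%s (%s)' % (u.unit_id, u.production_type) for u in units]
    plugins.connect(fig, plugins.PointLabelTooltip(points, labels=labels))
    return mpld3.fig_to_html(fig)
</pre>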
@missyschoenbaum @ndh2 The good news is that I've now got it coloring by number of times infected.
The bad news is that I think my sample scenario is pretty bogus. It looks like 90% of Units on the map get infected in all 5 iterations. I'm planning on re-running a set of iterations given all the changes we've made lately. I think the parameters that I copied from the NAADSM Sample Scenario aren't realistic. Can someone tell me values that are at least in the right order of magnitude for "Contacts per day" and "Infection probability"? I can also disable airborne spread.
I know I can't get real numbers, but it's a bit hard to notice small errors when my map is a smear of red. Any help possible would be appreciated.
I'm going to run a new simulation given our latest field changes and the more realistic parameters I was given. This will be the new "Roundtrip" and "Full Run" scenarios.
Now here's a more interesting result: I ended up using a contact rate of 0.4 just because it yielded interesting-looking results.
@missyschoenbaum Now seems like a good time to talk about performance characteristics. I'm trying to decide if I should switch to a different technology before I put any more work into this. This is based on the Sample Scenario and the 6-state scenario with 400,000 Units. Here are the facts:
I think our reasonable options are:
@ndh2 chime in if you have an opinion.
It is also possible that we could find Bokeh is the bleeding edge.
This is a task where I could accept a quick implementation that gets this version up and running, and plan to use our extended data visualization time to try one of the more time-intensive solutions.
We know that the 400K population would crash the model file in old NAADSM. Analysts haven't had the expectation that they are going to get to see a pop that big.
I guess that means I vote for Option 1, with the plan to look into the other options for our next release.
Having a map that pans and zooms will be great. A lot of simulations produce a few "clusters" so being able to focus in on those is handy.
A thought about large data sets: would it help to render the locations of the whole population as a static background image and only have interactive elements where something happened (infection, vaccination, etc.)? It might work as a middle-ground kind of solution for large data sets.
That's a brilliant solution. It will still take 30 seconds for matplotlib to render the 400k population, and I'll have to do some programming to verify the coordinates match, so we'll do a couple of iterations of this feature. Right now I just say: if the population is over 20k, use the fast non-interactive solution. I was going to tackle rendering zones or destruction/vaccination rings next.
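The switch is nothing fancier than a population-size threshold; a sketch, with the two renderer function names as placeholders for the real code paths:
<pre>
INTERACTIVE_UNIT_LIMIT = 20000  # above this, skip the interactive SVG path

def render_population_map(units):
    if len(units) > INTERACTIVE_UNIT_LIMIT:
        return render_static_png(units)      # fast matplotlib -> PNG path
    return render_interactive_mpld3(units)   # mpld3 HTML/SVG path
</pre>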
An update on Bokeh: the part of Bokeh that is feature-equivalent with mpld3 is a mature product and ready to roll out the door. The part of the project that is focused on big data is called Abstract Rendering, and I can't tell if it's ready for prime time or not. This page looks good, and their example for hdalpha describes exactly what I'd like to do:
The hdalpha recipe is useful for scatterplots with multiple categories or geo-located event data where events are of different types. In the hdalpha recipe, categories are binned separately and a color ramp is made for each category. Additionally, the composition between categories is also controlled to prevent over-saturation.
I'd like to look into this more later.
Still working on a Zone solution: http://stackoverflow.com/questions/9081553/python-scatter-plot-size-and-style-of-the-marker
I fixed the color normalization so that a Unit that is infected rarely (down to 0.1%) is colored differently than a Unit that was never infected in any iteration. This seems to be rendering a bit slower now.
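The trick, roughly, is matplotlib's `set_under`: give exactly-zero frequencies their own color and start the color ramp just above zero (the specific colormap here is only illustrative):
<pre>
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

cmap = plt.get_cmap('Reds')
cmap.set_under('lightgrey')        # never-infected Units get a neutral color
# vmin sits just above zero, so even a 0.1% infection frequency lands on the ramp
norm = Normalize(vmin=0.001, vmax=1.0)

# later: ax.scatter(lons, lats, c=frequencies, cmap=cmap, norm=norm)
</pre>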
I've figured out the basics of drawing zone circles that are fixed to the GIS grid instead of the display grid, and I've figured out how to put those circles behind the Units. What happens is that the Units are a fixed number of pixels, so they take up a lot of the screen when you're zoomed out.
Then, as you zoom in, the Units stay the same number of pixels while the map grows, and they become spaced out. The zones are fixed to an actual GIS size (5km in this case), so their apparent size grows as you zoom in.
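The distinction comes down to Circle patches in data (lat/long) coordinates versus scatter markers sized in screen points. A rough sketch, with the km-to-degrees conversion deliberately simplified:
<pre>
from matplotlib.patches import Circle

KM_PER_DEGREE_LAT = 111.0  # rough spherical-earth figure

def draw_zone(ax, lat, lon, radius_km):
    """Zone circle sized in map (GIS) units, drawn behind the Units."""
    radius_deg = radius_km / KM_PER_DEGREE_LAT  # ignores longitude distortion
    ax.add_patch(Circle((lon, lat), radius_deg,
                        facecolor='khaki', edgecolor='none',
                        zorder=1))  # low zorder keeps zones behind the scatter

def draw_units(ax, lons, lats, colors):
    """Unit markers sized in screen points, so they don't scale on zoom."""
    ax.scatter(lons, lats, c=colors, s=16, zorder=2)
</pre>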
This is still a very early proof of concept. Ideally, I'd like to paint this onto a static image and use it as the backdrop for the interactive Units. But, features first, optimization second.
@missyschoenbaum @bacorso This is getting me to wonder about my Sample Scenario again. Zones seem really tiny in comparison to the spacing of Units. 5km is probably not a realistic zone size; based on what I'm seeing here, it will grab only 0-2 nearest neighbors. I also noticed that the Sample Scenario is taller than it is wide. I'm using a simple latitude conversion, and I suppose this data was generated well off the equator.
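If that's the cause, the usual quick fix short of a real projection is to stretch the axes by the cosine of the mean latitude, something like (with `ax` and `lats` coming from the surrounding plotting code):
<pre>
import numpy as np

# One degree of longitude covers cos(latitude) times the ground distance of
# one degree of latitude, so compensate with the axes aspect ratio.
mean_lat = np.mean(lats)
ax.set_aspect(1.0 / np.cos(np.radians(mean_lat)))
</pre>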
Barbara is off circling the globe, so I don't know if she will answer. I just know the sample pop was set up to be intentionally and obviously fake. @ndh2 do you know more?
Also, I thought a static base population was a brilliant plan.
Alright, here's what it looks like with z-ordering, shaded by how often a Unit is a zone focus. It's interesting that you can spot Units that were never infected but were a zone focus (forward tracing?), as well as infected Units that never became a focus, probably because they were never detected.
I don't like the spotty look of the zones, but that's an accurate representation of what the simulation is doing: many separate small zones. I had pictured zones more like an expanding blob of quarantine tape. Set at a 15km radius, this is more the look I was going for.
I'm going to take a coding break to figure out my color schemes as I add more data layers.
I now have black rings to indicate destruction of the Unit. This is really fun.
This is getting me to really look at how the program works in a lot more detail. I need to figure out how to represent vaccination (green) and I was hoping to use lines of causality connecting Units. There are ring vaccinations, but I noticed there seems to be no link between tracing and vaccination. @ndh2 Did I miss something?
Destruction has:
Destruction is a ring target: Indicates if units of this production type will be subject to preemptive ring destruction.
Destroy direct forward traces: Indicates if units of this type identified by trace forward of indirect contacts will be subject to preemptive destruction.
Destroy indirect forward traces
Destroy direct back traces
Destroy indirect back traces
But vaccination only has:
Trigger vaccination ring: Indicates if detection of a clinical unit of this type will trigger a vaccination ring.
Is vaccination by tracing a V33 feature?
No, traces don't trigger vaccination, either in the current version or in any proposed changes.
Alright, the cause really affects how I want to visualize this on a map. So I see that there are two causes for vaccination: 1) detection 2) ring vaccination from a nearby detected unit. Now I also see that detection is assisted by: 1) Zone surveillance, 2) Lab tests, 3) exams. All three of which can be brought on by tracing. So it seems to go Tracing > 3 investigations > detection > vaccination. But there's no path where a unit will be vaccinated because of a distant trace until an infection is verified. Did I get all that right?
That's right. There's only 1 direct way to cause vaccination (detection), but several other processes can create or speed up detections.
Each unit with non-zero stats will be represented as a square, divided horizontally into 3 bar graphs. The three statistics progress left to right in chronological order: 1) infection (red), 2) vaccination (green), 3) destruction (black).
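A sketch of the glyph itself, built from three Rectangle patches packed into one square; scaling the bar height to iteration counts is an assumption about how the final glyph will read:
<pre>
from matplotlib.patches import Rectangle

def triple_bar_glyph(ax, x, y, size, infected, vaccinated, destroyed, max_count):
    """One square per Unit, split into three vertical bars: infection (red),
    vaccination (green), destruction (black), left to right in chronological
    order. Bar height is the fraction of iterations in which the event occurred."""
    bar_width = size / 3.0
    stats = [(infected, 'red'), (vaccinated, 'green'), (destroyed, 'black')]
    for i, (count, color) in enumerate(stats):
        height = size * count / float(max_count) if max_count else 0.0
        ax.add_patch(Rectangle((x + i * bar_width, y), bar_width, height,
                               facecolor=color, edgecolor='none'))
</pre>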
The triple bar graph is working as intended, and it definitely shows me a lot more information about the state transitions that each Unit goes through. Once I added this graph, it became apparent that vaccinations were never happening, even though they were turned on, because Ring Destruction was enabled but not Ring Vaccination. I flipped my parameters and re-ran the scenario with these results. Now I can see when Units are being preemptively vaccinated or destroyed and when one state blocks the other.
The issue is that as I add new features the map load time is getting slower and slower. SVG isn't really designed to handle as many shapes as I'm throwing at it. I'm going to look into using matplotlib to render a large static image and then wrap that in some frame that can zoom and pan.
My understanding of the model is increasing as I work with this visualization. I spotted a number of Units where there was an infection, but no destruction or vaccination, even though destruction is selected. Am I correct in thinking that these are infections that progressed completely without ever being detected?
Final report for the day: I've got a very large static image rendering that shows all the necessary information without the need for interactivity. I ran 1,000 iterations and was just looking at the outputs.
The majority of the map area is taken up by the 1% of cases that spread out of control. Even though 50% of the time the infection never takes off from the source unit, the total possible coverage is quite large. So there's a question of emphasis here: is it correct to visually emphasize the 1% worst scenarios, or to emphasize that half the scenarios show 100% containment on day 1? If it is better to emphasize the nasty edge cases, we may want to consider using a log scale, because otherwise the single-pixel red/green/black line is a bit hard to pick out.
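For reference, the log-scale option is a small change on the matplotlib side, roughly:
<pre>
from matplotlib.colors import LogNorm

# A log scale stretches out the 0.1%-10% range where most Units sit.
# Note that LogNorm masks zero values, so never-infected Units would need
# cmap.set_bad() (or pre-filtering) rather than the set_under() trick above.
norm = LogNorm(vmin=0.001, vmax=1.0)
# later: ax.scatter(lons, lats, c=frequencies, cmap=cmap, norm=norm)
</pre>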
For the question above: it is possible to have an infection go undetected. It all depends on the parameters: a combination of short infectious clinical period and low detection probability is where you're likely to see that.
Now that I have a large static image, I can verify it renders 5x faster for small populations and, more importantly, scales to large populations without crashing. So now I need to find an image zoom utility so that I can embed a frame in the Results Home and allow the user to zoom and pan on the very detailed image. Here's what I found:
http://openseadragon.github.io/
This cool demo depends on computing images using:
https://github.com/openzoom/deepzoom.py
My concern is that the image conversion process is going to add compute time on top of the image generation and kill any scalability I was hoping to gain. The trouble is I can't know that until I try it out. I can always fall back on the browser's innate zooming ability if this doesn't work.
Or maybe I'll just go with the much simpler jQuery Zoom Plugin with mouseover.
Here's jquery zoom, nice and simple.
Here my mouse is over the "small image". It's set to show a zoom cursor, though it's not showing for some reason.
When I click on the image, it zooms in and locks pan to my mouse position. I can free my mouse by clicking again.
Next, I'm going to figure out better styling/layout and then profile the python to figure out how to speed the process up.
Nice!
For the sake of posterity, here are the profiling results (because this issue thread wasn't long enough):
<pre> 4898450 function calls (4832323 primitive calls) in 30.235 seconds
Ordered by: internal time, call count
ncalls tottime percall cumtime percall filename:lineno(function)
1 3.238 3.238 6.592 6.592 c:\python27\lib\site-packages\matplotlib\backends\backend_agg.py:504(print_png)
113000 1.809 0.000 2.638 0.000 c:\python27\lib\site-packages\matplotlib\transforms.py:81(__init__)
3295 1.566 0.000 1.862 0.001 c:\python27\lib\site-packages\matplotlib\artist.py:883(get_aliases)
7917 1.219 0.000 1.426 0.000 c:\python27\lib\site-packages\django\db\backends\sqlite3\base.py:446(execute)
27701 0.797 0.000 1.837 0.000 c:\python27\lib\site-packages\django\db\models\sql\query.py:213(clone)
113000 0.653 0.000 0.829 0.000 c:\python27\lib\weakref.py:47(__init__)
3285 0.508 0.000 0.956 0.000 c:\python27\lib\site-packages\matplotlib\backends\backend_agg.py:122(draw_path)
13114 0.418 0.000 0.451 0.000 c:\python27\lib\site-packages\matplotlib\transforms.py:1790(rotate)
24148 0.410 0.000 0.527 0.000 c:\python27\lib\site-packages\matplotlib\transforms.py:1837(translate)
7917 0.372 0.000 0.456 0.000 c:\python27\lib\site-packages\django\db\backends\__init__.py:839(last_executed_query)
71236 0.355 0.000 0.355 0.000 c:\python27\lib\site-packages\django\utils\datastructures.py:127(__init__)
19024 0.348 0.000 0.348 0.000 c:\python27\lib\site-packages\numpy\core\_methods.py:35(_all)
7916 0.346 0.000 2.378 0.000 c:\python27\lib\site-packages\django\db\models\sql\compiler.py:64(as_sql)
7915 0.329 0.000 0.650 0.000 c:\python27\lib\site-packages\django\db\models\sql\compiler.py:255(get_default_columns)
21625/9195 0.314 0.000 0.952 0.000 c:\python27\lib\site-packages\matplotlib\transforms.py:2233(get_affine)
11060 0.306 0.000 5.205 0.000 c:\python27\lib\site-packages\matplotlib\patches.py:559(_update_patch_transform)
15871 0.303 0.000 0.316 0.000 c:\python27\lib\site-packages\django\db\utils.py:104(inner)
277198 0.296 0.000 0.296 0.000 c:\python27\lib\site-packages\matplotlib\artist.py:973(is_alias)
14426 0.292 0.000 0.501 0.000 c:\python27\lib\site-packages\matplotlib\transforms.py:766(__init__)
11871 0.290 0.000 0.327 0.000 c:\python27\lib\site-packages\django\db\models\base.py:325(__init__)
83220 0.287 0.000 0.616 0.000 c:\python27\lib\weakref.py:98(__setitem__)
47202 0.272 0.000 0.888 0.000 c:\python27\lib\site-packages\matplotlib\transforms.py:157(set_children)
65211 0.264 0.000 0.264 0.000 c:\python27\lib\site-packages\matplotlib\cbook.py:659(iterable)
19786 0.259 0.000 6.513 0.000 c:\python27\lib\site-packages\django\db\models\query.py:160(iterator)
7917 0.227 0.000 2.279 0.000 c:\python27\lib\site-packages\django\db\backends\util.py:66(execute)
32710 0.222 0.000 1.224 0.000 c:\python27\lib\site-packages\matplotlib\transforms.py:2134(__init__)
27701 0.221 0.000 2.153 0.000 c:\python27\lib\site-packages\django\db\models\query.py:837(_clone)
7914 0.220 0.000 1.314 0.000 c:\python27\lib\site-packages\django\db\models\sql\query.py:1008(build_filter)
14850 0.216 0.000 0.331 0.000 c:\python27\lib\site-packages\matplotlib\colors.py:326(to_rgba)
39234/39228 0.212 0.000 0.392 0.000 c:\python27\lib\site-packages\matplotlib\units.py:121(get_converter)
7916 0.207 0.000 0.207 0.000 c:\python27\lib\site-packages\django\db\backends\sqlite3\base.py:456(convert_query)
13359 0.201 0.000 0.201 0.000 c:\python27\lib\site-packages\numpy\lib\twodim_base.py:170(eye)
19785 0.200 0.000 6.148 0.000 c:\python27\lib\site-packages\django\db\models\fields\related.py:183(__get__)
71236 0.190 0.000 0.190 0.000 c:\python27\lib\site-packages\django\utils\datastructures.py:122(__new__)
55402 0.188 0.000 0.360 0.000 c:\python27\lib\site-packages\django\db\models\sql\where.py:292(clone)</pre>
This second profile is run with all of the matplotlib code commented out.
<pre> 2329795 function calls (2321268 primitive calls) in 12.504 seconds
Ordered by: internal time, call count
ncalls tottime percall cumtime percall filename:lineno(function)
7916 1.221 0.000 1.428 0.000 c:\python27\lib\site-packages\django\db\backends\sqlite3\base.py:446(execute)
27701 0.792 0.000 1.808 0.000 c:\python27\lib\site-packages\django\db\models\sql\query.py:213(clone)
7916 0.457 0.000 0.537 0.000 c:\python27\lib\site-packages\django\db\backends\__init__.py:839(last_executed_query)
71242 0.348 0.000 0.348 0.000 c:\python27\lib\site-packages\django\utils\datastructures.py:127(__init__)
7916 0.341 0.000 2.338 0.000 c:\python27\lib\site-packages\django\db\models\sql\compiler.py:64(as_sql)
7915 0.323 0.000 0.639 0.000 c:\python27\lib\site-packages\django\db\models\sql\compiler.py:255(get_default_columns)
15871 0.311 0.000 0.323 0.000 c:\python27\lib\site-packages\django\db\utils.py:104(inner)
11871 0.286 0.000 0.323 0.000 c:\python27\lib\site-packages\django\db\models\base.py:325(__init__)
19786 0.261 0.000 6.544 0.000 c:\python27\lib\site-packages\django\db\models\query.py:160(iterator)
7916 0.220 0.000 2.358 0.000 c:\python27\lib\site-packages\django\db\backends\util.py:66(execute)
27701 0.219 0.000 2.119 0.000 c:\python27\lib\site-packages\django\db\models\query.py:837(_clone)
7914 0.215 0.000 1.291 0.000 c:\python27\lib\site-packages\django\db\models\sql\query.py:1008(build_filter)
7916 0.207 0.000 0.207 0.000 c:\python27\lib\site-packages\django\db\backends\sqlite3\base.py:456(convert_query)
19785 0.201 0.000 6.149 0.000 c:\python27\lib\site-packages\django\db\models\fields\related.py:183(__get__)
71242 0.186 0.000 0.186 0.000 c:\python27\lib\site-packages\django\utils\datastructures.py:122(__new__)
3957 0.180 0.000 5.880 0.001 c:\python27\lib\site-packages\django\db\models\fields\related.py:287(__get__)
55402 0.180 0.000 0.349 0.000 c:\python27\lib\site-packages\django\db\models\sql\where.py:292(clone)
7916 0.177 0.000 0.315 0.000 c:\python27\lib\site-packages\django\db\models\sql\query.py:105(__init__)
7914 0.170 0.000 0.676 0.000 c:\python27\lib\site-packages\django\db\models\sql\where.py:166(make_atom)
51454 0.161 0.000 0.161 0.000 c:\python27\lib\site-packages\django\db\backends\sqlite3\base.py:207(quote_name)
7916 0.152 0.000 5.615 0.001 c:\python27\lib\site-packages\django\db\models\sql\compiler.py:757(execute_sql)
55404 0.152 0.000 0.570 0.000 c:\python27\lib\site-packages\django\utils\datastructures.py:245(copy)
102890 0.147 0.000 0.147 0.000 c:\python27\lib\site-packages\django\utils\tree.py:18(__init__)
7916 0.143 0.000 0.841 0.000 c:\python27\lib\site-packages\django\db\models\sql\compiler.py:173(get_columns)
1 0.142 0.142 12.417 12.417 c:\users\josiah\documents\spreadmodel\results\interactive_graphing.py:110(population_results_map)
35617 0.140 0.000 0.455 0.000 c:\python27\lib\site-packages\django\db\models\query.py:34(__init__)
7914 0.136 0.000 0.361 0.000 c:\python27\lib\site-packages\django\db\models\sql\where.py:355(process)
27699 0.127 0.000 0.367 0.000 c:\python27\lib\site-packages\django\db\models\sql\where.py:49(_prepare_data)
11871 0.126 0.000 0.182 0.000 c:\python27\lib\site-packages\django\db\models\query_utils.py:43(__init__)
11871 0.123 0.000 1.479 0.000 c:\python27\lib\site-packages\django\db\models\sql\query.py:1206(_add_q)
19785/11871 0.123 0.000 0.210 0.000 c:\python27\lib\site-packages\django\db\models\sql\query.py:1146(need_having)
27699 0.122 0.000 0.489 0.000 c:\python27\lib\site-packages\django\utils\tree.py:87(add)
59369 0.112 0.000 0.185 0.000 c:\python27\lib\site-packages\django\db\models\sql\compiler.py:48(quote_name_unless_alias)
7915 0.110 0.000 0.150 0.000 c:\python27\lib\site-packages\django\db\models\sql\query.py:1371(trim_joins)
7915 0.109 0.000 0.179 0.000 c:\python27\lib\site-packages\django\db\models\sql\query.py:1243(names_to_path)</pre>
This profile was discouraging. There's no single place I can fix to make everything run 10x faster. From these profiles we can conclude that Django/SQL/Python data management takes about 40% of the time and matplotlib rendering takes about 60%. The biggest slowdown in matplotlib is using add_patch artists, which I need to do to get the custom triple bar graph glyphs.
I'm first going to try optimizing the Django side and hopefully get that 12 seconds down to 3 seconds for a total speedup of around 30%. Not the huge win I want, but it's the best win that is still accessible without losing features.
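The Django time looks like it's dominated by the ~8,000 individual queries (one per Unit, judging by the `related.py:183(__get__)` entries), so the obvious first attempt is to pull everything with a single joined query; a sketch, with the latitude/longitude field names on Unit assumed:
<pre>
# One query with a JOIN instead of one query per Unit.
rows = UnitStats.objects.values_list(
    'unit__latitude', 'unit__longitude',
    'cumulative_infected', 'cumulative_vaccinated', 'cumulative_destroyed')
for lat, lon, infected, vaccinated, destroyed in rows:
    pass  # feed each tuple straight into the matplotlib drawing code
</pre>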
Actually, the biggest optimization will be caching the image at the end of the simulation run. I'll do that first.
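A sketch of the caching idea: render once when the simulation finishes, write the PNG to disk, and have the results view just serve the file if it's already there. The cache path and the `render_population_png()` helper (a thin wrapper around the existing matplotlib drawing code that returns raw PNG bytes) are hypothetical:
<pre>
import os
from django.http import HttpResponse

MAP_CACHE_DIR = 'workspace/map_cache'  # hypothetical location

def cached_population_map(request, scenario_name):
    path = os.path.join(MAP_CACHE_DIR, scenario_name + '.png')
    if not os.path.exists(path):
        if not os.path.exists(MAP_CACHE_DIR):
            os.makedirs(MAP_CACHE_DIR)
        with open(path, 'wb') as f:
            f.write(render_population_png())  # the slow render happens only once
    with open(path, 'rb') as f:
        return HttpResponse(f.read(), content_type='image/png')
</pre>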
It looks like the largest issue in my GitHub career has finally come to a close.
Zoomed in map for your viewing pleasure:
High Resolution example (click):
I think any further features or fixes should have their own linked Issue.
I've been thinking hard about what the Results Home page needs, because it's still obviously lacking after closing #164. The summary statistics on #180 don't seem to capture any of the real information about how often events occur, the long tail of the outbreak, or how much of the region was affected. In particular, they don't really say anything useful about the probability distribution. Does the whole thing go haywire 3% of the time? A median doesn't really communicate that. So after processing what has been said about the Population map and what we learned at Edward Tufte's class, I have a proposed solution.
The reason the Population map is amazing and popular is that it shows you everything about the units infected, which tells you everything else you need to know about spread and control activities. Controls are aimed at keeping the disease from spreading. The map communicates this in a spatial way that's really easy to understand, because that's what happens in the real world. The reason the Population map is terrible is that it animates and only shows you one iteration at a time. That's why it's a toy and no one really likes it.
What would be immensely informative is a map that shows how far the disease spread, in what percentage of scenarios, and whether or not the controls were effective at stopping it. We start off with a map of all the Unit positions, colored according to the percentage of iterations in which that Unit became infected. Secondarily, the background of the map is colored according to the percentage of iterations in which a zone was created in that area. Zones and Units will use the same color spectrum (red to yellow) to avoid confusion, but the Zones will be much lighter and in the background, so you can see Unit locations clearly.
Color palette: ['rgb(255,255,204)','rgb(255,237,160)','rgb(254,217,118)','rgb(254,178,76)','rgb(253,141,60)','rgb(252,78,42)','rgb(227,26,28)','rgb(189,0,38)','rgb(128,0,38)']
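That palette drops straight into a matplotlib colormap; a minimal sketch of the conversion:
<pre>
from matplotlib.colors import LinearSegmentedColormap

# The palette above, converted from 'rgb(r,g,b)' strings to 0-1 float triples.
PALETTE = ['rgb(255,255,204)', 'rgb(255,237,160)', 'rgb(254,217,118)',
           'rgb(254,178,76)', 'rgb(253,141,60)', 'rgb(252,78,42)',
           'rgb(227,26,28)', 'rgb(189,0,38)', 'rgb(128,0,38)']

def to_rgb(s):
    r, g, b = [int(v) for v in s[s.index('(') + 1:-1].split(',')]
    return (r / 255.0, g / 255.0, b / 255.0)

infection_cmap = LinearSegmentedColormap.from_list(
    'infection_frequency', [to_rgb(c) for c in PALETTE])
</pre>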
Implementation: