a-b-street / abstreet

Transportation planning and traffic simulation software for creating cities friendlier to walking, biking, and public transit
https://a-b-street.github.io/docs/
Apache License 2.0

Better data viz for throughput #85

Open dabreegster opened 4 years ago

dabreegster commented 4 years ago

Throughput is the number of people crossing a road or intersection over time. Check out TimeSeriesCount in sim/src/analytics.rs.

The raw data is too big to store for most scenarios, so there's very basic compression: per-hour counts per mode. But the way that's later plotted in the live line plots (info panels for lanes and intersections) looks strange, because as the live sim progresses, the bucket for the current hour fills up, only reaching its typical value at the end of the hour. It'd be better to store a more accurate shape of the throughput-over-time graph and make comparisons with that. I've tried using the lttb crate to do this, but the compressed shape only resembles the baseline by the end of the day.

That explanation was probably kind of nonsense, sorry. Basically, if you're interested in data viz, jump on this bug and I'll explain more clearly. :)

JavedNissar commented 4 years ago

Is TimeSeriesCount in sim/src/analytics.rs the only relevant struct?

dabreegster commented 4 years ago

Is TimeSeriesCount in sim/src/analytics.rs the only relevant struct?

Yes.

The bigger explanation of this issue:

As people move around the map, the simulation generates Event::AgentEntersTraversable. Elsewhere in analytics.rs, that event winds up calling record(), breaking down the type of traffic by mode. Then some info panel and layer code in the game later retrieves this information. We want to do two things with it:

1) Display a live-updated plot of people crossing a road/intersection. I have an extremely poor understanding of dataviz, so how do you generally display a count of events that occur at a specific moment in time? IIUC, you pick a window size, like say 30 mins, and then for any point along the X axis, you count the number of events that happened in the past 30 minutes. That's what the sliding Window struct is trying to do.

2) Compare the count between the prebaked baseline Analytics (that's generated by running the simulation for a full day with no edits) against the live Analytics (that's incrementally built up as the player simulates with some edits). If a road has more or less traffic at some time, we want to plot that and also feed into the Throughput::compare_throughput layer. If you imagine having that nice line plot of count-vs-time from the 1st thing, then conceptually what we want to do here is subtract the two lines, so you can see from 3-6am, the count was exactly the same, but then there's 1000 more people crossing for a while.
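The sliding-window idea from the 1st point can be sketched like this. This is a minimal illustration of the technique, not the actual Window struct from sim/src/analytics.rs; all names here are hypothetical:

```rust
// Sketch of a sliding-window event counter: keep the timestamps of
// recent events, evict anything older than the window, and report the
// number of events remaining. Timestamps are plain seconds here.
use std::collections::VecDeque;

struct SlidingWindow {
    window_secs: u64,
    // Timestamps of events still inside the window, oldest first.
    events: VecDeque<u64>,
}

impl SlidingWindow {
    fn new(window_secs: u64) -> Self {
        SlidingWindow { window_secs, events: VecDeque::new() }
    }

    /// Record an event happening at time `now`.
    fn add(&mut self, now: u64) {
        self.events.push_back(now);
    }

    /// Evict expired events, then return the count over the last window.
    fn count(&mut self, now: u64) -> usize {
        while let Some(&t) = self.events.front() {
            if now.saturating_sub(t) > self.window_secs {
                self.events.pop_front();
            } else {
                break;
            }
        }
        self.events.len()
    }
}

fn main() {
    // A 30-minute window.
    let mut w = SlidingWindow::new(30 * 60);
    w.add(0);
    w.add(60);
    // Two minutes in, both events are still inside the window.
    assert_eq!(w.count(120), 2);
    // 50 minutes in, both have aged out.
    assert_eq!(w.count(50 * 60), 0);
}
```

Plotting `count(t)` as `t` advances gives the smooth count-vs-time line described above, at the cost of keeping every recent timestamp in memory.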

My first attempt at doing this was to store the raw events -- all of them. TimeSeriesCount has the raw field; you can change the if false to collect this again. IIRC, this worked great; comparing the exact count at any time was possible, because we could sum up the exact count at any moment.

But of course storing all the raw events took way too much space; the prebaked Analytics are bundled in the release, and I think for the two small maps, all the events increased the file size substantially. This won't scale as we get bigger maps running fully and want to include the prebaked file for them too.

So next I switched to the counts field. For every hour and the 4 modes, store the total for that hour. 24 * 4 counts for every road/intersection is way smaller. But this messes up both the plot and the comparison. The live plot basically turns into a weird bar plot. As you move from 3-4am, that bar rises up. Then as soon as you hit 4:01, a new bar starts at 0 and starts climbing. This is way less intuitive than the 30-minute sliding window thing.

It's a way bigger problem for comparing before/after counts. If you have measured 50 cars at 3:30am, but you're comparing to the entire 3-4am prebaked bucket -- which has, say, 200 -- then it looks like that road has less traffic than usual. By 3:59am, the comparison becomes valid, but then the problem starts over again for the 4-5am bucket.

One of my attempts to compromise between meaningful raw data and smaller hour-bucketed data was using https://crates.io/crates/lttb to downsample the raw data. https://github.com/dabreegster/abstreet/tree/lttb was the attempt. This didn't work -- if you downsample 100,000 points covering 24 hours, you get a nice line plot. If you downsample 1,000 points covering just one hour, that shape doesn't at all match up with the first 1/24th of the full line.

There might be a totally different approach to measuring, storing, and comparing throughputs. I don't understand the field of downsampling at all. End braindump

JavedNissar commented 4 years ago

Okay, I'm taking a look at the lttb branch and I'm encountering the following error:

Finished release [optimized] target(s) in 0.28s
Running `/Users/javednissar/Documents/Development/abstreet/target/release/game`
load map...
Loading map ../data/system/maps/montlake.bin
Reading ../data/system/maps/montlake.bin: 0/4 MB... 0.0000s

../data/system/maps/montlake.bin is missing or corrupt. Check https://github.com/dabreegster/abstreet/blob/master/docs/dev.md and file an issue if you have trouble.

invalid value: integer `512`, expected variant index 0 <= i < 7

Is this just me or do you see it as well?

dabreegster commented 4 years ago

The binary map format has changed; I'll rebase the branch against master

dabreegster commented 4 years ago

Rebase done. Keep in mind this quick experiment was applying lttb to the active agent plot, not any of the throughput stuff. You can see the same problem though: the red line is the live data, and it doesn't match up with the blue prebaked data.

Screenshot from 2020-07-11 20-24-26

JavedNissar commented 4 years ago

Hmm, it seems I'm still running into issues. Right now, I'm seeing errors with regard to fetching the data using the updater. Output below:

    Finished dev [unoptimized + debuginfo] target(s) in 0.20s
     Running `target/debug/updater`
> compute md5sum of data/system/maps/udistrict.bin
> compute md5sum of data/system/maps/downtown.bin
> compute md5sum of data/system/maps/west_seattle.bin
> compute md5sum of data/system/maps/ballard.bin
> compute md5sum of data/system/maps/montlake.bin
> compute md5sum of data/system/maps/lakeslice.bin
> compute md5sum of data/system/prebaked_results/montlake/car vs bike contention.bin
> compute md5sum of data/system/prebaked_results/montlake/weekday.bin
> compute md5sum of data/system/prebaked_results/lakeslice/weekday.bin
> compute md5sum of data/system/cities/seattle.bin
> compute md5sum of data/system/scenarios/udistrict/weekday.bin
> compute md5sum of data/system/scenarios/ballard/weekday.bin
> compute md5sum of data/system/scenarios/downtown/weekday.bin
> compute md5sum of data/system/scenarios/montlake/weekday.bin
> compute md5sum of data/system/scenarios/west_seattle/weekday.bin
> compute md5sum of data/system/scenarios/lakeslice/weekday.bin
> download https://www.dropbox.com/s/too29h0nr6s64wm/ballard.bin.zip?dl=1 to tmp_download.zip
error getting https://www.dropbox.com/s/too29h0nr6s64wm/ballard.bin.zip?dl=1: HTTP status client error (404 Not Found) for url (https://www.dropbox.com/s/dl/too29h0nr6s64wm/ballard.bin.zip), continuing
> rm tmp_download.zip
> download https://www.dropbox.com/s/8s6pjo91spchkfx/downtown.bin.zip?dl=1 to tmp_download.zip
error getting https://www.dropbox.com/s/8s6pjo91spchkfx/downtown.bin.zip?dl=1: HTTP status client error (404 Not Found) for url (https://www.dropbox.com/s/dl/8s6pjo91spchkfx/downtown.bin.zip), continuing
> rm tmp_download.zip
> download https://www.dropbox.com/s/dkmf2qaa991uxym/lakeslice.bin.zip?dl=1 to tmp_download.zip
error getting https://www.dropbox.com/s/dkmf2qaa991uxym/lakeslice.bin.zip?dl=1: HTTP status client error (404 Not Found) for url (https://www.dropbox.com/s/dl/dkmf2qaa991uxym/lakeslice.bin.zip), continuing
> rm tmp_download.zip
> download https://www.dropbox.com/s/cu8exdobcdaj6sm/south_seattle.bin.zip?dl=1 to tmp_download.zip
error getting https://www.dropbox.com/s/cu8exdobcdaj6sm/south_seattle.bin.zip?dl=1: HTTP status client error (404 Not Found) for url (https://www.dropbox.com/s/dl/cu8exdobcdaj6sm/south_seattle.bin.zip), continuing
> rm tmp_download.zip
> download https://www.dropbox.com/s/x9o4lg7fgvy5kr2/west_seattle.bin.zip?dl=1 to tmp_download.zip
error getting https://www.dropbox.com/s/x9o4lg7fgvy5kr2/west_seattle.bin.zip?dl=1: HTTP status client error (404 Not Found) for url (https://www.dropbox.com/s/dl/x9o4lg7fgvy5kr2/west_seattle.bin.zip), continuing
> rm tmp_download.zip
> download https://www.dropbox.com/s/frg14z9f90qrk8v/weekday.bin.zip?dl=1 to tmp_download.zip
error getting https://www.dropbox.com/s/frg14z9f90qrk8v/weekday.bin.zip?dl=1: HTTP status client error (404 Not Found) for url (https://www.dropbox.com/s/dl/frg14z9f90qrk8v/weekday.bin.zip), continuing
> rm tmp_download.zip
> download https://www.dropbox.com/s/gzxg2orfu8z9s6x/weekday.bin.zip?dl=1 to tmp_download.zip
error getting https://www.dropbox.com/s/gzxg2orfu8z9s6x/weekday.bin.zip?dl=1: HTTP status client error (404 Not Found) for url (https://www.dropbox.com/s/dl/gzxg2orfu8z9s6x/weekday.bin.zip), continuing
> rm tmp_download.zip
> download https://www.dropbox.com/s/2sf37gu7nur9o37/weekday.bin.zip?dl=1 to tmp_download.zip
error getting https://www.dropbox.com/s/2sf37gu7nur9o37/weekday.bin.zip?dl=1: HTTP status client error (404 Not Found) for url (https://www.dropbox.com/s/dl/2sf37gu7nur9o37/weekday.bin.zip), continuing
> rm tmp_download.zip
> download https://www.dropbox.com/s/kasxsyett83oo03/weekday.bin.zip?dl=1 to tmp_download.zip
error getting https://www.dropbox.com/s/kasxsyett83oo03/weekday.bin.zip?dl=1: HTTP status client error (404 Not Found) for url (https://www.dropbox.com/s/dl/kasxsyett83oo03/weekday.bin.zip), continuing
> rm tmp_download.zip
> download https://www.dropbox.com/s/0cuc7urc0fsgqi5/weekday.bin.zip?dl=1 to tmp_download.zip
error getting https://www.dropbox.com/s/0cuc7urc0fsgqi5/weekday.bin.zip?dl=1: HTTP status client error (404 Not Found) for url (https://www.dropbox.com/s/dl/0cuc7urc0fsgqi5/weekday.bin.zip), continuing
> rm tmp_download.zip
> download https://www.dropbox.com/s/8ypid1mus95ivni/weekday.bin.zip?dl=1 to tmp_download.zip
error getting https://www.dropbox.com/s/8ypid1mus95ivni/weekday.bin.zip?dl=1: HTTP status client error (404 Not Found) for url (https://www.dropbox.com/s/dl/8ypid1mus95ivni/weekday.bin.zip), continuing
> rm tmp_download.zip
> download https://www.dropbox.com/s/rvyypxoc0awubn1/weekday.bin.zip?dl=1 to tmp_download.zip
error getting https://www.dropbox.com/s/rvyypxoc0awubn1/weekday.bin.zip?dl=1: HTTP status client error (404 Not Found) for url (https://www.dropbox.com/s/dl/rvyypxoc0awubn1/weekday.bin.zip), continuing
> rm tmp_download.zip
> download https://www.dropbox.com/s/msb70poj5dl29q2/weekday.bin.zip?dl=1 to tmp_download.zip
error getting https://www.dropbox.com/s/msb70poj5dl29q2/weekday.bin.zip?dl=1: HTTP status client error (404 Not Found) for url (https://www.dropbox.com/s/dl/msb70poj5dl29q2/weekday.bin.zip), continuing
> rm tmp_download.zip
> download https://www.dropbox.com/s/l3lfj8gk232cyrl/weekday.bin.zip?dl=1 to tmp_download.zip
error getting https://www.dropbox.com/s/l3lfj8gk232cyrl/weekday.bin.zip?dl=1: HTTP status client error (404 Not Found) for url (https://www.dropbox.com/s/dl/l3lfj8gk232cyrl/weekday.bin.zip), continuing
> rm tmp_download.zip
thread 'main' panicked at 'Failed to download stuff: ["data/system/maps/ballard.bin", "data/system/maps/downtown.bin", "data/system/maps/lakeslice.bin", "data/system/maps/south_seattle.bin", "data/system/maps/west_seattle.bin", "data/system/prebaked_results/lakeslice/weekday.bin", "data/system/prebaked_results/montlake/weekday.bin", "data/system/scenarios/ballard/weekday.bin", "data/system/scenarios/downtown/weekday.bin", "data/system/scenarios/lakeslice/weekday.bin", "data/system/scenarios/montlake/weekday.bin", "data/system/scenarios/south_seattle/weekday.bin", "data/system/scenarios/udistrict/weekday.bin", "data/system/scenarios/west_seattle/weekday.bin"]', updater/src/main.rs:69:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
dabreegster commented 4 years ago

I think you need to rebase against master again; the URLs changed earlier today

at-tran commented 3 years ago

IIUC, the main problem is that you weren't able to calculate accurate event counts without storing all the events. I think we can solve this with dynamic programming: we store the running total of events from the start of the simulation at every time increment. So in order to get the count of events between 3:00 and 4:00, we subtract the total at 3:00 from the total at 4:00. We don't have to store all the events, just the total event count at every minute. So the memory required scales linearly with the length of the simulation, which is actually constant: 24 hours -> 1440 minutes, or 1440 numbers per mode. Do you think this could work? If yes, I'll try implementing it.
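The cumulative-total idea can be sketched as follows. This is an illustration only; `CumulativeCounts` and its methods are hypothetical names, not the real analytics code:

```rust
// Prefix sums over per-minute event counts: totals[m] is the number of
// events from the start of the day through minute m, so the count in
// any interval [start, end) is totals[end] - totals[start].
struct CumulativeCounts {
    totals: Vec<u32>,
}

impl CumulativeCounts {
    /// Build running totals from per-minute event counts.
    fn from_per_minute(counts: &[u32]) -> Self {
        let mut totals = vec![0u32; counts.len() + 1];
        for (i, c) in counts.iter().enumerate() {
            totals[i + 1] = totals[i] + c;
        }
        CumulativeCounts { totals }
    }

    /// Events in the half-open interval [start_min, end_min).
    fn count_between(&self, start_min: usize, end_min: usize) -> u32 {
        self.totals[end_min] - self.totals[start_min]
    }
}

fn main() {
    // 5 events in minute 180 (3:00am), 7 in minute 181, none elsewhere.
    let mut per_min = vec![0u32; 1440];
    per_min[180] = 5;
    per_min[181] = 7;
    let c = CumulativeCounts::from_per_minute(&per_min);
    // Count between 3:00 and 4:00 = total at 4:00 minus total at 3:00.
    assert_eq!(c.count_between(180, 240), 12);
    // Nothing happened before 3:00.
    assert_eq!(c.count_between(0, 180), 0);
}
```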

dabreegster commented 3 years ago

Hmm, I hadn't considered this approach before -- it's interesting. The throughput is tracked per road segment and intersection, so the total storage would depend a bit on map size. Our smaller maps have around 1,000 roads+intersections, and around 10,000 for the larger. So with 4 modes and minute resolution, that'd be about 1440 * 4 * 10,000 = 57,600,000 counts. u8 only lets us count up to 255 -- maybe that's actually reasonable for one minute on a single road? But if not, we'd need at least a u16. So approx storage cost for a large map would be 109MB. For the smaller map with about 1k objects, about 11MB.

I bet there are lots of optimizations possible:

Looking at how the current hour-granularity is stored:

/// (Road or intersection, type, hour block) -> count for that hour
pub counts: BTreeMap<(X, AgentType, usize), usize>

This is pretty silly -- using (X, AgentType) as the key and Vec<usize> for the values would be an improvement; why also store the hour offset in the key?
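A minimal sketch of that suggested layout, with RoadID and the AgentType variants here as stand-ins for the real map model types:

```rust
// Key by (object, mode) once and index the per-hour counts by position
// in a Vec, instead of repeating the hour offset in every key.
use std::collections::BTreeMap;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
#[allow(dead_code)]
enum AgentType { Car, Bike, Bus, Pedestrian }

type RoadID = usize; // stand-in for the real X type

struct Counts {
    // (road, mode) -> count for each of the 24 hour blocks
    per_hour: BTreeMap<(RoadID, AgentType), Vec<usize>>,
}

impl Counts {
    fn record(&mut self, road: RoadID, mode: AgentType, hour: usize) {
        self.per_hour
            .entry((road, mode))
            .or_insert_with(|| vec![0; 24])[hour] += 1;
    }

    fn count(&self, road: RoadID, mode: AgentType, hour: usize) -> usize {
        self.per_hour.get(&(road, mode)).map_or(0, |v| v[hour])
    }
}

fn main() {
    let mut c = Counts { per_hour: BTreeMap::new() };
    c.record(3, AgentType::Bike, 7);
    c.record(3, AgentType::Bike, 7);
    assert_eq!(c.count(3, AgentType::Bike, 7), 2);
    assert_eq!(c.count(3, AgentType::Car, 7), 0);
}
```

The same shape works for finer granularity: a Vec of length 1440 gives per-minute buckets with no change to the keys.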

So anyway, I would love some implementation help here! I think switching to this count-at-every-time approach makes perfect sense, and we can play around with the granularity (1 min, 5 mins, 10 mins). Code references:

Let me know if you get stuck, and thanks for looking into this!

at-tran commented 3 years ago

Great! I'll get started on implementing it.

dabreegster commented 3 years ago

HN is the new Stack Overflow: https://news.ycombinator.com/item?id=26401935

Example of the current approach's problem: screencast

dabreegster commented 3 years ago

Inspired by yesterday, I tried the linear interpolation approach every hour. Didn't seem to help. Much of the time, the count before for a road is very small -- less than 5. Any percentage of that is also tiny. Should also revisit the color scheme. Going to add tooltips with exact counts, to help debug.

To preserve some of the code:

        let now = app.primary.sim.time();
        // What percentage are we through the current hour?
        let pct = {
            let (_, mins, secs, centis) = now.get_parts();
            let dt = Duration::minutes(mins) + Duration::seconds((secs as f64) + (centis as f64) / 10.0);
            dt / Duration::hours(1)
        };

        let lerp = |hr, count| {
            if hr < now.get_hours() {
                count
            } else if hr == now.get_hours() {
                // Linearly interpolate
                (pct * (count as f64)) as usize
            } else {
                0
            }
        };
Shiandow commented 3 years ago

If the problem is just that the counts are too low, then you can probably fix that by calculating (a + 1) / (a + b + 2) instead of merely a/b (this changes the scale from 0-to-infinity to 0-to-1).

If you need a mathematical justification: (a + 1) / (a + b + 2) is the posterior mean of a Beta distribution under a uniform prior (Laplace's rule of succession), which is what you get when you do statistical analysis on this kind of count data. This quantity is also symmetric (if you swap 'after' and 'before' you'll just get 1 minus this value) and bounded (so no near-infinities), which I think will help.
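A tiny sketch of this smoothing, with a as the 'after' count and b as the 'before' count (the function name is illustrative):

```rust
// Beta-smoothed comparison of two counts: maps any pair of counts into
// (0, 1), with 0.5 meaning "no change". Avoids the blow-up of a/b when
// the 'before' count is 0 or very small.
fn smoothed_ratio(after: usize, before: usize) -> f64 {
    (after as f64 + 1.0) / (after as f64 + before as f64 + 2.0)
}

fn main() {
    // Equal traffic lands exactly at 0.5.
    assert!((smoothed_ratio(100, 100) - 0.5).abs() < 1e-9);
    // "0 before, 2 after": the raw ratio would be infinite, but the
    // smoothed value is a mild 0.75.
    assert!((smoothed_ratio(2, 0) - 0.75).abs() < 1e-9);
    // Symmetry: swapping before/after mirrors the value around 0.5.
    assert!((smoothed_ratio(0, 2) - 0.25).abs() < 1e-9);
}
```

This makes tiny counts like the "0 before, 2 after" case produce a moderate color on the comparison scale rather than a maximally alarming one.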

matthieu-foucault commented 1 year ago

@dabreegster I'd like to try and help with this.

Have you considered using more than one metric to visualize throughput?

We can look at how a tool like Sysdig (distributed system monitoring tool) aggregates data (https://docs.sysdig.com/en/docs/sysdig-monitor/metrics/data-aggregation/), i.e., when downsampling the data to a 1-hour resolution (or lower), we could record 4 metrics:

When visualizing the data, I'd start with looking at the average and maximum throughput. The average value should remain meaningful for hours that are not complete and give you an idea of the overall traffic. The maximum would allow you to see traffic spikes that you wouldn't otherwise see when looking at the average.

Backtracking a bit, we probably want to make sure we understand what actionable information we want to see through this visualization.

dabreegster commented 1 year ago

Backtracking a bit, we probably want to make sure we understand what actionable information we want to see through this visualization.

The bigger picture: you make some edits to roads (or traffic signal timing), run a simulation, and want to understand what areas are seeing more or less traffic relative to a baseline of no edits. That could clue you in to unexpected side-effects of your change (you make one road car-free and expect vehicle traffic to divert to a parallel main road, but instead people cut through a neighbourhood). This is helpful to watch in real-time as the simulation runs, since some of the big changes might only show up at certain times of day. So, the goal is for something like https://github.com/a-b-street/abstreet/issues/85#issuecomment-794532989 to meaningfully summarize this information for someone.

Currently there's only one built-in map with prebaked results from running a full baseline simulation. You can cd apps/game; cargo run -- --dev data/system/us/seattle/scenarios/montlake/weekday.bin, open throughput layer (l, then t), and tick "Compare before proposal" to see it. If you make no changes to the map, the colors should stay white, indicating there's no change relative to the original prebaked data.

There may be simpler and more effective approaches to this problem than a per-road color scale. And the current scale is possibly too misleading -- if you hover over a dark red road, you see "0 before, 2 after", which is a huge relative increase, but meaningless on an absolute scale.

The average value should remain meaningful for hours that are not complete and give you an idea of the overall traffic

That's clever, I hadn't thought of that! Trying something like Sysdig's approach makes sense to me; I'd be very surprised if this type of problem isn't well-solved in other contexts, and internet traffic is hardly a big domain leap. :)