bicyclingplus / atp-bc-tool-analysis

Analyzing inputs/outputs from the CTC Active Transportation Benefits/Costs tool, to identify and investigate potential issues (analysis notebooks only - input/output data is not included)

Evaluate crash model - why is it estimating hundreds/thousands of crashes? #5

Open mRaffill opened 11 months ago

mRaffill commented 11 months ago

Originally posted by @mRaffill in https://github.com/mRaffill/atp-bc-tool-analysis/issues/2#issuecomment-1721765131

To-do list:

mRaffill commented 11 months ago

In 433385f14c6230da535079f25055ca166e1715e5 I added linear regressions for Project length and Project length + volume, and graphed some scatter plots comparing crashes to length and crashes to volume.

Next steps:

mRaffill commented 11 months ago

Maybe look directly at the $ECC_{mojvf}$ compared to $L_{jvf}$ and $V_{mj}$ - this might make it easier to compare across functional class/volume class

mRaffill commented 11 months ago

After trying a bunch of combinations, the most I've been able to find is that the ratio of crashes to number of intersections seems much higher than the ratio of crashes to number (or length) of segments. It almost seems like there is some constant multiplier somewhere that is making intersection crashes much higher than segment crashes?

(add examples/graphs later)

mRaffill commented 11 months ago

Per Dillon's suggestion, I started looking at one individual project and going through each step of the crash equation to see where it started seeming "off."

mRaffill commented 11 months ago

The total crashes after adding up crashes for each intersection and crashes for each segment are much lower than the crashes calculated using total lengths/counts/volumes and the functional/volume classes equation (what is used currently in the tool). They both use the same initial data from each segment and intersection - so there must be something different in the process to calculate crashes as values are added up across all of the segments/intersections.

I looked again at the tables used to calculate crashes to try and find differences between calculating for individual intersections/segments and calculating for functional/volume classes. The alpha constants are the same. Multiplying by the total length vs multiplying by the individual lengths and then finding the sum should be the same.

But it seems like there might be an issue with volume. Each segment/intersection has individual volume (pedestrian/bicycle exposure), but the aggregate version uses the total exposure for that mode - not separated by functional class and volume class. So the crashes split by volume and functional class will use the same volume over and over again... I think?

And then since that is then multiplied across all of the segments/intersections (eg. the total count or the total length), it will be like all of the segments/intersections had the total volume instead of their individual volume.

Looking at the equations, this does seem to be the case:

$EC_{cmoj} = \sum_{f}\sum_{v} ECC_{cmojvf}$

$ECC_{cmojvf} = e^{\alpha_{mojvf}} L_{jvf} (EV_{cmj})^{p}$

So:

$EC_{cmoj} = \sum_{f}\sum_{v} e^{\alpha_{mojvf}} L_{jvf} (EV_{cmj})^{p}$

But $(EV_{cmj})^{p}$ is not split by $f$ or $v$, so it is duplicated in every term of the sum.
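A toy numerical check of the duplication (all numbers are hypothetical, not from the tool's data): reusing the mode total $EV_{cmj}$ in every term inflates the estimate compared to using each class's own volume, because the class total is at least as large as any single class's volume.

```python
import math

# Toy check (hypothetical numbers): two (volume class, functional class)
# combinations for one mode/outcome/location type.
p = 0.5  # safety-in-numbers exponent currently used by the tool
alpha = {("v1", "f1"): -8.0, ("v2", "f2"): -8.0}     # hypothetical alpha constants
length = {("v1", "f1"): 2.0, ("v2", "f2"): 3.0}      # L_jvf (miles)
volume = {("v1", "f1"): 400.0, ("v2", "f2"): 600.0}  # exposure per class

total_volume = sum(volume.values())  # EV_cmj, not split by f or v

# Documented equation: the total volume is reused in every term of the sum.
ec_total_vol = sum(math.exp(a) * length[k] * total_volume**p
                   for k, a in alpha.items())

# Per-class volume: each term uses only its own class's exposure.
ec_class_vol = sum(math.exp(a) * length[k] * volume[k]**p
                   for k, a in alpha.items())

print(ec_total_vol, ec_class_vol)  # the total-volume version comes out larger
```

Whenever more than one class has positive volume, the total-volume version can only overestimate relative to the per-class version, since $(EV_{cmj})^{p} \ge (EV_{cmjvf})^{p}$ in every term.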

Two options I can think of:

1. Multiply volume separately:

   $EC_{cmoj} = (EV_{cmj})^{p} \sum_{f}\sum_{v} e^{\alpha_{mojvf}} L_{jvf}$

2. Split volume (since it comes from the segment/intersection properties, it can also be split by volume class and functional class):

   $EC_{cmoj} = \sum_{f}\sum_{v} e^{\alpha_{mojvf}} L_{jvf} (EV_{cmjvf})^{p}$

Or just use the approach of calculating crashes for each individual segment/intersection that I just tried.

My question is: is my approach valid and the existing approach wrong? Or is the inconsistency because my approach has something missing?

mRaffill commented 11 months ago

I also wonder if part of this has to do with the number of intersections selected - in the tool, multiple corners/parts of an intersection are often selected. For example, this one intersection in the individual project I was looking at has 8 sub-intersections selected, which will each be counted as separate intersections when the tool estimates crashes.

(screenshot, 2023-09-22: the intersection with 8 sub-intersections selected)

But I'm not sure how intersections were counted to get the average crashes/intersection for the state level metrics (for calculating the alpha constants). When those constants were calculated, did one intersection mean the entire intersection or just one side of the intersection?

mRaffill commented 11 months ago

> In 433385f I added linear regressions for Project length and Project length + volume, and graphed some scatter plots comparing crashes to length and crashes to volume.

Adding results (some subset of them which I think are the easiest to understand) here for reference:

Intersections

(scatter plot: combined across all modes and outcomes, only for network intersections)

| Mode / Outcome | bicycling | walking | combined |
| --- | --- | --- | --- |
| crash | (image) | (image) | (image) |
| injury | (image) | (image) | (image) |
| death | (image) | (image) | (image) |

Roadways

(scatter plot: combined across all modes and outcomes, only for network segments)

| Mode / Outcome | bicycling | walking | combined |
| --- | --- | --- | --- |
| crash | (image) | (image) | (image) |
| injury | (image) | (image) | (image) |
| death | (image) | (image) | (image) |
mRaffill commented 11 months ago

So using the approach calculating separately at each segment/intersection (https://github.com/mRaffill/atp-bc-tool-analysis/issues/5#issuecomment-1730729158, https://github.com/mRaffill/atp-bc-tool-analysis/issues/5#issuecomment-1730864833) across all projects:

| | Roadway | Intersection |
| --- | --- | --- |
| Before | (image) | (image) |
| After | (image) | (image) |

Sorry for the bad formatting with all of the weird x-axis labels! Also note that I don't think they're in the same order of Project IDs, so they don't necessarily line up. But I think it shows pretty clearly that after taking out the duplicated volume, the crashes become much closer to what is expected.

There are still some things which look a bit confusing, like bicycling change in crashes being greater than combined change in crashes (probably has to do with the crash reduction factors).

mRaffill commented 10 months ago

Uh oh... I tried calculating using the equation $EC_{cmoj} = (EV_{cmj})^{p} \sum_{f}\sum_{v} e^{\alpha_{mojvf}} L_{jvf}$ and either I've done something very wrong or this wasn't the issue to begin with, because I am still getting the same thousands of crashes (the results actually look almost identical)...

(image)

Compared to what is currently in the tool:

(image)

Did I implement something wrong? Is the equation itself wrong or is some other issue with the crash model that this change doesn't address?

mRaffill commented 10 months ago

After going around in circles trying to figure out what was wrong with either the equation or my code implementing it, I started to wonder whether it is really valid to move the volume outside of the summation, and I don't think it is.

Since the existing volume $EV_{cmj}$ is just the sum of the volumes from each volume/functional class:

$EC_{cmoj} = (EV_{cmj})^{p} \sum_{f}\sum_{v} e^{\alpha_{mojvf}} L_{jvf}$

$EC_{cmoj} = \left(\sum_{f}\sum_{v} EV_{cmjvf}\right)^{p} \sum_{f}\sum_{v} e^{\alpha_{mojvf}} L_{jvf}$

This has cross-multiplication between the volume terms and the $e^{\alpha_{mojvf}} L_{jvf}$ terms: volume from every volume/functional class gets multiplied by the length term for every volume/functional class, instead of only by the terms with the matching volume/functional class (a product of sums is not the same as a sum of products).
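The product-of-sums vs sum-of-products gap can be checked with toy numbers (hypothetical volumes and length terms, with alpha folded into the length values for simplicity):

```python
# Two volume/functional classes (hypothetical toy values):
vols = [400.0, 600.0]  # EV per class
lens = [2.0, 3.0]      # length-times-e^alpha term per class (toy scale)
p = 0.5

product_of_sums = sum(vols) ** p * sum(lens)                   # volume pulled outside the sum
sum_of_products = sum(v ** p * l for v, l in zip(vols, lens))  # volume kept inside each term

print(product_of_sums, sum_of_products)  # 158.11..., 113.48...
```

The two expressions only coincide in degenerate cases (e.g. a single nonzero class), so pulling the volume outside the summation changes the result.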

So I think only the second option for the equation would be valid:

$EC_{cmoj} = \sum_{f}\sum_{v} e^{\alpha_{mojvf}} L_{jvf} (EV_{cmjvf})^{p}$

dtfitch commented 10 months ago

Isn't that last equation what I originally wrote that was also producing wild results?

mRaffill commented 10 months ago

Do you mean the equation that the tool was originally using?

I think the equation in the documentation is

$EC_{cmoj} = \sum_{f}\sum_{v} e^{\alpha_{mojvf}} L_{jvf} (EV_{cmj})^{p}$

whereas this one has the volume also split by volume class ($v$) and functional class ($f$):

$EC_{cmoj} = \sum_{f}\sum_{v} e^{\alpha_{mojvf}} L_{jvf} (EV_{cmjvf})^{p}$

So the total volume (across all volume/functional classes) won't be used multiple times and duplicated. Volume comes from the individual segment/intersection properties, so it seems like the only real change would be making the process of adding up the volume numbers slightly different.

However: I tried calculating this way, but the results are still much larger than calculating for individual segments/intersections and then finding the total. They have a very similar pattern of results (shape of the graph), but this approach is scaled much larger. So this equation could have more issues, or the individual segment/intersection method might have issues, or I'm just missing a constant somewhere.

| | segments | intersections |
| --- | --- | --- |
| individual segments or intersections | (image) | (image) |
| by volume or functional class | (image) | (image) |

Anyways, I am still very confused about these equations, so I'll look at them again tomorrow/over the weekend and hopefully get to understand them better.

dtfitch commented 10 months ago

Okay, I see the difference. It does seem like there is some normalizing constant that is missing somehow. Thanks for continuing to dig!

mRaffill commented 10 months ago

I thought about this more and have a different idea. The equation for the individual segments/intersections, where $w$ is the individual intersection or segment number/ID, would be:

$EC_{cmoj} = \sum_{f}\sum_{v}\sum_{w} e^{\alpha_{mojvf}} L_{jvfw} (EV_{cmjvfw})^{p}$

Since alpha is constant across all segments/intersections in the same volume and functional class, this is equivalent to:

$EC_{cmoj} = \sum_{f}\sum_{v} e^{\alpha_{mojvf}} \left(\sum_{w} L_{jvfw} (EV_{cmjvfw})^{p}\right)$

However, the tool currently adds up the volume and length/count across all intersections/segments separately and then multiplies them together:

$EC_{cmoj} = \sum_{f}\sum_{v} e^{\alpha_{mojvf}} \left(\sum_{w} L_{jvfw}\right) \left(\sum_{w} EV_{cmjvfw}\right)^{p}$

It seems like these equations are not equivalent, again because of the "cross-multiplication"/distributive property when multiplying two sums. That might be why the results are so different when adding up the terms in different ways.
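A quick sketch with hypothetical per-segment numbers shows how far apart the two aggregations can land within a single (v, f) class:

```python
# Hypothetical per-segment data within one (volume class, functional class):
L_w = [0.5, 1.5, 1.0]         # segment lengths (miles)
EV_w = [100.0, 900.0, 250.0]  # per-segment exposure
p = 0.5

# Per-segment: sum of L_w * EV_w^p (equivalent to summing per-segment crashes,
# since e^alpha is a constant factor within the class).
per_segment = sum(l * ev ** p for l, ev in zip(L_w, EV_w))

# Tool's current aggregation: (sum of L_w) * (sum of EV_w)^p.
aggregated = sum(L_w) * sum(EV_w) ** p

print(per_segment, aggregated)  # 65.81..., 106.06...
```

The gap comes entirely from multiplying the two sums together, which introduces every length-times-volume cross term rather than only the matched per-segment products.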

I can't mentally process all of these summations to figure out the differences, so I tried writing out what terms would actually be added. For one volume/functional class combination (so crashes/intersection/person or crashes/mile/person is constant):

(crashes/intersection/person * 1 intersection * people) + (crashes/intersection/person * 1 intersection * people) + ...

- does **not** equal (crashes/intersection/person) * (people + people + people + ...) * (1 intersection + 1 intersection + 1 intersection + ...)
- **does** equal (crashes/intersection/person) * (1 intersection * people + 1 intersection * people + ...)

(crashes/mile/person * miles * people) + (crashes/mile/person * miles * people) + ...

- does **not** equal (crashes/mile/person) * (people + people + people + ...) * (miles + miles + miles + ...)
- **does** equal (crashes/mile/person) * (miles * people + miles * people + ...)

So basically, it looks to me like this is cross-multiplication again in the volume/functional class approach. But the alpha constants were initially made from volume/functional classes. So is it actually valid to use the alpha constants this way to calculate crashes at each individual segment/intersection? (ignoring the e^ and ln for now because those just cancel out)

(average crashes/miles or intersections)/(total volume)^p

(total crashes/total length or count)/(total volume)^p

((crashes + crashes + crashes + ...)/(1 intersection + 1 intersection + 1 intersection + ...))/(people + people + people + ...)^p

((crashes + crashes + crashes + ...)/(miles + miles + miles + ...))/(people + people + people + ...)^p

But these individual internal calculations don't matter because the whole point is to get the AVERAGE across the entire state. So I think it should be reasonable to calculate at individual segments or intersections:

(crashes/intersection/person) * (1 intersection * people + 1 intersection * people + ...)

((average crashes/miles or intersections)/(total volume)^p) * (1 intersection * people + 1 intersection * people + ...)

Now it seems like another issue is how to deal with the ^0.5 exponent for volume and how that distributes over a volume class vs an individual segment/intersection???
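The exponent question can be checked directly: a fractional power does not distribute over a sum of volumes, so applying it at the class level vs the segment level gives different numbers (hypothetical volumes below):

```python
# Hypothetical exposures for two segments in the same class:
a, b = 400.0, 600.0

print((a + b) ** 0.5)       # exponent applied once, to the class total
print(a ** 0.5 + b ** 0.5)  # exponent applied per segment, then summed
```

For any exponent strictly between 0 and 1, applying it per segment and then summing always gives the larger number, so where the safety-in-numbers constant is applied genuinely changes the estimate.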
mRaffill commented 10 months ago

Sorted min-max:

Roadways: (image)

Intersections: (image)

mRaffill commented 10 months ago

> Now it seems like another issue is how to deal with the ^0.5 exponent for volume and how that distributes over a volume class vs an individual segment/intersection???

We discussed this and thought it might be reasonable to apply the safety-in-numbers constant at the individual segment/intersection level, because the literature this comes from does have a "micro-scale" constant. Actually, the constant the tool is using may not even be correct, but before looking into what would be the correct constant to use, Dillon suggested graphing how much the crashes change when the constant changes.

This was pretty easy to graph, and it looks like the constant does have a big impact on crashes. Even a small change like increasing from 0.5 to 0.6 results in almost twice as many crashes.

| safety in numbers constant | segments | intersections |
| --- | --- | --- |
| 0.1 | (image) | (image) |
| 0.2 | (image) | (image) |
| 0.3 | (image) | (image) |
| 0.4 | (image) | (image) |
| 0.5 (current) | (image) | (image) |
| 0.6 | (image) | (image) |
| 0.7 | (image) | (image) |
| 0.8 | (image) | (image) |

I also notice that for very small constants, bicycling crashes are above combined crashes... Maybe this is because applying the constant after adding bicycle + pedestrian volume isn't the same as adding bicycle and pedestrian volumes that each already have the constant applied?

mRaffill commented 10 months ago

Oh, but changing the safety in numbers constant should also change the alpha constant - the equation to calculate alpha includes the safety in numbers constant: $\alpha = \ln(\frac{crashes}{V^{p}})$

So I probably have to go back to the Excel file where all of the alpha constants were calculated, change the safety in numbers constant there too, and then recalculate crashes with the new alpha constants.
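A minimal sketch of why adjusting alpha compensates, using the stated relation $\alpha = \ln(crashes / V^{p})$ with hypothetical calibration numbers: changing $p$ changes $\alpha$ in the opposite direction, so the predicted crashes at the calibration volume stay exactly the same.

```python
import math

# Hypothetical placeholders for the state-level averages behind one alpha constant:
avg_crashes = 0.05    # average crashes per mile (or per intersection)
total_volume = 1000.0 # statewide exposure for the class

def alpha_for(p):
    # alpha = ln(crashes / V^p), per the relation in the thread
    return math.log(avg_crashes / total_volume ** p)

alpha_old = alpha_for(0.5)
alpha_new = alpha_for(0.4)

# Lowering p shrinks V^p, so alpha rises to compensate; at the calibration
# volume the two parameterizations predict identical crashes.
print(math.exp(alpha_old) * total_volume ** 0.5)  # 0.05
print(math.exp(alpha_new) * total_volume ** 0.4)  # 0.05
```

Differences between the two parameterizations then only appear at volumes away from the calibration volume, which is consistent with the variation looking much less extreme after recomputing alpha.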

dtfitch commented 10 months ago

Good catch. This variation should be strong though - an exponent of 0.4 is a 60% reduction from safety in numbers alone.

mRaffill commented 10 months ago

Right, makes sense! I guess it was just surprising seeing that visually.

Taking into account the change in alpha constant, variation is less extreme:

| safety in numbers constant | segments | intersections |
| --- | --- | --- |
| 0.1 | (image) | (image) |
| 0.2 | (image) | (image) |
| 0.3 | (image) | (image) |
| 0.4 | (image) | (image) |
| 0.5 (current) | (image) | (image) |
| 0.6 | (image) | (image) |
| 0.7 | (image) | (image) |
| 0.8 | (image) | (image) |
dtfitch commented 10 months ago

Wonderful! This makes it seem like less of a scary decision to make. Okay, I think we go with the most recent citation and select 0.4. So is this what we should tell Matt to implement?

$EC_{cmoj} = \sum_{f}\sum_{v}\sum_{w} e^{\alpha_{mojvf}} L_{jvfw} (EV_{cmjvfw})^{0.4}$

Can you list out each subscript so it is clear for him? Also, we need to give him a new look up table for alpha constants, right? thanks!

mRaffill commented 10 months ago

> Can you list out each subscript so it is clear for him?

$EC_{cmoj} = \sum_{f}\sum_{v}\sum_{w} e^{\alpha_{mojvf}} L_{jvfw} (EV_{cmjvfw})^{0.4}$

- $c$ = column (safety, per capita, per jobs)
- $m$ = mode
- $o$ = outcome (crash, injury, death)
- $j$ = location type (segment or intersection)
- $f$ = functional class
- $v$ = volume class
- $w$ = project selected segments or intersections (based on what $j$ is)

All of these subscripts are the same as in the benefits calculation documentation, except $w$.
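A minimal sketch of how the recommended equation might be implemented, iterating over the selected segments/intersections directly (the field names, the alpha lookup keys, and all numbers below are hypothetical, not the tool's actual schema):

```python
import math

P = 0.4  # safety-in-numbers exponent chosen above

def estimated_crashes(features, alpha_lookup):
    """EC = sum over w of e^alpha_(v,f) * L_w * EV_w^P, for one mode/outcome/
    location type. `features` is an iterable of dicts with hypothetical keys
    'volume_class', 'functional_class', 'length' (miles, or 1 per intersection),
    and 'exposure' (EV for the segment/intersection)."""
    total = 0.0
    for w in features:
        alpha = alpha_lookup[(w["volume_class"], w["functional_class"])]
        total += math.exp(alpha) * w["length"] * w["exposure"] ** P
    return total

# Example with two hypothetical segments and made-up alpha constants:
alphas = {("low", "arterial"): -8.0, ("high", "local"): -7.5}
segs = [
    {"volume_class": "low", "functional_class": "arterial",
     "length": 2.0, "exposure": 400.0},
    {"volume_class": "high", "functional_class": "local",
     "length": 1.0, "exposure": 900.0},
]
print(estimated_crashes(segs, alphas))
```

Because the loop applies alpha and the exponent per feature, grouping by volume/functional class first is unnecessary; the class subscripts only matter for the alpha lookup.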

> Also, we need to give him a new look up table for alpha constants, right?

Yes - should I make a new one with Python and put it (the script and output) in GitHub/Box? Or modify the Excel file where the alpha constants were originally calculated? Or some other way?

dtfitch commented 10 months ago

Great, thanks! I think a new list of alpha constants in Python that pulls from the data Matt has is safer than the spreadsheet.