Closed toliwaga closed 7 years ago
This is caused because one tour ((19970) has zero probabilities for all modes because all NaN utilities
probs.loc[19970,].sum(axis=0) = 1.8e-299
Alternative
BIKE 1.000000e-300
DRIVEALONEFREE 1.000000e-300
DRIVEALONEPAY 1.000000e-300
DRIVE_COM 1.000000e-300
DRIVE_EXP 1.000000e-300
DRIVE_HVY 1.000000e-300
DRIVE_LOC 1.000000e-300
DRIVE_LRF 1.000000e-300
SHARED2FREE 1.000000e-300
SHARED2PAY 1.000000e-300
SHARED3FREE 1.000000e-300
SHARED3PAY 1.000000e-300
WALK 1.000000e-300
WALK_COM 1.000000e-300
WALK_EXP 1.000000e-300
WALK_HVY 1.000000e-300
WALK_LOC 1.000000e-300
WALK_LRF 1.000000e-300
Name: 19970, dtype: float64
utilities.loc[19970,]
Alternative
BIKE NaN
DRIVEALONEFREE NaN
DRIVEALONEPAY NaN
DRIVE_COM NaN
DRIVE_EXP NaN
DRIVE_HVY NaN
DRIVE_LOC NaN
DRIVE_LRF NaN
SHARED2FREE NaN
SHARED2PAY NaN
SHARED3FREE NaN
SHARED3PAY NaN
WALK NaN
WALK_COM NaN
WALK_EXP NaN
WALK_HVY NaN
WALK_LOC NaN
WALK_LRF NaN
Name: 19970, dtype: object
choosers.loc[19970,]
destination 1439
destination_in_cbd NaN
person_id 820171
tour_departure_and_duration 36
tour_num 1
tour_type school
start 6
end 23
duration 17
make_choices in mnl.py probably should check for probs not summing to 1 or being NaNs?
def make_choices(probs):
"""
Make choices for each chooser from among a set of alternatives.
Parameters
----------
probs : pandas.DataFrame
Rows for choosers and columns for the alternatives from which they
are choosing. Values are expected to be valid probabilities across
each row, e.g. they should sum to 1.
"""
Tour 19970 was assigned destination TAZ 1439 which has a dest_density_index Nan because land_use.household_density and land_use.employment_density in that TAZ are both 0 so density_index is undefined.
@orca.column("land_use")
def density_index(land_use):
return (land_use.household_density * land_use.employment_density) / \
(land_use.household_density + land_use.employment_density)
Thus the 10 rows in tour_mode_choice.csv that use dest_density_index eval to NaN for this tour
Thus the Utilities computed by the cross product with spec are likewise Nan when
utilities = model_design.dot(spec)
utilities.loc[19970,]
Alternative
BIKE NaN
DRIVEALONEFREE NaN
DRIVEALONEPAY NaN
DRIVE_COM NaN
DRIVE_EXP NaN
DRIVE_HVY NaN
DRIVE_LOC NaN
DRIVE_LRF NaN
SHARED2FREE NaN
SHARED2PAY NaN
SHARED3FREE NaN
SHARED3PAY NaN
WALK NaN
WALK_COM NaN
WALK_EXP NaN
WALK_HVY NaN
WALK_LOC NaN
WALK_LRF NaN
Name: 19970, dtype: object
Which makes the probabilities for tour 19970 all NaN
Which causes the groupby at line 66 of make_choices in mnl.py to come up one row short:
choices = (s.iat[0] for _, s in pd.Series(cols).groupby(rows))
And so the choices array has one less row than probs.index so the attempr to create a series fails:
return pd.Series(choices, index=probs.index)
ValueError: Wrong number of items passed 1277, placement implies 1278
Tour model shouldn't choose destination TAZ where employment_density is 0?
But this is a school tour, so maybe it is a legitimate destination?
Here is the dest TAZ 1439 land use:
DISTRICT 34
SD 34
COUNTY 9
TOTHH 0
HHPOP 0
TOTPOP 4855
EMPRES 0
SFDU 0
MFDU 0
HHINCQ1 0
HHINCQ2 0
HHINCQ3 0
HHINCQ4 0
TOTACRE 26
RESACRE 0
CIACRE 13
SHPOP62P 0.05
TOTEMP 0
AGE0004 5
AGE0519 98
AGE2044 3598
AGE4564 1050
AGE65P 104
RETEMPN 0
FPSEMPN 0
HEREMPN 0
OTHEMPN 0
AGREMPN 0
MWTEMPN 0
PRKCST 0
OPRKCST 0
area_type 4
HSENROLL 0
COLLFTE 0
COLLPTE 0
TOPOLOGY 3
TERMINAL 7.3102
ZERO 0
hhlds 0
sftaz 1439
gqpop 4855
employment_density 0
total_acres 26
total_households 0
density_index NaN
county_name Marin
household_density 0
county_id 9
total_employment 0
Name: 1439, dtype: object
0 households, population 4855, high school and college population is 0, but group quarters population GQPOP is 4855 - I'm guessing this is either a barracks, a monastery, or a prison?
So probably shouldn't be the destination of a school tour?
San Quentin State Prison?
This is a very good catch. There are a few issues/observations here:
1) This is a destination with no school enrollment, why it was picked as a school tour destination? Probably due to the shadow pricing mechanism, used to enforce consistency between school location choices and enrollment inputs, is not in place in the school location choice model yet?
2) Employment density (population and dwelling unit densities too in the SANDAG case) is used as a proxy to represent urban forms which affect the likelihood of choosing non-motorized modes (bike and walk). Employment density not only affects tour/trip mode choice for work purpose, but also affects trip/tour mode choices for other purposes, including school trips.
3) More importantly, the all NaN probability situation could also happen in other places, such as stop location choices per the SANDAG experience. We put temporary fixes to default choices to an alternative we think is relatively reasonable to get by. However, it would be great to have a more sounding solution to the no alternative is available cases like this.
4) @toliwaga suggestion of checking the sum of probability equal to 1 is a good way to detect (not solve) problems like this.
Crash fixed in #82 but leaving open to remind us to add check for probs not summing to 1 or being NaNs
props currently end up not summing to 1 rather frequently. Specifically, when all the utilities are large negative numbers.
this happens for instance in non_mandatory_scheduling call to vectorize_tour_scheduling when all utilities are between -990 and -999
When this happens, all the exp(utils) are zero and so the conversion to probabilities zero-divides when it does this:
prob_min = 1e-300
prob_max = np.inf
utils_arr = np.exp(utils.as_matrix().astype('float'))
arr_sum = utils_arr.sum(axis=1)
if np.isinf(arr_sum).any():
raise RuntimeError('utilities have infinite values')
np.clip(utils_arr, prob_min, prob_max, out=utils_arr)
np.divide(
utils_arr, arr_sum.reshape(len(utils_arr), 1),
out=utils_arr)
utils_arr[np.isnan(utils_arr)] = prob_min
np.clip(utils_arr, prob_min, prob_max, out=utils_arr)
Moving the computation of arr_sum to after the clip makes mathematical sense and appears to solve the problem.
Anyone see any obvious reason not to make this change?
Note: this wasn't getting caught because numpy.divide() wasn't throwing a zero-divide error. This is apparently a pandas feature.
Running this:
import numpy as np
print "np.geterr()", np.geterr()
import pandas as pd
print "np.geterr()", np.geterr()
Yields this:
np.geterr() without pandas: {'over': 'warn', 'divide': 'warn', 'invalid': 'warn', 'under': 'ignore'}
np.geterr() with pandas: {'over': 'ignore', 'divide': 'ignore', 'invalid': 'ignore', 'under': 'ignore'}
Why probabilities are not summing up to 1 frequently? If this happens only when the util has large negative values like -999, it is kind of understandable. If this also happens to util with values not 'too' negative (such as -200), then it is a significant issue, because it is likely the -200 is a legit choice util, but then the math engine can't get the probabilities computed right. Is this a precision issue in Python when it come to mathematic computation?
For the clipping, what if we have a case like this:
Two alternatives available. Alternative 2 is meant to be blocked, so the utility is assigned a value -999.
before clipping: util1=-100, util2=-999 after clipping: util1=-100, util2=-300 (if I understand the clipping correctly)
Now, both choices will have a valid probability computed for them. But then the Monte Carlo could well pick alternative 2, which was really meant to be a non-valid choice.
The existing CT-RAMP has the same issue of probabilities not summing up to 1. One odd case we had was something like this: Choice 1 is assigned with a probability 0.95, choice 2, 3, and 4 are blocked with -999 utils. In Monte Carlo, if the generated random number <0.95, then it is ok and alternative 1 is chosen. But, there are times the random number is >0.95 and no alternative can be chosen and caused model bombing.
The large negative values are like -998, -997 not -200
In general, every choice should have at least one alternative available. If not, then the problem is usually due to a data issue (often in the skims) in combination with an error in expression definitions. Ultimately we need to log when a choice cannot be made and exit the simulation.
Also, it looks like we need to review/audit the expressions to ensure we don't get any -998, -997s since these are probably the result of errors in the expression definitions. In CT-RAMP, once a -999 was encountered, the rest of the expressions in the UEC were skipped and the alternative was turned off. We need to think some more about how best to incorporate the treatment of NAs in the system.
It seems to me there are two different issues: 1) probabilities not summing up to 1 2) all choice alternatives are blocked with larger negative utils
For issue 2, as @bstabler mentioned, it is caused by data, we need to log the error with informative info so the users know where to check data.
For issue 1, a lot of times it happens when one or more of the alternative is blocked with large negative utils. We might be able remove the blocked alternatives from the alternative set, only leave valid alternatives in the set, and it might help to resolve sum not 1 issues.
Regarding mode choice nesting coefficients, we also have to be careful about how they are applied
alogit style:
Unest = ln(sum(exp(U))
or
utility style (McFadden)
Unest = 1/lambda * sum (exp(lambda * U))
and whether there are any "dummy" nodes to correct for the problems of alogit style
as @wusun2 points out, the probabilities are not summing to 1 since the alternatives with -999 are being included in the calculation of probability. What we need to do is specify a global NA value (such as -999) and then if an expression returns that value for an alternative, then we remove that alternative before calculating probabilities. So we have two issues to fix and I'm going to create new issues for them so we can have a more specific discussion about just those issues. See #93 and #94.
I'm going to close this issue now since we have a complete run. I'm sure we'll run into additional issues, but we'll make those individual issues instead.
Runtime error in tour_mode_choice_simulate running the full zone dataset with 12000 HH sample and random seed of 0
settings.yaml:
simulation.py:
Stack trace on error: