Zero Inflated Count Ride-Hailing Ridership Data

alex-mucci commented 3 years ago

The level of aggregation makes a big difference in the number of zeros in the dependent variable. I’m afraid that the high number of zeros could cause the model to find some weird relationships.

alex-mucci commented 3 years ago

I think I should estimate a model with the most disaggregate data first and see what the results look like. The model might do a good job even with the high number of zeros. - Alex to do

alex-mucci commented 3 years ago

After reading the email from Dr. Erhardt below, I should start with a Poisson model because ride-hailing use is count data. I have emailed him to get a license for Stata and will test the data for overdispersion. The model structure will likely need to be tweaked because the data are skewed towards zero, but I will test for that skewness and cross that bridge when I get there.
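A quick first look at overdispersion, before Stata is available, could be the ratio of the sample variance to the sample mean of the trip counts. Under a Poisson distribution the two are equal, so a ratio well above 1 hints at overdispersion. This is only a rough, unconditional check (a proper test would condition on the model covariates), and the trip counts below are made-up numbers, not the Chicago data.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return sample variance divided by sample mean.

    A Poisson variable has variance equal to its mean, so values
    well above 1 suggest overdispersion (unconditionally).
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero; the index is undefined")
    return variance(counts) / m


# Hypothetical monthly trip counts for a handful of o-d pairs.
trips = [0, 0, 0, 0, 1, 0, 2, 0, 0, 15, 0, 3]

ratio = dispersion_index(trips)
share_zero = trips.count(0) / len(trips)

print(f"variance / mean = {ratio:.2f}")
print(f"share of zeros  = {share_zero:.2f}")
```

The share of zeros is printed alongside the ratio because the level of aggregation changes both at once.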

I think this applies to several of you. Vedant in particular, I recommend this for your SF crash models. My general guidance is this.

For continuous variables where negative values are ok: use ordinary least squares regression.
For discrete variables: use a logit model. Usually a multinomial logit.
For non-negative values: use Poisson estimation, especially if they are skewed. This is the focus here:
a. It estimates y = exp(…), which is almost the same as ln(y) = … but has some advantages that sometimes matter (see the blog post). If you have a lot of values that are zero or close to zero, the two should give different results. If you don’t, the two options should be pretty close (you can test this though).
b. You need to add a special command to get the right standard errors and t-statistics.
c. It works well with panel data. If you’re using panel data, use fixed effects, not random effects.
d. Rarely do you want to worry about the negative binomial model. Usually it’s not better.
e. Usually you only want to use a hurdle model (such as a zero inflated negative binomial) if there is some actual opt-in option, not just because your data happen to have a lot of low-probability cases.
Don’t stress about fancy statistics unless you have to. Do stress about making sure the data are correct, clean, reproducible.
Python is great for data manipulation, but it lags in statistics. If you need some specific statistical method, use R or Stata, and you’re likely to find better examples of exactly what to type. (Or SAS or SPSS, but Stata is what I’m ramping back up on and what we’re putting Brandon on. If you need a license, let me know.)
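As a rough illustration of point (a), the sketch below fits E[y] = exp(b0 + b1 * x) by Newton-Raphson on the Poisson log-likelihood, using only the Python standard library. The data are made up, the function name `fit_poisson` is hypothetical, and the robust standard errors from point (b) are not computed here. It also counts how many observations a ln(y) regression could use, since ln(0) is undefined.

```python
import math


def fit_poisson(x, y, iterations=50):
    """Fit E[y] = exp(b0 + b1 * x) by Newton-Raphson (one regressor)."""
    b0 = math.log(sum(y) / len(y))  # intercept-only solution as a start
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information (negative Hessian), a 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system in closed form.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1


# Hypothetical data with several zeros in the dependent variable.
x = [0, 1, 2, 3, 4, 5]
y = [0, 0, 1, 0, 2, 4]

b0, b1 = fit_poisson(x, y)
mu_hat = [math.exp(b0 + b1 * xi) for xi in x]

# A ln(y) regression can only use the strictly positive observations.
positive_only = [yi for yi in y if yi > 0]

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"Poisson uses {len(y)} observations; ln(y) could use {len(positive_only)}")
```

The Poisson fit keeps every zero in the sample, while the log-linear alternative would have to drop them or add an arbitrary constant before taking logs.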

More details are below. Please bookmark these, and if you’re using this in your thesis/dissertation, be able to explain and defend the choice. (For me, “The guy who wrote the econometrics textbook says it’s fine” is good enough, but you may need to say something more meaningful when you write.)

alex-mucci commented 2 years ago

There is an issue with the o-d pairs that do not have ride-hail data. I can make the trip total zero for those pairs but it does not make sense to make variables like travel time zero. I can use travel times for the o-d pair in different months but there are o-d pairs without any ride-hail data for any months.

Should I drop the o-d pairs that have no ride-hail data in any month, and fill in the average travel time across all months when it is missing for one month? Or use OTP free flow travel time instead of the observed travel time?
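The first option in the question above could be sketched as follows: drop o-d pairs with no observed travel time in any month, and fill single missing months with that pair's average over the months that do have data. The record layout, the tract labels, and the function name `fill_with_pair_average` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (origin, destination, month, travel time in minutes or None).
records = [
    ("A", "B", "2019-01", 12.0),
    ("A", "B", "2019-02", None),
    ("A", "B", "2019-03", 14.0),
    ("A", "C", "2019-01", None),
    ("A", "C", "2019-02", None),
]


def fill_with_pair_average(rows):
    """Drop pairs with no data at all; fill gaps with the pair's mean."""
    observed = defaultdict(list)
    for origin, dest, _month, travel_time in rows:
        if travel_time is not None:
            observed[(origin, dest)].append(travel_time)

    filled = []
    for origin, dest, month, travel_time in rows:
        if (origin, dest) not in observed:
            continue  # no ride-hail data in any month: drop the pair
        if travel_time is None:
            travel_time = mean(observed[(origin, dest)])
        filled.append((origin, dest, month, travel_time))
    return filled


filled = fill_with_pair_average(records)
for row in filled:
    print(row)
```

Pairs such as A to C disappear entirely under this approach, which is the trade-off the question raises.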

alex-mucci commented 2 years ago

Here are some facts about the data:

71% of origin-destination-month-mode records in the estimation file are missing RH data
6,041,057 origin-destination-month-mode records have RH data
The share of records missing RH data is consistent across the months of the study period
38% of census tract pairs within Chicago are missing RH data for all months of the study period
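The first two figures imply a rough size for the full estimation file. The arithmetic below is only a back-of-the-envelope check, and because the 71% figure is rounded, the implied totals are approximate.

```python
# Both inputs come from the list above; 71% is a rounded share.
records_with_rh = 6_041_057
share_missing = 0.71

implied_total = records_with_rh / (1 - share_missing)
implied_missing = implied_total - records_with_rh

print(f"implied total records:   about {implied_total:,.0f}")
print(f"implied missing records: about {implied_missing:,.0f}")
```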

alex-mucci commented 2 years ago

There are two options for the tax model:

Drop the o-d pairs that do not have RH data.
- This will cause 71% of the records in the dataset to drop out, but will still leave 6,041,057 records to regress over.
- This will not work for the predictive model
Build cost and travel time models for the records missing data
- Duration = Freeflow * B1 + C
- Fare = Distance * B1 + Duration * B2 + C
- Will use observed duration data to calibrate the model, then will use the estimated duration for the records that do not have RH data
- Freeflow will use the OTP free flow travel time
- Will likely log transform Fare (dependent variable) because there are different types of ride-hail vehicles, some more expensive than others. A percent increase instead of a unit increase should apply better across all vehicle types.
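A minimal sketch of the duration step in the second option: calibrate Duration = Freeflow * B1 + C by ordinary least squares on the pairs with observed durations, then predict duration for the pairs without RH data. The numbers and the helper name `fit_line` are hypothetical. The fare model has two regressors and a log-transformed dependent variable, so it would need a multiple-regression routine that is not shown here.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b1 * x + c with one regressor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    c = y_bar - b1 * x_bar
    return b1, c


# Hypothetical o-d pairs: (free flow minutes, observed duration or None).
pairs = [(10.0, 13.0), (20.0, 25.0), (30.0, 37.0), (15.0, None), (40.0, None)]

# Calibrate on the pairs that have an observed duration.
calibration = [(ff, dur) for ff, dur in pairs if dur is not None]
b1, c = fit_line([ff for ff, _ in calibration], [dur for _, dur in calibration])

# Keep observed durations; use the estimate only where data are missing.
durations = [dur if dur is not None else b1 * ff + c for ff, dur in pairs]

print(f"Duration = Freeflow * {b1:.2f} + {c:.2f}")
print(durations)
```

Observed durations are left untouched, so the estimated values only enter the records that would otherwise drop out.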

Should shared trips have a separate model?
Should I use Fare or Total Cost?

alex-mucci / TNC-Demand-Model

Zero Inflated Count Ride-Hailing Ridership Data #5