Taxi Analysis - Githubissues

JestonBlu commented 7 years ago

@NancyDrew484 @nitroys @rmglazner

I spent a good bit of time this weekend looking at the taxi cab data and I have come across an issue with it. I don't think we are going to be able to use logistic regression to predict whether the cab driver was tipped. When I looked at the data I noticed that there is no tip data for almost anyone who paid cash for the fare. I confirmed this by looking at the documentation.

I fit some preliminary models: one using logistic regression, with cash customers removed and the response equal to a binary indicator of whether the cab driver was tipped, and one using binomial regression, where the response is the proportion of tip to the total fare. Both models were worthless.
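
As a rough sketch of how the two responses described above could be built, the snippet below drops cash trips and derives a tipped indicator and a tip-to-fare proportion. The field names (`payment_type`, `fare_amount`, `tip_amount`) and the three sample rows are assumptions for illustration, not the actual columns or values in our data.

```python
# Toy rows with hypothetical field names; the real taxi columns may differ.
trips = [
    {"payment_type": "card", "fare_amount": 10.0, "tip_amount": 2.0},
    {"payment_type": "cash", "fare_amount": 8.0, "tip_amount": 0.0},
    {"payment_type": "card", "fare_amount": 20.0, "tip_amount": 0.0},
]


def build_responses(trips):
    """Drop cash trips, then return (tipped indicator, tip / fare) per trip."""
    rows = []
    for trip in trips:
        if trip["payment_type"] == "cash":
            continue  # tips are not recorded for cash fares
        if trip["fare_amount"] <= 0:
            continue  # avoid dividing by a zero or negative fare
        tipped = 1 if trip["tip_amount"] > 0 else 0
        proportion = trip["tip_amount"] / trip["fare_amount"]
        rows.append((tipped, proportion))
    return rows


responses = build_responses(trips)
```

The first element of each pair would be the logistic regression response and the second the binomial regression response.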

I tried Shannon's original idea and fit a linear regression model on the tip amount. The fit was much better, but with an R^2 in the 50% range it is still not great. So far I have just used the following predictors:

month
rate_code
pickup_time
passenger_count
trip_distance
fare_amount
toll (binary indicator)

At this point I would recommend regular linear regression, although we might try a few other methods that Dr. Akleman has gone over.
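
For reference, here is a minimal sketch of how R^2 is computed for a least-squares fit. It uses a single predictor and made-up numbers purely for illustration; the actual model above uses several predictors and was not fit this way.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Synthetic values, not taken from the taxi data.
fare = [5.0, 10.0, 15.0, 20.0]
tip = [1.0, 1.5, 3.5, 4.0]
intercept, slope = fit_simple_ols(fare, tip)
r2 = r_squared(fare, tip, intercept, slope)
```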

Probably the most interesting thing about this data is the location data. I've been messing around with some of the pickup and dropoff locations to see if there are any significant differences that we could investigate.

I was thinking that if we could break the locations down into high-level districts, then maybe we could use pickup or dropoff locations as predictors and then potentially try logistic regression again.
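
One simple way to get coarse districts, sketched here as an assumption rather than what we will necessarily do, is to snap each coordinate to a fixed-size latitude/longitude grid and treat each cell as a class level. The `cell_deg` size and the sample coordinates are hypothetical.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=0.01):
    """Map a coordinate to an integer (row, col) cell of side cell_deg degrees."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


# Illustrative pickup coordinates in New York, not rows from the data.
pickups = [(40.7484, -73.9857), (40.7489, -73.9851), (40.7061, -74.0087)]

# Count trips per cell; each distinct cell would become one district level.
counts = Counter(grid_cell(lat, lon) for lat, lon in pickups)
```

A larger `cell_deg` gives fewer, bigger districts, which keeps the number of class levels manageable.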

Here are a few density plots I have made with the taxi data on top of a map of New York

nitroys commented 7 years ago

I've uploaded another version of the interaction models pdf. I'm not really sure which one is best here though...hoping you all can give some feedback.

The first model includes every possible interaction of the fixed factors. The five-way interaction is significant, so I didn't remove any of the lower-level interactions. However, a lot of the estimates are missing, and I'm not sure why there aren't enough degrees of freedom for those levels. Any ideas?
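
One thing that might be worth checking, as an untested guess rather than a diagnosis, is whether some combinations of factor levels never occur in the data, since empty cells can leave interaction estimates non-estimable. The sketch below lists the combinations that are missing. The factor names follow the predictors discussed in this thread, but the rows are made up.

```python
from itertools import product


def empty_cells(records, factors):
    """Return the factor-level combinations that never occur in records."""
    levels = {f: sorted({r[f] for r in records}) for f in factors}
    observed = {tuple(r[f] for f in factors) for r in records}
    possible = set(product(*(levels[f] for f in factors)))
    return sorted(possible - observed)


# Made-up rows; the real data has more factors and many more levels.
records = [
    {"month": 1, "rate_code": 1, "toll_ind": 0},
    {"month": 1, "rate_code": 2, "toll_ind": 1},
    {"month": 2, "rate_code": 1, "toll_ind": 0},
    {"month": 2, "rate_code": 1, "toll_ind": 1},
]
missing = empty_cells(records, ["month", "rate_code", "toll_ind"])
```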

I got the second model by including only up to three-way interactions and then removing insignificant interactions. The last model is the one Joseph had coded above. The AIC is smallest on the first model, but I don't like all the missing values.

My code is there too, it's cab3.sas.

rmglazner commented 7 years ago

I really like the random effects maps! I will add them to the report later today. Should I move forward with recording a draft of the presentation today as originally planned? I am not sure which model is considered our final model at this point, so I do not know which information to focus on for that.

profgeraci commented 7 years ago

Hi all. It looks like we are nearing consensus on the models. I have some time today to work on the report (thanks for getting it started, Rachael). I see that there are 3 PROC GLIMMIX reports in the latest PDF file - later today I will print them all out.

Are we still going to present a FULL model, then a REDUCED model? Or are we going to do more than that?

It seems like Rachael can't really record a presentation until we decide which models we're going to present.

JestonBlu commented 7 years ago

I'm tweaking the models Shannon posted right now, so I wouldn't bother adding them to the report yet. I'll post my feedback soon, around 1pm CST. I think we should present what we tried for the full model, but not go into any details, and just focus on the reduced model.

JestonBlu commented 7 years ago

Okay, I have made some edits to the models. The main things I changed: I made passenger_count a class variable, and I removed distance from all of the interactions so that only the class variables have interactions. This appears to make the model diagnostic plots look much better.

I have uploaded two files with PDF output (Final_Full_Model, Final_Reduced_Model). The full model is the same model Shannon listed, only with the changes I made above. For the reduced model, the 4-way interaction was insignificant so I removed it, then I removed each of the 3-way interactions except for the one that was significant, then I stopped.

Now a lot more of the LSMeans tables are populated. I think the issue before was that the N-way interactions were causing the main effects and 2-way interactions to be confounded and non-estimable.

Since there are so many large estimate tables, I think we should focus the presentation on the smaller main effects tables or simple two-way interactions, for example toll_ind and rate code.

The only issue I have is that the full model has a lower AIC than the reduced model, and I think that may come up. I wonder if that is a byproduct of having a lot of data and relatively few predictors.

What does everyone think about these models? The code is cab4.sas

rmglazner commented 7 years ago

I will wait until this afternoon to create the presentation, so I will have something uploaded later tonight. This will give everyone time tomorrow to watch it before we meet again. What time works for everyone to meet tomorrow afternoon or evening?

JestonBlu commented 7 years ago

Same time as last week, 7pm CST, works for me.

rmglazner commented 7 years ago

The models look good! The reduced model still contains some insignificant interactions. Is that okay? Specifically, toll*passenger and month*rate. Passenger is significant individually, but toll, month, and rate are not. I am assuming those predictors were kept in the reduced model because of their other significant interactions.

rmglazner commented 7 years ago

7 central time works for me too!

profgeraci commented 7 years ago

Yes, 7pm CT is good (8pm ET).

JestonBlu commented 7 years ago

I've updated the significant effects map with the latest model. I've also added a chart of significant effects for dropoff hour, which seemed to be significant.

Most of the original statements about the significant locations were still true after the model revision, but a few have changed, so here is the updated list:

There are more significant dropoff locations than pickup locations.
High pickup effect locations are relatively close to the United Nations HQ and Nomad District.
Low pickup effect locations are areas around the Empire State Building and Hunters Point on Long Island.
High dropoff effect locations are near NYSE, N Bronx, Williamsburg Bridge Area, Hudson River Park, East Harlem, JFK Intl Airport.
Low dropoff effect locations are near NY University, Hudson Yards Railway Station.

JestonBlu commented 7 years ago

@rmglazner yeah, since the higher-order interactions were significant, I believe you aren't supposed to remove the lower ones even if they show as being insignificant. I'm not 100% sure if that's a hard rule. The main reason I didn't remove any more is because of the AIC/BIC scores. Since the full model has the lowest score I think it's technically the best model, so in this case I think it's better to leave in as many variables as possible once the highest-order interactions are significant.

rmglazner commented 7 years ago

That makes sense, thank you for explaining!

nitroys commented 7 years ago

I feel like the Type III estimates table for the full model is still weird. I don't like the infinite F values and the p-values of 1. I'm not sure why it's happening, but I'm guessing it has something to do with the number of interactions in the model and whatever the Kenward-Roger calculation does. So I lean toward the reduced model, despite the higher AIC. The corrected AICs are the same as the regular AICs here, which is strange to me, so I'm not sure how much I trust them.

I second what Joseph says about keeping the lower-level interactions when the higher-level interaction is significant. It's similar to keeping insignificant main effects when the interaction is significant, just a few levels up.
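
To make that rule concrete, here is a small sketch that lists every main effect and lower-order interaction implied by one retained interaction. The three-way term used here is hypothetical; the thread does not say which three-way interaction was the significant one.

```python
from itertools import combinations


def required_lower_terms(interaction):
    """Main effects and lower-order interactions contained in one interaction."""
    factors = sorted(interaction)
    terms = []
    # Orders 1 .. (k - 1): every proper, non-empty subset of the factors.
    for order in range(1, len(factors)):
        terms.extend(combinations(factors, order))
    return terms


# Hypothetical significant three-way interaction.
kept = required_lower_terms({"month", "rate_code", "toll_ind"})
```

Under this rule, a term like month*rate_code would stay in the model even if its own p-value is large, because it sits inside the retained three-way term.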

So overall, I think the reduced model is better for our purposes. The residuals look similar in each, and I don't put a lot of stock in the AIC values, especially when they're only roughly 1% lower in one model. Also, 7pm CST tomorrow works for me.

JestonBlu commented 7 years ago

I think AIC and AICc converge as the number of observations increases. I agree with everything Shannon has said.
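
A quick sketch of why that happens, using the common textbook form of the small-sample correction, AICc = AIC + 2k(k+1)/(n - k - 1). The exact definitions of n and k that SAS uses for our models may differ, and the log-likelihood and parameter counts below are made-up numbers.

```python
def aic(log_lik, k):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * log_lik


def aicc(log_lik, k, n):
    """AIC plus the small-sample correction term for n observations."""
    return aic(log_lik, k) + (2 * k * (k + 1)) / (n - k - 1)


# Same hypothetical fit, evaluated with a small and a very large n.
small = aicc(-100.0, 10, 50) - aic(-100.0, 10)
large = aicc(-100.0, 10, 1_000_000) - aic(-100.0, 10)
```

The correction shrinks toward zero as n grows, so with a data set as large as ours the two criteria would print as essentially the same number.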

rmglazner commented 7 years ago

Based on this information, I will move forward with the reduced model. I have a meeting for another group project at 5:00 today, so I will start working on our presentation after that. I should have a video up tonight, which gives everyone time before tomorrow evening's meeting to view it and think of any changes to be made!

JestonBlu / Neighbor-Works

Taxi Analysis #16