JestonBlu / Neighbor-Works

Masters Project: Statistical Research

Taxi Analysis #16

Closed JestonBlu closed 5 years ago

JestonBlu commented 7 years ago

@NancyDrew484 @nitroys @rmglazner

I spent a good bit of time this weekend looking at the taxi cab data and I have come across an issue with it. I don't think we are going to be able to use logistic regression to predict whether the cab driver was tipped. When I looked at the data I noticed that there is no tip data for almost anyone who paid cash for the fare. I confirmed this by looking at the documentation.

[screenshot: excerpt from the taxi data documentation on cash tips]

I fit some preliminary models: one using logistic regression, with the response a binary indicator of whether the cab driver was tipped (cash customers removed), and one using binomial regression, with the response the proportion of tip to the total fare. Both models were worthless.
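For reference, the two setups looked roughly like this in R (a sketch only; the file name, the payment_type code, and the predictors here are placeholders, not the exact ones I used):

cab <- read.csv("cab_sample.csv")            # placeholder file name
cab <- subset(cab, payment_type == "CRD")    # cash fares have no tip recorded

# (1) Logistic regression: was the driver tipped at all?
cab$tipped <- as.integer(cab$tip_amount > 0)
fit1 <- glm(tipped ~ trip_distance + passenger_count,
            family = binomial, data = cab)

# (2) Binomial-style regression: tip as a proportion of the total fare
cab$tip_pct <- cab$tip_amount / cab$total_amount
fit2 <- glm(tip_pct ~ trip_distance + passenger_count,
            family = quasibinomial, data = cab)

summary(fit1); summary(fit2)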

I tried Shannon's original idea and fit a linear regression model on the tip amount, and the fit was much better, but with an R² in the 50s it is still not great. So far I have just used the following predictors:

At this point I would recommend regular linear regression, although we might try a few of the other methods that Dr. Akleman has gone over.

Probably the most interesting thing about this data is the location data. I've been messing around with some of the pickup and dropoff locations to see if there are any significant differences that we could investigate.

I was thinking that if we could break the locations down into high-level districts, then maybe we could use pickup or dropoff locations as predictors and potentially try logistic regression again.

Here are a few density plots I have made with the taxi data on top of a map of New York:

[density plots: taxi trip locations over a New York City map]
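(These are just 2D density layers drawn over a ggmap base map; roughly the following, with column names assumed:)

library(ggmap)

nyc <- get_map("New York City", zoom = 12)
ggmap(nyc) +
  stat_density2d(aes(x = pickup_longitude, y = pickup_latitude,
                     fill = ..level..),
                 data = cab, geom = "polygon", alpha = 0.3)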

profgeraci commented 7 years ago

Looks like a good start, Joseph. Could you share the SAS code that you're using, so that we can all play around with the models?

I've filled in a basic document structure in the Project Document that Shannon established and shared on Google Docs: https://docs.google.com/document/d/10ZL-BVg78XcBX92qKh3xLWf0zxTctK7ROLLHT0NNOpM/edit?ts=58f005c6

I'll need more details about exactly what data we extracted (dates) and how it was sampled (I think you said a stratified sample).

rmglazner commented 7 years ago

Thank you, Anne and Joseph! Is the Google document you linked to the one that I originally shared, or did Shannon create a separate one? I just want to check so that I write everything in the correct place!

rmglazner commented 7 years ago

I added a short description of background and methods, but I am not sure what else to add. Thoughts?

profgeraci commented 7 years ago

If you use the link above, you'll see the correct document. I can see the text that you added this morning, Rachael. If you click on the words "last edit was xxx minutes ago" near the top of the screen, you can see each person's contributions or edits. Looks like the document is working.

Now we just need to figure out what to write.

rmglazner commented 7 years ago

Great, thank you! The good news is that for this project we have an 8-page limit, so there is more room to add models and to write compared to part 1 of the project.

JestonBlu commented 7 years ago

I just posted a short R script that I used to do some exploratory analysis in the new folder called Taxi.

profgeraci commented 7 years ago

Interesting article in the NYT this morning:

https://nyti.ms/2pwX3Jr

rmglazner commented 7 years ago

Thanks for sharing! An interesting story about tips: my first job was at Dunkin Donuts. There was a tip jar that customers could add money to if they wanted. I soon discovered that the policy of the Dunkin Donuts I worked at was that you had to report your tips so that they could be deducted from your paycheck... If customers knew that, they probably wouldn't tip at all, since that money did not go to the crew members.

If possible, we should try to have some results completed by Friday so that I have time to put together a presentation and record it before Monday. That way we can maybe meet Monday and you can provide feedback for the video. I will create a blank Google presentation document and share it with you all today.

rmglazner commented 7 years ago

I have just shared the presentation document with you all. Here is the link as well: https://docs.google.com/a/tamu.edu/presentation/d/1skXgxYvMRobi20H31xHaBal5kLPRN9Ni310mq-0ZZy8/edit?usp=sharing

profgeraci commented 7 years ago

Hmmmmm, this data doesn't look right. When I do descriptive statistics on the DTA (original) data I get:

[screenshot: descriptive statistics for the DTA data]

The average TOTAL_AMOUNT is 0.271 ($0.27?).

..... or on the CTA (processed) data:

[screenshots: descriptive statistics for the CTA data]

... the average tip_pct is 39.965, but I don't see a column for the total amount of the trip.

What am I missing here?

nitroys commented 7 years ago

I'm not sure? When I run proc univariate on the sample I posted in the code, I get 15.19 for the mean total amount, 1.5 for the mean tip amount, and .087 for the average tip percentage of the total amount.

JestonBlu commented 7 years ago

I have done a lot of filtering of the original data to get rid of outliers and potentially bad records. I narrowed the data set down to only transactions paid by credit card, because those are the only records with tip data.

I think we should really try to use the location data, so I have been working on breaking the lat/lon points up into groups. So far I have used k-means clustering to associate each location with a geographic center. This plot shows the centers the algorithm has chosen based on the density of the data.

[map: k-means cluster centers over the pickup density]
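(The clustering step itself is just base R kmeans; a sketch, with column names assumed and 50 centers as an example:)

set.seed(42)
pts <- cab[, c("pickup_longitude", "pickup_latitude")]
km  <- kmeans(pts, centers = 50, nstart = 10)

cab$pickup_location_id <- km$cluster          # cluster id for each trip
pickup.centers <- as.data.frame(km$centers)   # lon/lat of each center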

Next, I've done some geocoding using the Google API so that I can assign a physical address to each of the points and associate a pickup or dropoff location with a general area. I think that will allow us to use general locations to see if location plays a role in tip amount, or maybe go back to logistic regression and look at the probability of a passenger tipping.
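(The geocoding is just a reverse lookup on each cluster center; something along these lines with ggmap, as a sketch. Note that newer ggmap versions need register_google() with an API key:)

library(ggmap)   # revgeocode() wraps the Google reverse-geocoding API

pickup.centers$address <- apply(pickup.centers, 1, function(p) {
  revgeocode(c(p["pickup_longitude"], p["pickup_latitude"]),
             output = "address")
})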

I should be able to get a prepped data set into SAS tomorrow. So far I've been mainly working in R because that's my wheelhouse. I think some of these visuals might make our presentation look pretty good too.

Here is another map that shows the clusters of locations I am planning on using. There are different clusters for pickup and dropoff locations, so this one just shows the pickup locations...

What do you guys think?

[map: color-coded pickup location clusters]

rmglazner commented 7 years ago

The map with the color-coded grouping is really impressive! I will add it to the project document. I think that narrowing down the data for the reasons you have written here is fine. Can you also upload an image of the drop-off locations? I agree that these maps will make our project look great!

nitroys commented 7 years ago

I think this is super cool. I've never done any geographic assignment stuff like this, so I'm really impressed. I think we could also consider using pickup locations to predict trip distance or total fare amount, too. They might be cleaner response variables than tip, and we could use a random coefficient model where pickup location is a random effect.
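(In R, the idea would look something like this lme4 sketch, with column names assumed; letting the distance slope vary by cluster is what makes it a random-coefficient model:)

library(lme4)

fit <- lmer(total_amount ~ trip_distance + passenger_count +
              (1 + trip_distance | pickup_location_id),
            data = cab)
summary(fit)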

JestonBlu commented 7 years ago

I'm good with that, Shannon; might as well try anything that looks interesting.

Here is the same plot with the dropoff locations.

[map: color-coded dropoff location clusters]

As an example, here are the first couple of cluster locations that I pulled from Google. If you want to check out the R code, I have updated my exploratory.R script.

> head(i.cen)

  dropoff_latitude dropoff_longitude                                 dropoff.centers dropoffID
1         40.69272         -73.92308         10 Bleecker St, Brooklyn, NY 11221, USA         1
2         40.71593         -73.95394        26 Havemeyer St, Brooklyn, NY 11211, USA         2
3         40.80406         -73.93871         119 E 124th St, New York, NY 10035, USA         3
4         40.77104         -73.98218          1887 Broadway, New York, NY 10023, USA         4
5         40.76621         -73.95610          411 E 70th St, New York, NY 10021, USA         5
6         40.76574         -73.91986 33-02 30th Ave, Long Island City, NY 11103, USA         6

rmglazner commented 7 years ago

Great, thank you! I added the drop-off image to the project report.

JestonBlu commented 7 years ago

Rachael, is there a place where you want me to save some of these plots? I'll probably tweak them over the next couple of days to make them look better.

rmglazner commented 7 years ago

It has honestly been the easiest when you upload the images in this discussion thread (as long as no one else minds!)

JestonBlu commented 7 years ago

Okay, I went ahead and created a folder in Taxi/Plots to store the files there just in case. This might be a better one, showing the two side by side... I have also removed the tick marks and text, and centered the titles.

[side-by-side maps: pickup and dropoff clusters]

rmglazner commented 7 years ago

Thank you!

JestonBlu commented 7 years ago

I have added some data and an initial SAS script to the Taxi folder. I started to try some simple models, but I figured I would get all of this posted so you all can try your own models with the filtered data.

I thought maybe a beta regression would be good if we are looking at the response being the proportion of tip to fare, but SAS is timing out on my PC when I try to run a bunch of the variables. Anne and Shannon, do you two want to try creating some models with the prepped data set? I'm going to continue to do so as well, but I think I will also concentrate on making some more plots for the report and presentation.
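(For reference, the beta-regression idea in R would be the betareg package; a sketch with assumed column names. The response has to lie strictly inside (0, 1), hence the squeeze transform:)

library(betareg)

n <- nrow(cab)
cab$tip_frac <- cab$tip_amount / cab$total_amount
cab$tip_frac <- (cab$tip_frac * (n - 1) + 0.5) / n   # pull 0s and 1s into (0, 1)

fit <- betareg(tip_frac ~ trip_distance + passenger_count + rate_code,
               data = cab)
summary(fit)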

I think it would be neat, if we find some significant locations or combinations of locations, to display that visually on a map as well.

JestonBlu commented 7 years ago

Forgot to mention, the data set in the taxi folder is a CSV. It has about 60K records, and I have also attached the cluster locations. If you want to use them, the variables are pickup_location_id and dropoff_location_id... we can cross-reference back to the address locations for the report and presentation, but I would just use the codes for now.

profgeraci commented 7 years ago

Joseph: I would love to look at the data and the SAS code you've written, but I can't read either of these two files. Am I missing something?

Here's the SAS file: 2017-04-19_1907

And here's the CSV file: 2017-04-19_1907_001

JestonBlu commented 7 years ago

It might be better just to view the script through GitHub. I think we are just using different encodings of the files... it's still a CSV file even if the Windows preview looks funny. I've copied the SAS script below... see if you can run that. You will have to set DTA to the CSV file location on your computer.

%LET DTA = 'cab_final.csv';

proc import datafile=&dta out=cab;
run;

/* Mixed Model with Random Var Dropoff Locations and Times */
ods graphics on;
proc glimmix data=cab plots=studentpanel;
    class month pickup_time dropoff_time toll_ind pickup_location_id dropoff_location_id;
    model tip_amount = trip_distance 
                       passenger_count 
                       month 
                       toll_ind 
                       / ddfm=kr;
    random pickup_location_id dropoff_location_id pickup_time dropoff_time;
    lsmeans month / adjust=tukey;
run;
ods graphics off;

profgeraci commented 7 years ago

OK, I can read the data files now. I'm not sure what the problem was, but I think one of my computers was corrupting the file.

I think it would be a good idea to have a brief meeting soon (maybe tonight?) so that we can discuss the plan for this project and agree on a model that we want to use.

I'm available tonight or tomorrow night or all weekend. When are you all available?

JestonBlu commented 7 years ago

I can make anytime work. Whatever is best for everyone else.

rmglazner commented 7 years ago

Can we meet at 7 tonight?


profgeraci commented 7 years ago

Yes, Rachael. 7pm tonight CT works - that's 8pm ET, right? Is that ok with everyone?

nitroys commented 7 years ago

I can do 7CST, but not 7EST.

JestonBlu commented 7 years ago

Works for me.

JestonBlu commented 7 years ago

I'll be a few minutes late though.

nitroys commented 7 years ago

Thanks for posting everything, Joseph. I pulled down the data and the code and played around with a few additional models. I've added my code and PDF output to the taxi folder.

I think the models predicting tip amount are the most interesting, personally. We can discuss more tonight, but I wanted to get these up for you guys to look at before then.

rmglazner commented 7 years ago

I keep forgetting that we are in different time zones, sorry! Yes, 7:00 central time works for me.

Google Hangouts work again?

profgeraci commented 7 years ago

Yes, that's great, Rachael.

See everyone at 7pm CST (8pm EST) on Google Hangouts

JestonBlu commented 7 years ago

I've uploaded a new version of the taxi data that includes the variable rate_code... I noticed when doing some distribution plots that there appear to be some significant differences in tip amount across rate codes.

[distribution plots: tip amount by rate code]

profgeraci commented 7 years ago

Here's the correlation and covariance table if you want to use it in the document or the presentation. 2017-04-20_2047

nitroys commented 7 years ago

Hi all,

I've added more models and my output for the best models to the analysis folder. Even though we found fare amount to be more highly correlated with tip amount, it looks like distance gives us a more interesting model. I think this might be because distance is less correlated with the other predictors, whereas fare amount would be related to whether there were tolls, the rate code, etc. There are some interesting interactions going on too.

The second model in that pdf removes the pickup and dropoff time effects, because the standard error for the pickup effect was larger than the variance estimate, so I was concerned about that.

Let me know what you all think, and if you want to see the other models I tried, my SAS code is there as well. The PDF with the output from all of those was too big to upload :)

JestonBlu commented 7 years ago

Thanks Shannon. Can you double-check that you uploaded the new PDF? I see the edited SAS code, but the current PDF in the Taxi folder looks like it's the same one from yesterday.

nitroys commented 7 years ago

Did you check the analysis folder? I put it there instead.

JestonBlu commented 7 years ago

Thanks, I think this looks good. Just a couple of questions. Do you think it would be better to make passenger count ordinal, so that each coefficient compares adding one more passenger rather than comparing each level to 6 passengers? It would also make sense to me then to make 1 passenger the base level.

Could you also generate the predictions with Pearson residuals and measure the variance of those residuals, so we can get an estimate of goodness of fit for the full and reduced models?
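(By Pearson residuals I mean the usual standardized residuals, $r_i^{P} = (y_i - \hat{y}_i)/\sqrt{\widehat{\operatorname{Var}}(y_i)}$; their sample variance should be close to 1 when the model's variance assumptions hold, which is the rough goodness-of-fit check I have in mind.)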

Also, would you be able to save the individual estimates and significance of the random location effects in a SAS dataset? I would like to try to show the significant locations on the map somehow.

rmglazner commented 7 years ago

I think changing passengers to ordinal is a good idea.

profgeraci commented 7 years ago

That's a great idea to show the significant (high-tipping, right?) locations visually. Very eye-catching.

(Who says that statistics can't be sexy?)

nitroys commented 7 years ago

Okay! Thanks for the comments. I agree about passenger count - I had switched it to see each level in the solutions, but I agree the interpretation is better when it's ordinal. Our Pearson residuals look good in both models, just slightly better in the second. You can see them in the updated best_models.pdf. I've also added two SAS files with the solutions to the random effects for both models.

JestonBlu commented 7 years ago

Thanks Shannon. Something I noticed about the full and restricted models: the full model has a better (lower) AIC than the restricted one, which is an indication that our smaller model is no better.

Also, I'm thinking that because we have so much data, we should probably try a lot more of the interactions. I seem to remember Dr. Akleman saying that if you have enough data, you should try all of the interactions you can.

I was messing with your code a bit this morning and I saw that the log transformation is kicking out records where there was no tip (log(0) is undefined, so zero-tip trips get dropped). I made an adjustment and added 1 to tip before the transformation, and that seemed to make the model diagnostic plots look a lot better. It also made the residual variance a good bit smaller... but it also looks like it made dropoff time worth keeping.

The code I played with is below. I'm having trouble running the complex interactions on my machine. Would you be able to play with this and see what you get? Also, I was wondering if we could use an interaction term between the pickup and dropoff locations... I think that would be interesting, but it may take too much computing time. My computer chokes when I try to run it, but I'm running SAS off of a remote server from work, so it may just be me.

Let me know what you think.

%LET DTA = 'cab_final.csv';

proc import datafile=&dta out=cab replace;
run;

*proc contents data = cab; run;

data cab;
    set cab;
    log_tip = log(tip_amount + 1);
    log_dist = log(trip_distance);
    log_fare = log(fare_amount);
run;

/* Mixed Model with Random Var Dropoff Locations and Times */
ods graphics on;

title 'Log Model-response and distance logged, month and toll interactions-- Best Model';
proc glimmix data=cab plots=studentpanel;
    class month pickup_time dropoff_time 
          toll_ind pickup_location_id dropoff_location_id 
          rate_code passenger_count;
    model log_tip = log_dist 
                       passenger_count 
                       month 
                       toll_ind
                       rate_code
               passenger_count*month
               passenger_count*toll_ind
               passenger_count*rate_code
               passenger_count*log_dist
               month*toll_ind
               month*rate_code
                       month*log_dist
               toll_ind*rate_code
                       toll_ind*log_dist
               rate_code*log_dist
               passenger_count*month*toll_ind
               passenger_count*month*rate_code
               passenger_count*toll_ind*rate_code
               month*toll_ind*rate_code
                       / ddfm=kr;
    random pickup_location_id 
            dropoff_location_id 
            pickup_time 
            dropoff_time;    
run;

title 'Best Model with time random effects removed';
proc glimmix data=cab plots=studentpanel;
    class month toll_ind pickup_location_id dropoff_location_id rate_code passenger_count ;
    model log_tip = log_dist 
                       passenger_count 
                       month 
                       toll_ind
                       rate_code
                       month*log_dist
                       toll_ind*log_dist
               month*toll_ind*passenger_count
                       / ddfm=kr solution;
    random pickup_location_id dropoff_location_id pickup_location_id*dropoff_location_id dropoff_time;
    output out=PRED pred=p resid=r pearson=presid;
    lsmeans passenger_count / oddsratio adjust=tukey cl;
    lsmeans month / oddsratio adjust=tukey cl;
    lsmeans toll_ind / oddsratio adjust=tukey cl;
    lsmeans rate_code / oddsratio adjust=tukey cl;
run;

PROC UNIVARIATE data=pred;
var presid;
run;

ods graphics off;

nitroys commented 7 years ago

Hmm, I'm having trouble too. I can try on Monday when I get back to my work computer, as I'm remoting in to our server right now too.

JestonBlu commented 7 years ago

It might just be that there are too many levels in the 50x50 pickup*dropoff interaction in the random statement. We might not be able to test that. What if you rip that part out for now?

nitroys commented 7 years ago

Okay, yeah, taking that part out made it work immediately. I called the output interaction_models.pdf; it's in the analysis folder. There are a few additional significant interactions here. I don't have time today to really go in and play around with more, but I might tomorrow night.

JestonBlu commented 7 years ago

Sounds good, can you post the code you used to save the random variable estimates as a separate dataset?

nitroys commented 7 years ago

ods listing close;
ods output SolutionR = out.RandomEffectsReduced;

title 'Best Model with time random effects removed';
proc glimmix data=cab plots=studentpanel;
    class month toll_ind pickup_location_id dropoff_location_id rate_code;
    model log_tip = log_dist passenger_count month toll_ind rate_code
                    month*log_dist toll_ind*log_dist / ddfm=kr solution;
    random pickup_location_id dropoff_location_id / solution;
    output out=gmxout pred=pred pred(ilink)=predmu pearson=pearson;
    lsmeans month*log_dist / slice=month slicediff=month adjust=tukey;
run;

ods listing;

It's the ods output statement that saves the table (the ods listing close just suppresses the printed output). You can find the name of every table you can export from a procedure by running ods trace on before the procedure, and then ods trace off after it. It prints the names of the tables to your log. That's how I found that the one I wanted was called SolutionR.

JestonBlu commented 7 years ago

Okay, here is what I have come up with so far for displaying the random effects estimates on the map. I have two versions: one that shows the whole area, and one that zooms in on Manhattan, because that's where the bulk of the data is.

My interpretation of the random estimates is in percent, because the response variable is logged. Right now I'm using Shannon's model that she used to generate the random effects estimates. If we make a refinement to the model, all I need is that same table reproduced, and I will run it through the code I have written to generate the maps.

My understanding with the random effects is that there is no base level, so the percent change I am showing is the predicted tip scaled by the random effect estimate. So a value of .2 is roughly a 20% tip increase vs. an insignificant location. Does that make sense?
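(To be precise about the back-transformation, assuming I have the model right: since the response is $\log(\text{tip} + 1)$, a location effect $u$ adds on the log scale and multiplies on the original scale,

$$\log(\text{tip} + 1) + u \;\Longrightarrow\; (\text{tip} + 1)\, e^{u}, \qquad e^{u} \approx 1 + u \ \text{for small } u,$$

so $u = 0.2$ is roughly a 20% increase, exactly $e^{0.2} - 1 \approx 22\%$, relative to a location with $u = 0$.)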

[maps: random effect estimates by location, full area and Manhattan zoom]

A couple of interesting takeaways...

I can refine these more if we do any more model tweaking.