Bears-R-Us / ArkoudaNotebooks

place for notebooks and example uses of the Arkouda software package
MIT License

Discussion about NYC Taxi Notebook #1

Open mhmerrill opened 3 years ago

mhmerrill commented 3 years ago

This issue encapsulates discussion about the NYC Taxi data set example using Arkouda. Notebook here

Yellow Trips Data Dictionary

NYC Yellow Taxi Trip Records Jan 2020

NYC Taxi Zone Lookup Table

mhmerrill commented 3 years ago

@bradcray @buddha314 @reuster986 @timothyneumann1 @ben-albrecht @jt-halbert I would love for you guys to chime in with anything. Is there anyone else I should tag?

bradcray commented 3 years ago

I must not be a data scientist, because my head goes to things like "compute mean, median (requires sorting, right?), mode travel times" which seem trivial compared to some of your suggestions. On the other end of the spectrum, my head goes to "Figure out who owns all the taxi medallions and how much they paid for them", though I suspect that's not a task for this dataset. :D

I haven't really taken the time to look through what's in the data sets yet, though. Will try to do that tomorrow, I'm being called to dinner ATM.
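For reference, the mean/median/mode idea could be sketched roughly as below. This is plain Python using only the standard library, not Arkouda, and the timestamp pairs are made-up values rather than rows from the data set. The helper name `travel_minutes` and the timestamp format are assumptions for illustration. The median does depend on ordering the values, which `statistics.median` handles internally.

```python
from datetime import datetime
from statistics import mean, median, multimode

# Hypothetical (pickup, dropoff) timestamp pairs; NOT real trip records.
trips = [
    ("2020-01-01 00:10:00", "2020-01-01 00:20:00"),
    ("2020-01-01 08:00:00", "2020-01-01 08:25:00"),
    ("2020-01-01 09:30:00", "2020-01-01 09:40:00"),
    ("2020-01-01 17:05:00", "2020-01-01 17:50:00"),
]

# Assumed timestamp layout for this sketch.
FMT = "%Y-%m-%d %H:%M:%S"


def travel_minutes(pickup, dropoff):
    """Return the trip duration in whole minutes (hypothetical helper)."""
    delta = datetime.strptime(dropoff, FMT) - datetime.strptime(pickup, FMT)
    return round(delta.total_seconds() / 60)


durations = [travel_minutes(p, d) for p, d in trips]

summary = {
    "mean": mean(durations),
    "median": median(durations),    # orders the values internally
    "mode": multimode(durations),   # list of all most-common values
}
print(summary)
```

On the full data set the same three statistics would presumably be computed server-side on the trip-duration column instead of in a Python list.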

mhmerrill commented 3 years ago

Here are a couple of papers and articles about analysis of the NYC Taxi data.

Anonymizing NYC Taxi Data: Does It Matter?

Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance

ben-albrecht commented 3 years ago

Do these datasets contain the same fields as the data from the NYC taxi Kaggle competition?

We might find some interesting ideas in those notebooks.

mhmerrill commented 3 years ago

I think it is the same data. Thanks for the links.

mhmerrill commented 3 years ago

20200925: I updated the notebook a bit and uploaded HTML and PDF versions of the notebook with output.

mhmerrill commented 3 years ago

@hokiegeek2 I forgot to include you on this.

bradcray commented 3 years ago

I like the idea of looking at the notebooks BenA pointed to. Left to my own devices, and looking a bit at the fields that are available, I wondered whether correlations could be drawn between tip amount as a percentage of fare and the length of the ride, where the ride originated, or the time of day: something that would try to draw a conclusion along different axes like that. But I don't feel like I'm enough of a data scientist to know whether that's trivial, difficult, or interesting. Example hypotheses: tips are more generous as a percentage of total fare for shorter rides and for ones originating in Manhattan.
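A minimal sketch of how that tip-percentage question might be set up, in plain Python rather than Arkouda. The rows are synthetic and say nothing about real tipping behavior. The field names `trip_distance`, `fare_amount`, and `tip_amount` are assumed to match the data dictionary, while `pickup_hour` and `pickup_borough` are hypothetical derived columns (the borough would have to come from joining the pickup zone against the zone lookup table). The distance cutoffs are arbitrary.

```python
from collections import defaultdict
from statistics import mean

# Synthetic rows for illustration only; NOT real trip records.
trips = [
    {"trip_distance": 0.8, "pickup_borough": "Manhattan", "pickup_hour": 9,
     "fare_amount": 6.0, "tip_amount": 1.5},
    {"trip_distance": 1.5, "pickup_borough": "Manhattan", "pickup_hour": 18,
     "fare_amount": 8.0, "tip_amount": 2.0},
    {"trip_distance": 5.0, "pickup_borough": "Brooklyn", "pickup_hour": 9,
     "fare_amount": 20.0, "tip_amount": 3.0},
    {"trip_distance": 17.0, "pickup_borough": "Queens", "pickup_hour": 18,
     "fare_amount": 52.0, "tip_amount": 5.2},
    # Zero-fare row, skipped below to avoid dividing by zero.
    {"trip_distance": 0.0, "pickup_borough": "Queens", "pickup_hour": 3,
     "fare_amount": 0.0, "tip_amount": 0.0},
]


def tip_pct(trip):
    """Tip as a percentage of the fare."""
    return 100.0 * trip["tip_amount"] / trip["fare_amount"]


def distance_bucket(trip):
    """Coarse ride-length bucket; the cutoffs are arbitrary assumptions."""
    d = trip["trip_distance"]
    if d < 2:
        return "short"
    if d < 10:
        return "medium"
    return "long"


def mean_tip_pct_by(rows, key):
    """Group rows by key(row) and average the tip percentage per group."""
    groups = defaultdict(list)
    for row in rows:
        if row["fare_amount"] > 0:
            groups[key(row)].append(tip_pct(row))
    return {k: mean(v) for k, v in groups.items()}


by_length = mean_tip_pct_by(trips, distance_bucket)
by_borough = mean_tip_pct_by(trips, lambda t: t["pickup_borough"])
by_hour = mean_tip_pct_by(trips, lambda t: t["pickup_hour"])
print(by_length, by_borough, by_hour)
```

Each axis is just a different key function over the same group-and-average step, so ride length, origin, and time of day can be compared side by side before deciding whether a real correlation test is worth running.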