dpzhang / Project_AHDA

CAPP30123 Course Project

Road-map #1

Open dpzhang opened 7 years ago

dpzhang commented 7 years ago

Research Question: Are there any differences in driving patterns between high-income taxi drivers and low-income drivers before and after 2016?

  1. How do we define high-income and low-income drivers?

    • classify them by average daily income
  2. How do we quantify driving patterns?

    • aggregate all trips by census tract
    • conditioning on pickup and drop-off location (census tracts) and on time (hourly), we want to compare the actual distance (velocity?)
    • how long does it take, on average, to pick up a passenger, and does this differ between high-income and low-income drivers?
dpzhang commented 7 years ago
  1. Classify high-income and low-income drivers
    • Based on medallion ID, we can plot a distribution of average yearly income
      • sum all trip fares based on unique taxi ID within 1 year for each driver
      • divide the sum by the number of years each unique taxi operated from 2013-2017
        • Why weighted annual income? Some drivers make more trips than others within the same time span, so looking at the weighted annual income could be less biased.
      • plot average daily income to get a sense of the distribution
      • classify and label high-income and low-income drivers from that distribution
    • Useful statistic: weighted average fare a driver earned per year
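
A minimal sketch of the classification steps above, using only the Python standard library. The record layout `(taxi_id, year, fare)`, the sample values, and the median split used as the high/low cutoff are all assumptions for illustration; the thread does not fix a threshold.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (taxi_id, year, fare). Values are made up.
trips = [
    ("taxi_a", 2013, 30000.0),
    ("taxi_a", 2014, 34000.0),
    ("taxi_b", 2013, 12000.0),
    ("taxi_c", 2015, 20000.0),
    ("taxi_c", 2016, 22000.0),
    ("taxi_c", 2017, 24000.0),
]


def average_annual_income(trips):
    """Sum fares per taxi, then divide by the number of distinct years it operated."""
    total = defaultdict(float)
    years = defaultdict(set)
    for taxi_id, year, fare in trips:
        total[taxi_id] += fare
        years[taxi_id].add(year)
    return {taxi_id: total[taxi_id] / len(years[taxi_id]) for taxi_id in total}


def label_drivers(income, threshold=None):
    """Label each taxi 'high' or 'low'. The median cutoff is an assumption."""
    if threshold is None:
        threshold = median(income.values())
    return {
        taxi_id: ("high" if value > threshold else "low")
        for taxi_id, value in income.items()
    }


income = average_annual_income(trips)
labels = label_drivers(income)
```

In practice the cutoff would be chosen after plotting the distribution, as the list above suggests; `threshold` can be passed explicitly once it is known.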
dpzhang commented 7 years ago
  1. How to quantify driving pattern in general?

    • Categorize dates into three different levels: weekdays, weekends, and holidays

    • Aggregate all taxi trips by the 801 Chicago census tracts

      • Taxi Trip Dataset: Pickup Census Tract and Drop-off Census Tract
      • Census Tract Boundary File: statefp10+countyfp10+tractce10
      • Some census tracts might not have any taxi pickups in four years; in that case, we need to remove those census tracts.
    • Conditioning all trips by time

      • each temporal unit is 3 hours (6-9, 9-12, 12-15, 15-18, 18-21, 21-0, 0-3, 3-6)
    • Conditioning all trip flows by 9 regions:

      • North to East, South, West
      • South to North, East, West
      • East to North, South, West
      • West to North, South, East
    • After conditioning each trip by spatial unit and temporal unit, we need to study the flow from CT1 to CT2 or from CT2 to CT1

      • Comparing driving distance from CT to CT

        • Challenge: census tracts vary in geographical size, so trips from CT1 to CT2 may vary widely in distance; we need a way to standardize
        • Solution: compute a statistic by dividing actual trip miles by distance on map
        • Interpretation: for every mile of map distance, how many miles do drivers actually drive (that is, how many extra miles do they take), and how do those extra miles differ between high-income and low-income drivers?
      • Comparing driving velocity from CT to CT

        • Question: Using the statistic we computed for distance, we want to combine it with velocity.
        • Why?: For example, suppose high-income drivers tend to drive a longer distance from CT1 to CT2 than low-income drivers. Is it because high-income drivers take longer routes to avoid congested roads, while driving fast enough that the higher velocity compensates for the longer route?
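
A rough sketch of the distance and velocity statistics discussed above. It assumes "distance on map" means the great-circle (haversine) distance between the pickup and drop-off coordinates, which is one possible reading; the function names, the Earth radius constant, and the sample coordinates are illustrative and not taken from the project.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def detour_ratio(trip_miles, map_miles):
    """Actual trip miles per mile of map distance; None when undefined."""
    if map_miles <= 0:
        return None
    return trip_miles / map_miles


def speed_mph(miles, trip_seconds):
    """Average speed in miles per hour; None for non-positive durations."""
    if trip_seconds <= 0:
        return None
    return miles / (trip_seconds / 3600)


# Hypothetical trip: coordinates roughly downtown to the northwest side.
map_miles = haversine_miles(41.8781, -87.6298, 41.9742, -87.9073)
ratio = detour_ratio(18.0, map_miles)   # 18.0 reported trip miles (made up)
speed = speed_mph(18.0, 2400)           # 40-minute trip (made up)
```

A ratio close to 1 would mean the driver followed something near the straight line; comparing the ratio and the speed together within the same CT pair and time window is what would separate "longer but faster" routes from simply longer ones.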
dpzhang commented 7 years ago
  1. More detailed and specific study of driving pattern?
    • Hypothesis: do low-income drivers believe they will earn more by staying downtown or in more populous places, while high-income drivers are willing to go to outlying neighborhoods where trips are more likely to be longer?
    • By looking at the pick-up and drop-off locations of high-income drivers, we want to get a sense of which census tracts these drivers typically visit. If a pattern emerges, do those neighborhoods share any characteristics?
    • How do we quantify "good" or "bad" neighborhoods?
      • crime rate?
      • averaged income?
      • black/hispanic population?
dpzhang commented 7 years ago

8 variables that need to be added to the raw dataset:

  1. Region:
    • In the census tract shapefile, a community ID is mapped to each unique census tract.
    • Chicago's 77 community areas are also grouped into 9 regions:
      • classify community by regions
      • classify census tracts by community region
  2. Absolute Distance from pickup coordinate to drop-off coordinate
  3. Ratio of real path length over shortest path length (RRSL)
  4. Absolute Velocity: Absolute Distance / Trip Duration
  5. Relative Velocity: Relative Distance / Trip Duration
  6. Ratio of real velocity over relative velocity (RRVV)
  7. Time Period: 8 levels as classified above
  8. Day: Indicator of whether the date is a weekday, weekend, or holiday
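
A small sketch of how variables 7 and 8 might be derived from a trip start timestamp. The holiday set is a placeholder, and letting a holiday take precedence over a weekend is an assumption; neither is specified in the thread.

```python
from datetime import date, datetime

# Placeholder holiday calendar; a real analysis would supply its own list.
HOLIDAYS = {date(2016, 1, 1), date(2016, 7, 4), date(2016, 12, 25)}


def time_period(ts):
    """Return the 3-hour bin containing ts, labeled like '6-9' or '21-0'."""
    start = (ts.hour // 3) * 3
    end = (start + 3) % 24
    return f"{start}-{end}"


def day_type(ts, holidays=HOLIDAYS):
    """Return 'holiday', 'weekend', or 'weekday' (holiday checked first)."""
    if ts.date() in holidays:
        return "holiday"
    return "weekend" if ts.weekday() >= 5 else "weekday"


pickup = datetime(2016, 7, 9, 22, 15)  # a Saturday evening
period = time_period(pickup)
day = day_type(pickup)
```

The labels produced by `time_period` match the eight bins listed earlier in the thread, so the same strings can be used as grouping keys.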
ningyin-xu commented 7 years ago

Feedback: What if some particular trips are made only by high-income drivers and not by low-income drivers?

People who work for tips: Is the tip variable correlated with people from low-income neighborhoods?