SiRumCz / CSC501

CSC501 assignments

Data pre processing #44

Closed soroushysfi closed 4 years ago

soroushysfi commented 4 years ago

First, I think we have to come up with ways to clean up the data, because when Kevin and I looked at it we found a lot of noise and duplicate records. I listed some of the bad data below. I would love it if you guys could check whether you can find any other problems.

  1. Some records have dates later than the present date! I think we have to filter them out.
  2. There are some duplicate records, so we have to use DISTINCT.
  3. Almost all the records have the same value in the improvement_surcharge column (0.3), except the ones whose distance was 0. So we can also remove this column.
  4. The store_and_fwd_flag column tells whether the taxi had an internet connection or not. Almost all of the taxis had an internet connection. I think we can also remove this column, but before we do we have to run a query to be sure. If the count of taxis without a connection is not that big, we can ignore it, I guess.
  5. The mta_tax column has the same amount (0.5) for trips between different districts, but trips within the same district have 0 in this column.
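
A minimal sketch of the checks in items 1 to 3, in plain Python so it runs without the real dataset. The record layout (dicts keyed by column name), the `tpep_pickup_datetime` column name, and the helper names `clean_records` and `column_counts` are assumptions for illustration, and the sample rows are made up.

```python
from collections import Counter
from datetime import datetime

# Assumed column name for the pickup timestamp; adjust to the real schema.
PICKUP_COL = "tpep_pickup_datetime"


def clean_records(rows, now):
    """Drop records dated after `now` and exact duplicate records."""
    seen = set()
    cleaned = []
    for row in rows:
        if row[PICKUP_COL] > now:
            continue  # item 1: the date is in the future
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # item 2: exact duplicate, same effect as DISTINCT
        seen.add(key)
        cleaned.append(row)
    return cleaned


def column_counts(rows, column):
    """Count how often each value occurs, to spot near-constant columns."""
    return Counter(row[column] for row in rows)


# Made-up rows, only to exercise the helpers.
now = datetime(2019, 10, 7)
rows = [
    {PICKUP_COL: datetime(2019, 1, 1, 8, 0), "improvement_surcharge": 0.3},
    {PICKUP_COL: datetime(2019, 1, 1, 8, 0), "improvement_surcharge": 0.3},  # duplicate
    {PICKUP_COL: datetime(2088, 1, 1, 8, 0), "improvement_surcharge": 0.3},  # future date
    {PICKUP_COL: datetime(2019, 1, 2, 9, 0), "improvement_surcharge": 0.0},
]
cleaned = clean_records(rows, now)
counts = column_counts(cleaned, "improvement_surcharge")
```

If one value dominates the result of `column_counts` (as we expect for improvement_surcharge and store_and_fwd_flag), that would support dropping the column.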

There was also the issue of mapping district numbers to coordinates. I found a file that maps district numbers to their names and coordinates: taxi_zones.dbf.zip

SiRumCz commented 4 years ago

To view New York's map, we can download the shapefile from https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page, then use an online viewer such as https://mapshaper.org to open the .shp file. There is an existing Python library for opening and editing shapefiles; I will see how we could use it.

soroushysfi commented 4 years ago

So for today's presentation we can show the map and talk about the different kinds of visualization we can do, like a heat map, trip paths (connecting the start point and end point with a link), comparing trips according to their payment method, filtering out trips with total_amount higher than $10, etc. (any other insight we can extract). Also, at the start we can talk about how we can clean the data to make it more meaningful.

First we could select only the paths that go between different districts (not the trips inside a district). Then we can show them like this:

[Image: Screen Shot 2019-10-07 at 9 50 41 AM]

I don't know whether it will turn into clutter or not (if it does, we can think about how to thin it out a little).

We can also have a heat map showing the trips inside each district. In my opinion, showing the trips inside a district with an arrow won't be that meaningful, since we don't want to be really specific about the coordinates and the paths.

[Image: Screen Shot 2019-10-07 at 9 56 18 AM]
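
To make the two views concrete, here is a rough sketch of the aggregation behind them: count trips per (pickup, dropoff) district pair for the arrows, and count same-district trips per district for the heat map. It assumes each trip has already been reduced to a pair of district IDs; the function name `split_trip_counts` and the sample IDs are made up for illustration.

```python
from collections import Counter


def split_trip_counts(trips):
    """Return (between, within) counters for an iterable of (pickup, dropoff) IDs."""
    between = Counter()  # (pickup, dropoff) -> trip count, one arrow per pair
    within = Counter()   # district -> trips inside it, one heat map cell each
    for pickup, dropoff in trips:
        if pickup == dropoff:
            within[pickup] += 1
        else:
            between[(pickup, dropoff)] += 1
    return between, within


# Made-up district IDs, only to show the shape of the result.
trips = [(1, 2), (1, 2), (2, 1), (3, 3), (3, 3), (4, 4)]
between, within = split_trip_counts(trips)
top_links = between.most_common(1)  # keep only the busiest links
```

Drawing only the top N pairs from `most_common` is one possible way to thin out the arrows if the map gets cluttered.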

Any other thoughts?

soroushysfi commented 4 years ago

Could we calculate longest trips?

I think we could do that with Euclidean distance, because we can map each district number to its coordinates.
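
A small sketch of that idea, assuming we have one (x, y) point per district, for example a centroid. The `centroids` values, the trip pairs, and the helper names are hypothetical and only show the calculation.

```python
import math

# Hypothetical district centroids as (x, y); real values would come from the zone file.
centroids = {1: (0.0, 0.0), 2: (3.0, 4.0), 3: (6.0, 8.0)}


def trip_distance(pickup, dropoff, centroids):
    """Straight-line (Euclidean) distance between two district centroids."""
    x1, y1 = centroids[pickup]
    x2, y2 = centroids[dropoff]
    return math.hypot(x2 - x1, y2 - y1)


def longest_trip(trips, centroids):
    """Return the (pickup, dropoff) pair with the largest centroid distance."""
    return max(trips, key=lambda t: trip_distance(t[0], t[1], centroids))


# Made-up (pickup, dropoff) district pairs.
trips = [(1, 2), (2, 3), (1, 3), (2, 2)]
longest = longest_trip(trips, centroids)
```

Two caveats: trips inside a single district get distance 0 with this approach, and if the coordinates are latitude/longitude rather than projected, the result is only a rough ranking because a degree of longitude and a degree of latitude are not the same length.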

soroushysfi commented 4 years ago

If you guys are OK with it, I will send this PDF to Sean for our presentation today. Data Pre processing.pdf

soroushysfi commented 4 years ago

It's not showing the Leaflet maps in the .html I sent... Here's a screenshot. I put the notebook on a branch too. [Image: Screen Shot 2019-10-06 at 5 45 13 PM]

Do you want to do the presentation? I ask because I don't know how the map you're working with works, and it would be nice to show it in our presentation. It's OK if you don't want to use the PDF I sent; use whichever you're comfortable with.

soroushysfi commented 4 years ago

I talked to Sean about plugging in my laptop so I could show the maps - I also cleaned the notebook up. I could do the presentation. Could I show the maps and then maybe you could talk a little about cleaning up the data with the pdf?

Yeah that sounds good.