soroushysfi closed this issue 4 years ago
To view New York's geomap, we can download the shapefile from https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page, then use an online viewer such as https://mapshaper.org to open the .shp file. There is an existing Python library for opening and editing shapefiles; I will see how we could use it.
So for today's presentation we can show the map and talk about the different kinds of visualization we can do, such as a heat map, trip paths (connecting the start point and end point with a link), comparing trips by payment method, filtering out trips with total_amount higher than $10, etc. (any other insight we can extract). Also, at the start we can talk about how to clean the data to make it more meaningful. First we could pick out the paths that are between different districts (not the trips inside a district). Then we can show it like this: I don't know whether it will turn into clutter or not (if it does, we can think about how to thin it out a little). For the trips inside each district we can use a heat map (in my opinion, showing trips inside a district with arrows won't be very meaningful, since we don't want to be that specific about the coordinates and the paths).
Any other thoughts?
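The filtering ideas above could be sketched roughly like this, using only the standard library. This is a minimal sketch, not the notebook's actual code: the sample rows are made up, the column names PULocationID and DOLocationID are assumed from the TLC trip record layout, and keeping trips at or under $10 is only one reading of the total_amount filter.

```python
from collections import Counter

# Hypothetical sample rows. PULocationID / DOLocationID are assumed
# column names for the pickup and drop-off district numbers.
trips = [
    {"PULocationID": 1, "DOLocationID": 2, "payment_type": 1, "total_amount": 12.5},
    {"PULocationID": 1, "DOLocationID": 1, "payment_type": 2, "total_amount": 6.0},
    {"PULocationID": 1, "DOLocationID": 2, "payment_type": 1, "total_amount": 30.0},
    {"PULocationID": 3, "DOLocationID": 1, "payment_type": 2, "total_amount": 9.0},
]

def split_trips(rows):
    """Separate trips between two districts from trips inside one district."""
    inter = [r for r in rows if r["PULocationID"] != r["DOLocationID"]]
    intra = [r for r in rows if r["PULocationID"] == r["DOLocationID"]]
    return inter, intra

inter, intra = split_trips(trips)

# One count per (start district, end district) pair: candidate links on the map.
link_counts = Counter((r["PULocationID"], r["DOLocationID"]) for r in inter)

# One count per district for trips that stay inside it: input for a heat map.
heat_counts = Counter(r["PULocationID"] for r in intra)

# Assumed reading of the amount filter: drop trips above the threshold.
MAX_TOTAL = 10.0
under_limit = [r for r in trips if r["total_amount"] <= MAX_TOTAL]
```

Counting per district pair instead of drawing every trip is also one way to keep the link view from turning into clutter, since repeated trips collapse into a single weighted link.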
Could we calculate longest trips?
I think we could do that with Euclidean distance, because we can map each district number to its coordinates.
If you guys are OK with it, I will send this PDF to Sean for our presentation today. Data Pre processing.pdf
It's not showing the Leaflet maps in the .html I sent... Here's a screenshot. I put the notebook on a branch too.
Do you want to do the presentation? I don't know how the map you're working with works, and it would be nice to show it in our presentation. It's OK if you don't want to use the PDF I sent; whichever you're comfortable with.
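A minimal sketch of that idea, assuming each district number maps to one (x, y) point such as a zone centroid. The coordinates and trips below are made up for illustration. Note that this gives the straight-line distance between district points, not the distance actually driven.

```python
import math

# Hypothetical (x, y) point per district number; real values would have
# to come from a zone file that maps district numbers to coordinates.
centroids = {1: (0.0, 0.0), 2: (3.0, 4.0), 3: (6.0, 8.0)}

# Hypothetical trips as (pickup district, drop-off district) pairs.
trips = [(1, 2), (2, 3), (1, 3), (2, 2)]

def trip_distance(pickup, dropoff, centroids):
    """Straight-line distance between the two districts' points."""
    return math.dist(centroids[pickup], centroids[dropoff])

# The trip whose endpoints are farthest apart.
longest = max(trips, key=lambda t: trip_distance(t[0], t[1], centroids))
```

Trips that start and end in the same district get a distance of 0 under this mapping, so they never show up as longest trips.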
I talked to Sean about plugging in my laptop so I could show the maps - I also cleaned the notebook up. I could do the presentation. Could I show the maps and then maybe you could talk a little about cleaning up the data with the pdf?
Yeah that sounds good.
First, I think we have to come up with ways to clean up the data, because when Kevin and I looked at it we found a lot of noise and many repeated records. I listed some of the bad data; I would love it if you guys could see whether you can find any others.
The improvement_surcharge column is 0.3 for every record except the ones whose distance was 0. So we can also remove this column.
The store_and_fwd_flag column somehow tells whether the taxi had an internet connection or not. Almost all of the taxis had an internet connection. I think we can also remove this column, but before we do we have to run a query to be sure. If the count is not that big, we can ignore it, I guess.
There was this issue of mapping district numbers to coordinates. I found a file that contains district numbers mapped to their names and coordinates. taxi_zones.dbf.zip
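The checks on repeated records and on the improvement_surcharge and store_and_fwd_flag columns could be run along these lines before dropping anything. This is only a sketch on made-up rows: the trip_distance column name and the "Y"/"N" flag values are assumptions, and the real query would run over the full dataset.

```python
from collections import Counter

# Hypothetical sample rows; trip_distance and the "Y"/"N" flag values
# are assumed, not taken from the actual data.
rows = [
    {"trip_distance": 1.2, "improvement_surcharge": 0.3, "store_and_fwd_flag": "N"},
    {"trip_distance": 1.2, "improvement_surcharge": 0.3, "store_and_fwd_flag": "N"},
    {"trip_distance": 0.0, "improvement_surcharge": 0.0, "store_and_fwd_flag": "N"},
    {"trip_distance": 3.4, "improvement_surcharge": 0.3, "store_and_fwd_flag": "Y"},
]

def drop_duplicates(rows):
    """Keep the first copy of each record that is identical in every column."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

unique_rows = drop_duplicates(rows)

# How often each value occurs; a column dominated by a single value
# carries little information and is a candidate for removal.
surcharge_counts = Counter(r["improvement_surcharge"] for r in unique_rows)
flag_counts = Counter(r["store_and_fwd_flag"] for r in unique_rows)
```

If the minority counts turn out to be tiny compared to the total, that would support ignoring those rows and removing the column.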