SiRumCz / CSC501

CSC501 assignments

2018 112M taxi trips dataset filtering and 5 new APIs #59

Closed SiRumCz closed 4 years ago

SiRumCz commented 4 years ago

This pull request contains: 1) stream-storing the 112M dataset, filtering it, creating temporary/cached tables for the new APIs, and removing the raw dataset; 2) 5 new APIs:

The structures and functions are the same as the old APIs', except for the last one, where I changed the date to month as Soroush requested.

To save time (the script takes roughly 5+ hours to run), the database file I produced with the script can be downloaded here: https://drive.google.com/open?id=1POc_uU6bnQKImJsnsS5QQrdPVk5x01X0

Here are the tasks I have done:

  1. I downloaded the dataset and stream-stored it to a local SQLite database.
  2. I then applied two filters to the original data: 1) exclude trips with a pickup/dropoff datetime outside the 2018 timeframe; 2) remove duplicate rows. This step dropped 9,433,333 trips.
  3. I tried querying data for /payment-trend-timeline, and the API responded after 15 minutes. That is far too long, so I moved on to creating temporary/cached tables for the 5 APIs. After this step, the query time for each API is <1 sec.
  4. To make the database shareable with team members, I removed the 112M 2018 taxi trips from the database and kept only the 5 temp/cached tables for API queries. The size of the db file dropped from 11.8GB to 110MB.

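
For anyone who wants the shape of the four steps without reading setup_db.py, here is a minimal sketch using only the standard library. It is not the actual script: the table names (trips, payment_trend_cache), the four-column layout, and the ISO-style datetime strings are all assumptions made for illustration, and the real CSV has more columns and may use a different date format.

```python
import csv
import sqlite3
from itertools import islice


def stream_store(conn, csv_file, batch_size=50_000):
    """Step 1: insert CSV rows in batches so the file never sits fully in memory."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trips ("
        "pickup TEXT, dropoff TEXT, passenger_count INTEGER, payment_type INTEGER)"
    )
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        conn.executemany("INSERT INTO trips VALUES (?, ?, ?, ?)", batch)
        conn.commit()


def filter_trips(conn):
    """Step 2: drop trips outside 2018, then exact duplicates. Returns rows removed."""
    before = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
    # Assumes datetimes are stored as 'YYYY-MM-DD hh:mm:ss', so text comparison works.
    conn.execute(
        "DELETE FROM trips "
        "WHERE pickup < '2018-01-01' OR pickup >= '2019-01-01' "
        "OR dropoff < '2018-01-01' OR dropoff >= '2019-01-01'"
    )
    # Keep the first copy (lowest rowid) of each identical row.
    conn.execute(
        "DELETE FROM trips WHERE rowid NOT IN ("
        "SELECT MIN(rowid) FROM trips "
        "GROUP BY pickup, dropoff, passenger_count, payment_type)"
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
    return before - after


def build_cache(conn):
    """Step 3: precompute one aggregate table so the API never scans raw trips."""
    conn.execute("DROP TABLE IF EXISTS payment_trend_cache")
    conn.execute(
        "CREATE TABLE payment_trend_cache AS "
        "SELECT CAST(strftime('%m', pickup) AS INTEGER) AS month, "
        "payment_type, COUNT(*) AS trips "
        "FROM trips GROUP BY strftime('%m', pickup), payment_type"
    )
    conn.commit()


def drop_raw(conn):
    """Step 4: remove the raw table and reclaim the freed pages on disk."""
    conn.execute("DROP TABLE trips")
    conn.commit()
    conn.execute("VACUUM")  # must run outside a transaction
```

The VACUUM at the end matters for the file size: dropping a table only marks its pages as free, and the file shrinks once the database is rebuilt.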
SiRumCz commented 4 years ago

@soroushysfi I have added two more interval tree (passenger number & date period) APIs for both the taxi-sample and the 112M 2018 datasets.

Note that fromDate and toDate have different formats in these two APIs. I will also update the assignment2.db file on my Google Drive: https://drive.google.com/open?id=1POc_uU6bnQKImJsnsS5QQrdPVk5x01X0

soroushysfi commented 4 years ago

@soroushysfi I have added two more interval tree (passenger number & date period) APIs for both the taxi-sample and the 112M 2018 datasets.

  • /interval-tree-passengers-2018:
Interval tree data. X: month period of 2018, Y: passenger number.
  {
    "YAxisMax": 192,
    "data": [
      {"fromDate": 1, "passengerNum": 0, "toDate": 12},
      {...},
      ...
    ]
  }
  • /interval-tree-passengers:
Interval tree data. X: year-month period of taxi sample data, Y: passenger number.
  {
    "YAxisMax": 8,
    "data": [
      {"fromDate": "2018-12", "passengerNum": 0, "toDate": "2018-12"},
      {...},
      ...
    ]
  }

Note that fromDate and toDate have different formats in these two APIs. I will also update the assignment2.db file on my Google Drive: https://drive.google.com/open?id=1POc_uU6bnQKImJsnsS5QQrdPVk5x01X0
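
To show how a client might consume either response, here is a small sketch that finds which passenger numbers have an interval covering a given point. It uses a plain linear scan, not an interval tree lookup, and the helper name intervals_covering and the sample records are made up for illustration. It relies only on the fromDate, toDate and passengerNum fields shown above, and it works for both endpoints because integer months and zero-padded "YYYY-MM" strings both compare in calendar order.

```python
def intervals_covering(response, point):
    """Return the passenger numbers whose [fromDate, toDate] interval contains point.

    `point` must be the same type the endpoint uses: an int month for
    /interval-tree-passengers-2018, a "YYYY-MM" string for /interval-tree-passengers.
    """
    return sorted(
        {
            rec["passengerNum"]
            for rec in response["data"]
            if rec["fromDate"] <= point <= rec["toDate"]
        }
    )


# Made-up sample payloads in the two documented shapes.
resp_2018 = {
    "YAxisMax": 192,
    "data": [
        {"fromDate": 1, "passengerNum": 0, "toDate": 12},
        {"fromDate": 3, "passengerNum": 7, "toDate": 5},
    ],
}
resp_sample = {
    "YAxisMax": 8,
    "data": [
        {"fromDate": "2018-12", "passengerNum": 0, "toDate": "2018-12"},
    ],
}

covering_april = intervals_covering(resp_2018, 4)
covering_december = intervals_covering(resp_sample, "2018-12")
```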

I get the error: No such file or directory: '2018_Yellow_Taxi_Trip_Data.csv'. Do you have the file?

soroushysfi commented 4 years ago

Also, I thought the data was going to be in node-and-link form. This is not the format we talked about.

SiRumCz commented 4 years ago

You don't have to run setup_db.py; it will take hours to run. I have uploaded my assignment2.db file to Google Drive, so you only need to download it and run app.py.

SiRumCz commented 4 years ago

Are you talking about the interval tree data?

soroushysfi commented 4 years ago

Yes, because I thought we were going to visualize it as a node-link diagram.

SiRumCz commented 4 years ago

Could you give me some samples to look at? I am not sure what it should look like.

soroushysfi commented 4 years ago

Just like the ones we did for the first assignment:

{
  nodes: [
    {
      id: 1,
      name: "sample"
    },
    ...
  ],
  links: [
    {
      source: 1,
      target: 2
    }
  ]
}
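
One possible way to get from interval records to this node-link shape is sketched below. It arranges the intervals as a balanced binary search tree keyed on the interval start and emits one node per interval and one link per parent-child edge. This is an assumption about what the diagram could show, not something already in the repo: the function name intervals_to_node_link is hypothetical, and a full interval tree would also store the maximum end point of each subtree, which is left out here to match the nodes/links fields above.

```python
def intervals_to_node_link(intervals):
    """Arrange (start, end) intervals as a balanced BST and return node-link data."""
    ordered = sorted(intervals)
    nodes, links = [], []

    def build(lo, hi):
        # Build the subtree for ordered[lo:hi]; return its root id, or None if empty.
        if lo >= hi:
            return None
        mid = (lo + hi) // 2
        node_id = len(nodes) + 1
        start, end = ordered[mid]
        nodes.append({"id": node_id, "name": f"[{start}, {end}]"})
        for child in (build(lo, mid), build(mid + 1, hi)):
            if child is not None:
                links.append({"source": node_id, "target": child})
        return node_id

    build(0, len(ordered))
    return {"nodes": nodes, "links": links}


# Made-up month intervals, for illustration only.
graph = intervals_to_node_link([(1, 12), (3, 5), (6, 9)])
```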

If it takes time we don't need to do it. If I have this format I could do it in half an hour.

SiRumCz commented 4 years ago

I can try this one, but I don't think this is an interval tree. I guess we don't need to visualise the interval tree; I can just talk about the API data.

soroushysfi commented 4 years ago

Oh, I see what you're saying. I thought we were going to show a tree. I can try and see if I can show it with a line plot.

SiRumCz commented 4 years ago

Only if you have time; we need to finish a technical report tonight as well.

SiRumCz commented 4 years ago

@soroushysfi I just checked the trip data, and I don't think I have time to finish the node-link diagram data for this one. Sorry.

soroushysfi commented 4 years ago

It's OK, I added interval trees to our visualizations. Making a new PR now.