SiRumCz / CSC501

CSC501 assignments

data modelling version 2 #55

Closed SiRumCz closed 4 years ago

SiRumCz commented 4 years ago

Soroush raised a problem with our current data schema. He suggests we should enrich our data model by dividing data into different time-related tables and adding more temporal information to them.

SiRumCz commented 4 years ago

To give a better visual presentation for payment-trend-timeline, Soroush and I discussed it and agreed that the provided dataset sample is not sufficient to present (data since 2019 January is completely missing). We decided to move on and work on the original 112M dataset to extract the full January to December taxi trips.

SiRumCz commented 4 years ago

Working from the original dataset, I extracted 112,234,626 rows and applied two filters: one to narrow the period down to 2018 (Jan to Dec) and one to remove duplicate rows. After filtering, the data is 102,801,293 rows (around 11.8GB as a .db file).
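
For reference, here is a minimal sketch of what the two filters could look like, assuming the .db file is SQLite. The table names `trips` and `trips_filtered`, the column names, and the `filter_trips` helper are all hypothetical and only for illustration; this is not the exact script that was run. It also assumes timestamps are stored as ISO-formatted text, so that string comparison matches chronological order.

```python
import sqlite3


def filter_trips(conn, year=2018):
    """Copy trips from `year` into trips_filtered, dropping exact duplicate rows.

    Assumes a hypothetical `trips` table with a `pickup_datetime` column
    holding ISO-formatted text such as '2018-01-01 00:05:00'.
    """
    start = f"{year}-01-01 00:00:00"
    end = f"{year + 1}-01-01 00:00:00"  # exclusive upper bound

    conn.execute("DROP TABLE IF EXISTS trips_filtered")
    # Create an empty table with the same columns as the source table.
    conn.execute("CREATE TABLE trips_filtered AS SELECT * FROM trips WHERE 0")
    # Filter 1: keep only rows inside the year.
    # Filter 2: SELECT DISTINCT drops rows that are identical in every column.
    conn.execute(
        """
        INSERT INTO trips_filtered
        SELECT DISTINCT *
        FROM trips
        WHERE pickup_datetime >= ? AND pickup_datetime < ?
        """,
        (start, end),
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM trips_filtered").fetchone()[0]


# Tiny in-memory example with made-up rows (not real taxi data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (pickup_datetime TEXT, payment_type INTEGER, total_amount REAL)"
)
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        ("2018-01-01 00:05:00", 1, 12.5),
        ("2018-01-01 00:05:00", 1, 12.5),  # exact duplicate, dropped
        ("2018-12-31 23:59:59", 2, 8.0),
        ("2019-01-01 00:00:00", 1, 20.0),  # outside 2018, dropped
        ("2017-12-31 23:59:59", 1, 5.0),   # outside 2018, dropped
    ],
)
kept = filter_trips(conn)
print(kept)
```

On the full dataset a `SELECT DISTINCT *` over 100M+ rows will likely be slow and need a lot of temporary disk space, so in practice it may be worth deduplicating on a smaller set of key columns instead.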

jonhealy1 commented 4 years ago

@SiRumCz Wow cool sounds great!