Open Fmazin opened 1 year ago
To do as of Sept 26 15:00: Find gaps in the data (apparently no data in May?). Import calmap. Group by calendar date and bus_id; each (date, bus_id) pair counts as a trip, since a stop sequence 1~27 keeps the same bus_id until it changes to another stop sequence. We could get the delay per trip, take the average of that, and then graph that average per day (Jan 8 ~ June 30).
Problem: grouping by calendar date returns multiple rows for the same date?
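A minimal sketch of the per-trip grouping and the gap check, assuming the CSV is loaded into a pandas DataFrame. The `timestamp` column name and the toy values are assumptions for illustration; only `bus_id` and `arrival_delay` come from the notes above. One possible cause of repeated dates is grouping on the full timestamp, or grouping on (date, bus_id) without a second aggregation, so the sketch normalises to midnight and aggregates twice.

```python
import pandas as pd

# Toy data standing in for the real CSV; `timestamp` is a hypothetical column name.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-01-08 07:00", "2022-01-08 07:05",
        "2022-01-08 09:00", "2022-01-09 07:00",
        "2022-01-11 07:00",
    ]),
    "bus_id": [1, 1, 2, 1, 3],
    "arrival_delay": [60, 120, -30, 30, 0],
})

# Strip the time of day so every row of the same day shares one key.
df["date"] = df["timestamp"].dt.normalize()

# First aggregation: one row per trip, i.e. per (date, bus_id) pair.
per_trip = df.groupby(["date", "bus_id"], as_index=False)["arrival_delay"].mean()

# Second aggregation: one row per date, averaging the per-trip delays.
per_day = per_trip.groupby("date")["arrival_delay"].mean()

# Gap check: calendar days in the covered range that have no rows at all.
full_range = pd.date_range(df["date"].min(), df["date"].max(), freq="D")
missing_days = full_range.difference(per_day.index)

print(per_day)
print(missing_days)
```

`per_day` is a Series indexed by date with a unique index, which should also be a suitable input for a calendar plot.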
Done: correlation matrix, Daily total arrival delay over time (Jan 8 - June 30), year plot of data frequency per day.
Things to consider: sommartidtabell (the summer timetable) started 24 June 2022 (midsommar). Studenten (the school graduation celebrations) is around early June.
In the CSV, early buses are shown as negative delays (i.e. a bus that is 60 minutes early has arrival_delay = -60). So if there is a bus that is 60 minutes late on the same day, the two cancel out and the total delay for that day is... 0 seconds. We can either disregard early buses as outliers, or take the absolute value of the early arrivals to get a "deviation from scheduled time" instead of a delay (we have a graph comparing abs to delay).
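To make the cancellation concrete, here is a small sketch with the two buses from the example above (one 60 early, one 60 late). It also shows a third option, counting early buses as zero delay, which is an assumption added for comparison and not something decided in this thread.

```python
import pandas as pd

# The two buses from the example: one 60 early, one 60 late.
delays = pd.Series([-60, 60], name="arrival_delay")

# Signed sum: early and late buses cancel each other out.
signed_total = delays.sum()

# Absolute sum: "deviation from scheduled time", early and late both count.
deviation_total = delays.abs().sum()

# Late-only sum: early buses are clipped to zero (hypothetical third option).
late_only_total = delays.clip(lower=0).sum()

print(signed_total, deviation_total, late_only_total)
```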
Also there's a bus that is supposedly 58 minutes early.
May 25: there are buses (plural) that are 30 min late, and there are buses (plural!) that are 30 min early. It could be that the early buses are making up for the late ones.
June 9: Simply late. No buses are early enough to make up for it. Probably studenten.
June 30: a single bus is 1 hour early (bus id 41361); no delays are significant enough to justify this. We are thinking of excluding this early bus on this day as an outlier.
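Excluding that single bus on that single day could look like the sketch below. The `date` column name, the other bus ids, and all delay values are made up for illustration; only bus id 41361 and the June 30 date come from the notes above.

```python
import pandas as pd

# Toy rows; everything except bus id 41361 on June 30 is hypothetical.
df = pd.DataFrame({
    "date": pd.to_datetime(["2022-06-30", "2022-06-30", "2022-06-30", "2022-06-29"]),
    "bus_id": [41361, 41362, 41363, 41361],
    "arrival_delay": [-3600, 45, -20, 10],
})

# Flag only the one bus on the one day, so the same bus_id on other days is kept.
is_outlier = (df["date"] == pd.Timestamp("2022-06-30")) & (df["bus_id"] == 41361)

cleaned = df[~is_outlier]
print(cleaned)
```

Filtering on both date and bus_id keeps the exclusion narrow and easy to undo, compared with a global threshold on early arrivals.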
Create graphs that show the correlation of the attributes to each other, their general spread, etc.
Only start this one after finishing #1 and #2.
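A rough starting point for the correlation and spread numbers behind those graphs, assuming a pandas DataFrame. `stop_sequence` and `arrival_delay` are mentioned in this thread; `departure_delay` and all values are hypothetical placeholders.

```python
import pandas as pd

# Toy data; `departure_delay` is a hypothetical attribute used for illustration.
df = pd.DataFrame({
    "stop_sequence": [1, 2, 3, 4, 5],
    "arrival_delay": [10, 30, 20, 50, 40],
    "departure_delay": [12, 28, 25, 55, 38],
})

# Keep only numeric columns before computing pairwise Pearson correlations.
numeric = df.select_dtypes(include="number")
corr = numeric.corr()

# Summary statistics (count, mean, std, quartiles) as a first look at the spread.
spread = numeric.describe()

print(corr)
print(spread)
```

The `corr` table can be passed to a heatmap plot, and `spread` gives the numbers a box plot or histogram would visualise.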