Open dcjohnson24 opened 2 years ago
I do think that for historical versions you probably would want to pick up all successive versions, because if there's a new version released it likely contains corrections. So the best way to treat a schedule zipfile is that it should be considered "in effect" for the period where both the following are true:
feed_info.txt (if present) indicates that the feed is valid, i.e., dates between feed_info.feed_start_date and feed_info.feed_end_date, if both those fields are present & populated
Theoretically, if both those conditions are met but there are no active services in calendar.txt or calendar_dates.txt, the feed is explicitly saying that there is no service scheduled for the periods where the feed is "in effect" but no service is active, if that makes sense. I can also pair with someone on this if needed. I think we may want to create some kind of shared config file in the repo that contains feed "in effect" logic that we can use across contexts for consistency and to avoid making everyone deal with this problem individually.... 🤔
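As a starting point for that shared logic, here is a minimal sketch of the feed_info.txt condition only. It assumes one row of feed_info.txt has already been read into a dict (or None when the file is absent) and that dates use the GTFS YYYYMMDD format. The names parse_gtfs_date and feed_info_allows are hypothetical, not functions from the repo.

```python
from datetime import date, datetime
from typing import Optional


def parse_gtfs_date(value: Optional[str]) -> Optional[date]:
    """Parse a GTFS YYYYMMDD date string; return None if missing or blank."""
    if not value or not value.strip():
        return None
    return datetime.strptime(value.strip(), "%Y%m%d").date()


def feed_info_allows(day: date, feed_info: Optional[dict]) -> bool:
    """Check the feed_info.txt condition for a single service day.

    feed_info is one row of feed_info.txt as a dict, or None if the file
    is absent (assumed input shape). The date window is only enforced when
    both feed_start_date and feed_end_date are present and populated.
    """
    if feed_info is None:
        return True
    start = parse_gtfs_date(feed_info.get("feed_start_date"))
    end = parse_gtfs_date(feed_info.get("feed_end_date"))
    if start is None or end is None:
        return True
    return start <= day <= end
```

A feed with no feed_info.txt, or with either date blank, passes this check by design, so the other condition still has to be evaluated separately.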
Thanks for the explanation. I can try to write something that would check for those two conditions. Would someone still want to download that feed even if calendar.txt and calendar_dates.txt are empty?
If calendar and calendar_dates are empty, that's presumably an error and I'd assume that they uploaded a new version fairly quickly. So perhaps in that case you'd just keep the most recent one with service populated.
(Sidebar, this kind of thing is why I want to take a second look at the general schedule aggregation code in the RT vs. schedule script.)
I've just submitted a pull request (#35) that makes a first attempt at generating schedule feeds. I wasn't sure how to handle the following case though.
In [10]: create_schedule_list(10, 2020)
INFO:root: Searching page 1
INFO:root: Searching page 2
INFO:root: Searching page 3
INFO:root: Searching page 4
INFO:root: Searching page 5
INFO:root: Searching page 6
INFO:root: Found schedule for October 2020
INFO:root: Adding schedule for October 10, 2020
INFO:root: The duplicate schedule versions are {'1 September 2021'}. Check whether these were in-effect
Out[10]:
[{'schedule_version': '20201010',
'feed_start_date': '2020-10-11',
'feed_end_date': '2020-11-13'},
{'schedule_version': '20201114',
'feed_start_date': '2020-11-15',
'feed_end_date': '2020-11-19'},
{'schedule_version': '20201120',
'feed_start_date': '2020-11-21',
'feed_end_date': '2020-12-11'},
{'schedule_version': '20201212',
'feed_start_date': '2020-12-13',
'feed_end_date': '2021-01-03'},
{'schedule_version': '20210104',
'feed_start_date': '2021-01-05',
'feed_end_date': '2021-03-17'},
{'schedule_version': '20210318',
'feed_start_date': '2021-03-19',
'feed_end_date': '2021-03-25'},
{'schedule_version': '20210326',
'feed_start_date': '2021-03-27',
'feed_end_date': '2021-04-22'},
{'schedule_version': '20210423',
'feed_start_date': '2021-04-24',
'feed_end_date': '2021-04-26'},
{'schedule_version': '20210427',
'feed_start_date': '2021-04-28',
'feed_end_date': '2021-05-03'},
{'schedule_version': '20210504',
'feed_start_date': '2021-05-05',
'feed_end_date': '2021-05-12'},
{'schedule_version': '20210513',
'feed_start_date': '2021-05-14',
'feed_end_date': '2021-05-27'},
{'schedule_version': '20210528',
'feed_start_date': '2021-05-29',
'feed_end_date': '2021-06-09'},
{'schedule_version': '20210610',
'feed_start_date': '2021-06-11',
'feed_end_date': '2021-06-14'},
{'schedule_version': '20210615',
'feed_start_date': '2021-06-16',
'feed_end_date': '2021-08-01'},
{'schedule_version': '20210802',
'feed_start_date': '2021-08-03',
'feed_end_date': '2021-08-31'},
{'schedule_version': '20210901',
'feed_start_date': '2021-09-02',
'feed_end_date': '2021-09-06'},
{'schedule_version': '20210907',
'feed_start_date': '2021-09-08',
...
]
There are multiple versions of 1 September 2021 here. Would this violate condition 1 because there is no gap between successive schedule versions? I was planning to drop it, but it looks like calendar.txt is nonempty. I settled for dropping the duplicates and keeping one version of 1 September 2021 for the date range computation. I'm not sure whether this is the correct approach though.
Oof, that's wild (that they had 3 versions). In cases of multiples, I'd just keep the final version that was left up (because that was the one that was actually online for subsequent days). Because the way this works on the actual CTA website is just that there's a current version, and whatever is the final version uploaded on a given date is the one that was left on the actual website the longest (into subsequent dates).
Okay thanks, I'll modify it to take the latest version then.
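For reference, keeping the final version uploaded on a given date could look roughly like the sketch below. It assumes the scraped versions arrive as a list of dicts in upload order (oldest first) and that duplicates share the same schedule_version string. The function name keep_latest_per_date and the label field in the usage example are hypothetical.

```python
def keep_latest_per_date(versions):
    """Keep only the last-uploaded entry for each schedule_version.

    versions: list of dicts in upload order, oldest first (an assumption
    about how the scraper returns them), each with a 'schedule_version' key.
    """
    latest = {}
    for version in versions:
        # A later upload with the same date overwrites the earlier one.
        latest[version["schedule_version"]] = version
    return sorted(latest.values(), key=lambda v: v["schedule_version"])


# Usage: three uploads share the 1 September 2021 date; only the last survives.
versions = [
    {"schedule_version": "20210901", "label": "first upload"},
    {"schedule_version": "20210802", "label": "only upload"},
    {"schedule_version": "20210901", "label": "second upload"},
    {"schedule_version": "20210901", "label": "third upload"},
]
deduplicated = keep_latest_per_date(versions)
```

If the scraper returns versions newest first instead, the list would need to be reversed before calling this.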
It will be important to start tracking the number of scheduled trips starting from pre-COVID up to date. This will help to check whether the CTA decides to lower the number of scheduled trips to match the actual trips. The reduction in scheduled trips will improve the trip ratios, but the bus service will still be lower than pre-COVID levels, a less than ideal scenario.
Data
To access older data, you will need to look at schedule versions from transitfeeds.com dating back to 2019. You probably do not need to choose every schedule version for a given year because there is some overlap of the date ranges between the versions. It is good to check, however, that the schedule versions you choose span the entire year. For 2019, for example, you could choose the versions "7 November 2018" (6 November 2018 - 31 January 2019), "31 January 2019" (30 January 2019 - 31 March 2019), "14 April 2019" (29 March 2019 - 31 May 2019), "16 May 2019" (13 May 2019 - 31 July 2019), "5 August 2019" (1 August 2019 - 31 October 2019), "4 October 2019" (4 October 2019 - 31 December 2019). You could then drop the 2018 dates and duplicates that may arise from the overlapping dates.
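One way to check that a chosen set of versions spans the whole year, and to resolve the overlaps, is sketched below using the 2019 date ranges listed above. Letting the later version win on overlapping days is an assumption, not something the text prescribes. The names assign_days and missing_days are hypothetical.

```python
from datetime import date, timedelta

# (version label, first day, last day), copied from the ranges listed above.
VERSIONS_2019 = [
    ("7 November 2018", date(2018, 11, 6), date(2019, 1, 31)),
    ("31 January 2019", date(2019, 1, 30), date(2019, 3, 31)),
    ("14 April 2019", date(2019, 3, 29), date(2019, 5, 31)),
    ("16 May 2019", date(2019, 5, 13), date(2019, 7, 31)),
    ("5 August 2019", date(2019, 8, 1), date(2019, 10, 31)),
    ("4 October 2019", date(2019, 10, 4), date(2019, 12, 31)),
]


def assign_days(versions, year):
    """Map each day of `year` to one version label.

    Days outside `year` are dropped. On overlapping days, entries later
    in the list overwrite earlier ones (an assumed tie-breaking rule).
    """
    assigned = {}
    for label, start, end in versions:
        day = start
        while day <= end:
            if day.year == year:
                assigned[day] = label
            day += timedelta(days=1)
    return assigned


def missing_days(assigned, year):
    """Return the days of `year` not covered by any version."""
    missing = []
    day = date(year, 1, 1)
    while day.year == year:
        if day not in assigned:
            missing.append(day)
        day += timedelta(days=1)
    return missing


assigned_2019 = assign_days(VERSIONS_2019, 2019)
gaps_2019 = missing_days(assigned_2019, 2019)
```

An empty gaps_2019 list means the chosen versions cover every day of 2019.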
Set up your virtual environment with the required packages by following the instructions in the README, and activate it. Once you have the schedule feeds of interest, run the snippet from inside the data_analysis directory. If running from the project root, change static_gtfs_analysis to data_analysis.static_gtfs_analysis.
Access the schedule data in the list with
schedule_df should have enough information to generate plots of scheduled trip counts by day.
Example