chihacknight / chn-ghost-buses

"Ghost buses" analysis project through Chi Hack Night
https://github.com/chihacknight/breakout-groups/issues/217
MIT License

[Data] Time series plot of scheduled trips since 2019 #30

Open dcjohnson24 opened 1 year ago

dcjohnson24 commented 1 year ago

It will be important to track the number of scheduled trips from the pre-COVID period up to the present. This will help check whether the CTA lowers the number of scheduled trips to match the number of actual trips. Such a reduction would improve the trip ratios, but bus service would still be below pre-COVID levels, a less than ideal scenario.

Data

To access older data, you will need to look at schedule versions from transitfeeds.com dating back to 2019. You probably do not need to choose every schedule version for a given year because there is some overlap of the date ranges between the versions. It is good to check, however, that the schedule versions you choose span the entire year. For 2019, for example, you could choose the versions "7 November 2018" (6 November 2018 - 31 January 2019), "31 January 2019" (30 January 2019 - 31 March 2019), "14 April 2019" (29 March 2019 - 31 May 2019), "16 May 2019" (13 May 2019 - 31 July 2019), "5 August 2019" (1 August 2019 - 31 October 2019), "4 October 2019" (4 October 2019 - 31 December 2019). You could then drop the 2018 dates and duplicates that may arise from the overlapping dates.
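One way to check that the chosen versions span the entire year is sketched below. It is only an illustration: the function name find_uncovered_dates and the "version" key are made up for this example, and the date ranges are the six 2019 ranges listed above.

```python
from datetime import date, timedelta
from typing import Dict, List


def find_uncovered_dates(feeds: List[Dict[str, str]], year: int) -> List[date]:
    """Return the dates in `year` that no feed's date range covers."""
    covered = set()
    for feed in feeds:
        day = date.fromisoformat(feed["feed_start_date"])
        end = date.fromisoformat(feed["feed_end_date"])
        while day <= end:
            covered.add(day)
            day += timedelta(days=1)

    missing = []
    day = date(year, 1, 1)
    while day.year == year:
        if day not in covered:
            missing.append(day)
        day += timedelta(days=1)
    return missing


# Date ranges copied from the 2019 versions listed above.
# The "version" key is a hypothetical label used only in this sketch.
feeds_2019 = [
    {"version": "7 November 2018",
     "feed_start_date": "2018-11-06", "feed_end_date": "2019-01-31"},
    {"version": "31 January 2019",
     "feed_start_date": "2019-01-30", "feed_end_date": "2019-03-31"},
    {"version": "14 April 2019",
     "feed_start_date": "2019-03-29", "feed_end_date": "2019-05-31"},
    {"version": "16 May 2019",
     "feed_start_date": "2019-05-13", "feed_end_date": "2019-07-31"},
    {"version": "5 August 2019",
     "feed_start_date": "2019-08-01", "feed_end_date": "2019-10-31"},
    {"version": "4 October 2019",
     "feed_start_date": "2019-10-04", "feed_end_date": "2019-12-31"},
]

print(find_uncovered_dates(feeds_2019, 2019))
# []
```

An empty list means every day of 2019 falls inside at least one of the chosen date ranges.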

Set up your virtual environment with the required packages by following the instructions in the README, and activate it. Once you have the schedule feeds of interest, run the snippet below from inside the data_analysis directory. If running from the project root, change static_gtfs_analysis to data_analysis.static_gtfs_analysis.

import logging
from typing import List

import pandas as pd
from tqdm import tqdm

import static_gtfs_analysis

logger = logging.getLogger()
logging.basicConfig(level=logging.INFO)

def fetch_data_from_schedule(schedule_feeds: List[dict]) -> List[dict]:
    """Retrieve data from the GTFS file for various schedule versions.

    Args:
        schedule_feeds (List[dict]): A list of dictionaries containing
            the version, start_date, and end_date as keys.

    Returns:
        List[dict]: A list of dictionaries with the schedule version and the
            corresponding data.
    """
    schedule_data_list = []
    pbar = tqdm(schedule_feeds)
    for feed in pbar:
        schedule_version = feed["schedule_version"]
        pbar.set_description(
            f"Generating daily schedule data for "
            f"schedule version {schedule_version}"
        )
        logging.info(
            f"\nDownloading zip file for schedule version "
            f"{schedule_version}"
        )
        CTA_GTFS = static_gtfs_analysis.download_zip(schedule_version)
        logging.info("\nExtracting data")
        data = static_gtfs_analysis.GTFSFeed.extract_data(
            CTA_GTFS,
            version_id=schedule_version
        )
        data = static_gtfs_analysis.format_dates_hours(data)

        logging.info("\nSummarizing trip data")
        trip_summary = static_gtfs_analysis.make_trip_summary(data)

        route_daily_summary = (
            static_gtfs_analysis
            .summarize_date_rt(trip_summary)
        )

        schedule_data_list.append(
            {"schedule_version": schedule_version,
                "data": route_daily_summary}
        )
    return schedule_data_list

schedule_feeds = [
    {
        "schedule_version": "20181107",
        "feed_start_date": "2018-11-06",
        "feed_end_date": "2019-01-31",
    },
    # Enter remaining schedule versions of interest from 2019, 2020, and 2021
    {
        "schedule_version": "20220507",
        "feed_start_date": "2022-05-20",
        "feed_end_date": "2022-06-02",
    },
    {
        "schedule_version": "20220603",
        "feed_start_date": "2022-06-04",
        "feed_end_date": "2022-06-07",
    },
    {
        "schedule_version": "20220608",
        "feed_start_date": "2022-06-09",
        "feed_end_date": "2022-07-08",
    },
    {
        "schedule_version": "20220709",
        "feed_start_date": "2022-07-10",
        "feed_end_date": "2022-07-17",
    },
    {
        "schedule_version": "20220718",
        "feed_start_date": "2022-07-19",
        "feed_end_date": "2022-07-20",
    },
]

schedule_data_list = fetch_data_from_schedule(schedule_feeds)

Access the schedule data in the list with

schedule_df = pd.concat([feed["data"] for feed in schedule_data_list])
schedule_df.drop_duplicates(inplace=True)

schedule_df should have enough information to generate plots of scheduled trip counts by day.
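As a rough sketch of the aggregation step before plotting, the route-level counts could be summed per day. The helper name daily_scheduled_trips and the tiny example frame are made up for illustration; only the column names (date, route_id, trip_count) come from the schedule_df output shown in this issue.

```python
import pandas as pd


def daily_scheduled_trips(schedule_df: pd.DataFrame) -> pd.Series:
    """Sum trip_count over all routes for each date."""
    df = schedule_df.copy()
    df["date"] = pd.to_datetime(df["date"])
    return df.groupby("date")["trip_count"].sum().sort_index()


# Made-up rows with the same columns as schedule_df
example_df = pd.DataFrame(
    {
        "date": ["2019-01-01", "2019-01-01", "2019-01-02"],
        "route_id": ["1", "100", "1"],
        "trip_count": [10, 20, 15],
    }
)
daily_totals = daily_scheduled_trips(example_df)
print(daily_totals)
```

With matplotlib installed, calling daily_totals.plot() on the resulting Series should draw the time series of scheduled trips by day.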

Example

schedule_feeds = [
    {
        "schedule_version": "20181107",
        "feed_start_date": "2018-11-06",
        "feed_end_date": "2019-01-31"
    },
    {
        "schedule_version": "20190131",
        "feed_start_date": "2019-01-30",
        "feed_end_date": "2019-03-31"
    }
]

schedule_data_list = fetch_data_from_schedule(schedule_feeds)

schedule_df = pd.concat([feed["data"] for feed in schedule_data_list])
print(schedule_df.duplicated().sum())
# 194 duplicates
schedule_df.drop_duplicates(inplace=True)
print(schedule_df.head())
#        date route_id  trip_count
# 0  2018-11-06        1          95
# 1  2018-11-06      100          73
# 2  2018-11-06      103         194
# 3  2018-11-06      106         169
# 4  2018-11-06      108          97

# Let's check the date ranges
print(f"The earliest date is {schedule_df.date.min()}")
print(f"The latest date is {schedule_df.date.max()}")
# The earliest date is 2018-11-06
# The latest date is 2019-03-31

# Drop the 2018 dates
schedule_df['date'] = pd.to_datetime(schedule_df["date"])
schedule_df_2019 = schedule_df.loc[schedule_df['date'].dt.year != 2018].copy()

# Print the date ranges again
print(f"The earliest date is {schedule_df_2019.date.min()}")
print(f"The latest date is {schedule_df_2019.date.max()}")
# The earliest date is 2019-01-01 00:00:00
# The latest date is 2019-03-31 00:00:00

lauriemerrell commented 1 year ago

I do think that for historical versions you probably would want to pick up all successive versions, because if there's a new version released it likely contains corrections. So the best way to treat a schedule zipfile is to consider it "in effect" for the period where both of the following are true:

  1. It was online (i.e., the period between when it was uploaded and when the next version was uploaded)
  2. feed_info.txt (if present) indicates that the feed is valid, i.e., dates between feed_info.feed_start_date and feed_info.feed_end_date, if both those fields are present & populated

Theoretically, if both those conditions are met but there are no active services in calendar.txt or calendar_dates.txt, the feed is explicitly saying that there is no service scheduled for the periods where the feed is "in effect" but no service is active, if that makes sense. I can also pair with someone on this if needed. I think we may want to create some kind of shared config file in the repo that contains feed "in effect" logic that we can use across contexts for consistency and to avoid making everyone deal with this problem individually... 🤔
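A minimal sketch of the two conditions, under some assumptions: the function name in_effect_period is hypothetical, a feed is treated as online from its upload date through the day before the next upload, and missing feed_info dates are treated as "no restriction". The example dates are illustrative and treat the version name as the upload date.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def in_effect_period(
    uploaded: date,
    next_uploaded: Optional[date],
    feed_start: Optional[date] = None,
    feed_end: Optional[date] = None,
) -> Optional[Tuple[date, Optional[date]]]:
    """Intersect the online window with the feed_info validity window.

    Returns (first_day, last_day), where last_day is None if the period is
    open-ended, or None if the two windows do not overlap.
    """
    # Condition 1: online from upload until the day before the next upload
    # (assumption: the new version takes over on its upload date).
    online_end = None
    if next_uploaded is not None:
        online_end = next_uploaded - timedelta(days=1)

    # Condition 2: within feed_info's start/end dates, when they are present.
    start = uploaded if feed_start is None else max(uploaded, feed_start)
    ends = [d for d in (online_end, feed_end) if d is not None]
    end = min(ends) if ends else None

    if end is not None and start > end:
        return None
    return start, end


print(
    in_effect_period(
        uploaded=date(2019, 1, 31),
        next_uploaded=date(2019, 4, 14),
        feed_start=date(2019, 1, 30),
        feed_end=date(2019, 3, 31),
    )
)
# (datetime.date(2019, 1, 31), datetime.date(2019, 3, 31))
```

Keeping this in one shared function is one possible way to get the consistency across contexts mentioned above.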

dcjohnson24 commented 1 year ago

Thanks for the explanation. I can try to write something that would check for those two conditions. Would someone still want to download that feed even if calendar.txt and calendar_dates.txt are empty?

lauriemerrell commented 1 year ago

If calendar and calendar_dates are empty, that's presumably an error and I'd assume that they uploaded a new version fairly quickly. So perhaps in that case you'd just keep the most recent one with service populated.
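A quick check for that case might look like the sketch below. It uses only the standard library; the names has_service_rows and make_zip are hypothetical, and the check only tests whether either file has any data rows, not whether a service is active on a particular date. The calendar.txt header in the example is abbreviated and made up.

```python
import csv
import io
import zipfile
from typing import BinaryIO, Dict, Union


def has_service_rows(gtfs_zip: Union[str, BinaryIO]) -> bool:
    """Return True if calendar.txt or calendar_dates.txt has a data row."""
    with zipfile.ZipFile(gtfs_zip) as zf:
        names = set(zf.namelist())
        for filename in ("calendar.txt", "calendar_dates.txt"):
            if filename not in names:
                continue
            with zf.open(filename) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8-sig")
                reader = csv.reader(text)
                next(reader, None)  # skip the header row
                if any(row for row in reader):
                    return True
    return False


def make_zip(files: Dict[str, str]) -> io.BytesIO:
    """Build an in-memory zip for the example (not part of the check)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    buffer.seek(0)
    return buffer


header = "service_id,monday,start_date,end_date\n"
empty_feed = make_zip({"calendar.txt": header})
populated_feed = make_zip({"calendar.txt": header + "1,1,20210901,20210930\n"})

print(has_service_rows(empty_feed))      # False
print(has_service_rows(populated_feed))  # True
```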

(Sidebar, this kind of thing is why I want to take a second look at the general schedule aggregation code in the RT vs. schedule script.)

dcjohnson24 commented 1 year ago

I've just submitted a pull request (#35) that makes a first attempt at generating schedule feeds. I wasn't sure how to handle the following case though.

In [10]: create_schedule_list(10, 2020)
INFO:root: Searching page 1
INFO:root: Searching page 2
INFO:root: Searching page 3
INFO:root: Searching page 4
INFO:root: Searching page 5
INFO:root: Searching page 6
INFO:root: Found schedule for October 2020
INFO:root: Adding schedule for October 10, 2020
INFO:root: The duplicate schedule versions are {'1 September 2021'}. Check whether these were in-effect
Out[10]: 
[{'schedule_version': '20201010',
  'feed_start_date': '2020-10-11',
  'feed_end_date': '2020-11-13'},
 {'schedule_version': '20201114',
  'feed_start_date': '2020-11-15',
  'feed_end_date': '2020-11-19'},
 {'schedule_version': '20201120',
  'feed_start_date': '2020-11-21',
  'feed_end_date': '2020-12-11'},
 {'schedule_version': '20201212',
  'feed_start_date': '2020-12-13',
  'feed_end_date': '2021-01-03'},
 {'schedule_version': '20210104',
  'feed_start_date': '2021-01-05',
  'feed_end_date': '2021-03-17'},
 {'schedule_version': '20210318',
  'feed_start_date': '2021-03-19',
  'feed_end_date': '2021-03-25'},
 {'schedule_version': '20210326',
  'feed_start_date': '2021-03-27',
  'feed_end_date': '2021-04-22'},
 {'schedule_version': '20210423',
  'feed_start_date': '2021-04-24',
  'feed_end_date': '2021-04-26'},
 {'schedule_version': '20210427',
  'feed_start_date': '2021-04-28',
  'feed_end_date': '2021-05-03'},
 {'schedule_version': '20210504',
  'feed_start_date': '2021-05-05',
  'feed_end_date': '2021-05-12'},
 {'schedule_version': '20210513',
  'feed_start_date': '2021-05-14',
  'feed_end_date': '2021-05-27'},
 {'schedule_version': '20210528',
  'feed_start_date': '2021-05-29',
  'feed_end_date': '2021-06-09'},
 {'schedule_version': '20210610',
  'feed_start_date': '2021-06-11',
  'feed_end_date': '2021-06-14'},
 {'schedule_version': '20210615',
  'feed_start_date': '2021-06-16',
  'feed_end_date': '2021-08-01'},
 {'schedule_version': '20210802',
  'feed_start_date': '2021-08-03',
  'feed_end_date': '2021-08-31'},
 {'schedule_version': '20210901',
  'feed_start_date': '2021-09-02',
  'feed_end_date': '2021-09-06'},
 {'schedule_version': '20210907',
  'feed_start_date': '2021-09-08',
...
]

There are multiple versions of 1 September 2021 here. Would this violate condition 1 because there is no gap between successive schedule versions? I was planning to drop it, but it looks like calendar.txt is nonempty. I settled for dropping the duplicates and keeping one version of 1 September 2021 for the date range computation. I'm not sure whether this is the correct approach though.

lauriemerrell commented 1 year ago

Oof, that's wild (that they had 3 versions). In cases of multiples, I'd just keep the final version that was left up (because that was the one that was actually online for subsequent days). Because the way this works on the actual CTA website is just that there's a current version, and whatever is the final version uploaded on a given date is the one that was left on the actual website the longest (into subsequent dates).

dcjohnson24 commented 1 year ago

Okay thanks, I'll modify it to take the latest version then.
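Taking the latest version for each date could be sketched roughly as follows. This is not the code from the pull request: the function name keep_last_per_date and the "feed_id" field are hypothetical, and the list is assumed to be ordered from oldest upload to newest.

```python
from typing import Dict, List


def keep_last_per_date(versions: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the last-listed entry for each schedule_version date.

    Assumes `versions` is ordered from oldest upload to newest.
    """
    latest: Dict[str, Dict[str, str]] = {}
    for version in versions:
        # A later entry with the same date overwrites the earlier one
        latest[version["schedule_version"]] = version
    return list(latest.values())


# "feed_id" is a made-up field to tell same-day uploads apart
versions = [
    {"schedule_version": "20210802", "feed_id": "a"},
    {"schedule_version": "20210901", "feed_id": "b"},
    {"schedule_version": "20210901", "feed_id": "c"},
    {"schedule_version": "20210901", "feed_id": "d"},
    {"schedule_version": "20210907", "feed_id": "e"},
]

print([v["feed_id"] for v in keep_last_per_date(versions)])
# ['a', 'd', 'e']
```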