cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
47 stars 12 forks source link

Bug: SacRT agency shows no service for several months despite being active #914

Closed lauriemerrell closed 2 years ago

lauriemerrell commented 2 years ago

Describe the bug The October and November published reports and the draft December report for SacRT show 0 service activity. Service hours for all days for all three months are 0.

However this agency was active during these months.

Not sure if this is an error with the data infra side (is the data being pulled) or the report.... Submitting on infra side for now?

To Reproduce Steps to reproduce the behavior:

  1. Go to any of the monthly reports linked above
  2. Scroll down to the service hours chart
  3. No active hours

Expected behavior Should show some positive active hours for this agency.

Nkdiaz commented 2 years ago

It was previously reported that Yolo County Transportation District (itp_id 372) is also not showing up in gtfs_schedule_fact_daily_service. I can verify that Yolo County Transportation is also not included in the list of agencies that we generate reports for October and November.

Nkdiaz commented 2 years ago

Writing down my research journey,

Here is the following query I used:

WITH

stg_daily_service_keyed AS (
  SELECT
    T2.feed_key
    , T1.*
  FROM `views.gtfs_schedule_stg_daily_service` T1
  JOIN `views.gtfs_schedule_dim_feeds` T2
    ON
      T1.calitp_itp_id = T2.calitp_itp_id
      AND T1.calitp_url_number = T2.calitp_url_number
      AND T2.calitp_extracted_at <= T1.service_date
      AND T2.calitp_deleted_at > T1.service_date

),

daily_service_trips AS (
  SELECT
    t1.feed_key
    , t2.trip_key
    , t2.trip_id
    , t2.route_id
    , t1.* EXCEPT (feed_key)
  FROM stg_daily_service_keyed t1
  JOIN `views.gtfs_schedule_dim_trips` t2
    USING (calitp_itp_id, calitp_url_number)
# Here is where the change was done
  WHERE
    t2.calitp_extracted_at <= t1.service_date
    AND t2.calitp_deleted_at > t1.service_date
),
service_dates AS (
  (SELECT DISTINCT service_date FROM `views.gtfs_schedule_stg_daily_service`)
),
trip_summary AS (
  SELECT
    t1.calitp_itp_id
    , t1.calitp_url_number
    , t1.trip_id
    , t2.service_date
    , COUNT(DISTINCT t1.stop_id) AS n_stops
    , COUNT(*) AS n_stop_times
    , MIN(t1.departure_ts) AS trip_first_departure_ts
    , MAX(t1.arrival_ts) AS trip_last_arrival_ts
  FROM `views.gtfs_schedule_dim_stop_times` t1
  JOIN  service_dates t2
  ON t1.calitp_extracted_at <= t2.service_date
    AND COALESCE(t1.calitp_deleted_at, DATE("2099-01-01")) > t2.service_date
  GROUP BY 1, 2, 3, 4
)
SELECT * from trip_summary where calitp_itp_id = 273
holly-g commented 2 years ago

@lauriemerrell and @Nkdiaz will sync later today

edasmalchi commented 2 years ago

I think I might have found it!

service_id in gtfs_schedule_stg_daily_service appears to have leading whitespace but service_id in gtfs_schedule_dim_trips does not, which is probably why the join fails.

SELECT * FROM `cal-itp-data-infra.views.gtfs_schedule_stg_daily_service`
WHERE calitp_itp_id = 273 LIMIT 10

returns

{
    "calitp_itp_id": "273",
    "calitp_url_number": "0",
    "service_id": " 7",
    "service_date": "2021-10-19",
    "calitp_extracted_at": "2021-07-31",
    "calitp_deleted_at": "2021-12-17",
    "service_indicator": "0",
    "service_start_date": "2021-08-29",
    "service_end_date": "2022-01-01",
    "service_inclusion": null,
    "service_exclusion": null,
    "is_in_service": false
  },

while

SELECT * FROM `cal-itp-data-infra.views.gtfs_schedule_dim_trips`
WHERE calitp_itp_id = 273 LIMIT 10

returns

{
    "calitp_itp_id": "273",
    "calitp_url_number": "0",
    "route_id": "533",
    "service_id": "7",
    "trip_id": "968044",
    "shape_id": "39629",
    "trip_headsign": "BLUE LINE TO WATT I-80",
    "trip_short_name": null,
    "direction_id": "1",
    "block_id": "0008",
    "wheelchair_accessible": null,
    "bikes_allowed": null,
    "calitp_extracted_at": "2021-05-17",
    "calitp_hash": "/vW4LhG2spNwQrAc6Chfag==",
    "trip_key": "8782899296912744528",
    "calitp_deleted_at": "2021-07-31"
  },
Nkdiaz commented 2 years ago

Great work @edasmalchi , I wasn't viewing the results in Json which is why I might have missed it. I ran this query

SELECT distinct(calitp_itp_id), service_id FROM `cal-itp-data-infra.views.gtfs_schedule_stg_daily_service`
WHERE service_id like ' %'

To see if there are any other leading white spaces in the table and it returns calitp_ids 271(Roseville) and 273(SacRT). What's interesting is that Roseville is still able to have its service hours generated in the reports. I can update the table generation to trim the whitespaces

evansiroky commented 2 years ago

I downloaded the raw files for both Roseville and SacRT and am seeing the raw data includes the preceding whitespace in the calendar.txt files, but not in the trips.txt file. Shown below is the calendar.txt file for SacRT. Roseville has some, but not all of their service IDs with leading whitespace.

Screen Shot 2022-01-10 at 11 18 48 AM

We need to make a decision on whether or not to treat leading whitespace in a strict sense or tolerate it in our pipeline.

In the strictest sense, this leading space is actually a part of the ID. The GTFS spec says in the field type docs that for an ID field:

An ID field value is an internal ID, not intended to be shown to riders, and is a sequence of any UTF-8 characters.

Also, in the File Requirements doc section it states:

Remove any extra spaces between fields or field names. Many parsers consider the spaces to be part of the value, which may cause errors.

However, in v3.0.0 of the gtfs-validator, https://github.com/MobilityData/gtfs-validator/pull/929 downgraded the notice of LeadingOrTrailingWhitespacesNotice from ERROR to WARNING. This was due to aligning the spec with RFC 2119 as noted in https://github.com/MobilityData/gtfs-validator/issues/920. So that perhaps indicates that this desire of having no extra spaces may not be a strict requirement for ID fields. Regardless, the LeadingOrTrailingWhitespacesNotice will point out the issue.

But as an aside, the LeadingOrTrailingWhitespacesNotice was not outputted in our gtfs-validation pipeline as seen in the very last question of the GTFS Guidelines dashboard:

Screen Shot 2022-01-10 at 11 54 13 AM

That seems to be that LeadingOrTrailingWhitespacesNotice was added in https://github.com/MobilityData/gtfs-validator/pull/729 which appears to have made it into v2.0.0 of the gtfs-validator. So that means the gtfs-validator version our pipeline is really outdated now, especially considering that v3.0.0 was just released today. In fact it appears we are using v1.0.0 in our gtfs-validator-api Dockerfile which is nearly 2 years old now.

lauriemerrell commented 2 years ago

Thanks @evansiroky... Can we make a new separate issue to investigate/track bumping our pipeline use a more recent version of the validator? I'm happy to do that (make the ticket; I'm not sure how hard it is to actually implement) unless you'd prefer to.

Definitely agree that any adjustment here should be universal (i.e., we accept leading/trailing whitespace in general or we do not... so, we'd want to update all fields, not just service_id)

In this case, because they have the leading whitespace in one file and not another, I'm kind of inclined to think of the data as wrong (corrupt) and flag it for the agency (if that's something that we can easily do?) rather than trimming it on our side. Who else, if anyone, should be part of making this call?

Finally, I wonder if we want to try to assess how widespread this is for other fields (leading/trailing whitespace in one file but not another leading to failed join)? Bumping the validator and then looking for LeadingOrTrailingWhitespacesNotice might be the easiest way to do that, unless we think it's worth it to search manually. Seems plausible that this could be affecting routes, trips, etc.

evansiroky commented 2 years ago

Thanks for chiming in @lauriemerrell. We have https://github.com/cal-itp/data-infra/issues/685 to track updating to v3 of the validator. A quick side note about that is that Michael asserted that we're running v2 of the validator. Sure enough, I am seeing v2.0.0 being returned in the SacRT validation report JSON file, so I am really confused how that is possible given the current code setup and also how that notice wasn't outputted.

Regarding what to do about the leading/trailing whitespaces, I would appreciate it if @Nkdiaz, @edasmalchi and @lauriemerrell all paired together to figure out what to do about it. Can you all coordinate without me to come up with a recommendation?

lauriemerrell commented 2 years ago

A few extra notes before Natalya, Eric, and I meet later this afternoon.

I ran the version 2.0.0 GTFS validator locally and confirmed that it does not raise the LeadingOrTrailingWhitespacesNotice for this feed even though the issue is present (results attached for reference). So I think this is a bug in the validator and not an issue with our pipeline? sacrt_local_validation_results.json.zip

Interestingly, this query:

SELECT DISTINCT filename FROM cal-itp-data-infra.views.validation_fact_daily_feed_notices WHERE code = 'leading_or_trailing_whitespaces'

Indicates that the only files for which this error have been raised are stop_times.txt, stops.txt, trips.txt, and routes.txt (i.e., we've never had this error raised for calendar.txt or calendar_dates.txt which is where it's appearing for SacRT).

Nkdiaz commented 2 years ago

For documentation's sake, after meeting with @edasmalchi and @lauriemerrell, we will split this issue into 3 tickets Issue: SacRT is not appearing in reports or available for analysis: Decision: Trim SacRT and Rose specifically in the validation_fact_daily_feed_notices so it is available for reports and for Analysis. We will not apply the trimming to the data uniformly.

Issue: SacRT data appears to be inherently corrupted with leading white spaces. if possible we should have them standardize the white spaces before we ingest it Decision: reach out to team members at calitp who can help advise a way to contact SacRT

Issue: We were unable to view if a calitp_id is dropped from a join like SacRT and Roseville were for validation_fact_daily_feed_notices. Decision: Research a way to log when an entry is dropped from a table join in bigquery

lauriemerrell commented 2 years ago

(ope sorry @Nkdiaz I was drafting this when you posted.... posting anyway for additional detail)

@Nkdiaz, @edasmalchi, and I met today to discuss this issue.

Conversation takeaways:

Next steps:

lauriemerrell commented 2 years ago

Ok one (last? 🤞) update. Per Slack discussion, it sounds like there is a preference to just trim whitespace more generally. Closing #943 above and creating #946 instead.

holly-g commented 2 years ago

See #946