Closed lauriemerrell closed 2 years ago
It was previously reported that Yolo County Transportation District (itp_id 372) is also not showing up in gtfs_schedule_fact_daily_service. I can verify that Yolo County Transportation is also not included in the list of agencies that we generate reports for October and November.
Writing down my research journey,
tbl_views.gtfs_schedule_fact_daily_service
tbl_daily_service_hours
nor in the tbl.views.gtfs.schedule_fact_daily_service
but it does appear in the gtfs_schedule_service_daily
with the values is in service marked as truegtfs_schedule_fact_daily_trips
, I dug deeper into the sql query. stg_daily_service subtable is able to be generated with calitp 273 but daily_service_trips subtable is not able to be generated with calitp 273.JOIN `views.gtfs_schedule_dim_trips` t2
USING (calitp_itp_id, calitp_url_number, service_id)
and change it to:
JOIN `views.gtfs_schedule_dim_trips` t2
USING (calitp_itp_id, calitp_url_number)
you are able to calculate the service hours for Sacramento
Here is the following query I used:
WITH
stg_daily_service_keyed AS (
SELECT
T2.feed_key
, T1.*
FROM `views.gtfs_schedule_stg_daily_service` T1
JOIN `views.gtfs_schedule_dim_feeds` T2
ON
T1.calitp_itp_id = T2.calitp_itp_id
AND T1.calitp_url_number = T2.calitp_url_number
AND T2.calitp_extracted_at <= T1.service_date
AND T2.calitp_deleted_at > T1.service_date
),
daily_service_trips AS (
SELECT
t1.feed_key
, t2.trip_key
, t2.trip_id
, t2.route_id
, t1.* EXCEPT (feed_key)
FROM stg_daily_service_keyed t1
JOIN `views.gtfs_schedule_dim_trips` t2
USING (calitp_itp_id, calitp_url_number)
# Here is where the change was done
WHERE
t2.calitp_extracted_at <= t1.service_date
AND t2.calitp_deleted_at > t1.service_date
),
service_dates AS (
(SELECT DISTINCT service_date FROM `views.gtfs_schedule_stg_daily_service`)
),
trip_summary AS (
SELECT
t1.calitp_itp_id
, t1.calitp_url_number
, t1.trip_id
, t2.service_date
, COUNT(DISTINCT t1.stop_id) AS n_stops
, COUNT(*) AS n_stop_times
, MIN(t1.departure_ts) AS trip_first_departure_ts
, MAX(t1.arrival_ts) AS trip_last_arrival_ts
FROM `views.gtfs_schedule_dim_stop_times` t1
JOIN service_dates t2
ON t1.calitp_extracted_at <= t2.service_date
AND COALESCE(t1.calitp_deleted_at, DATE("2099-01-01")) > t2.service_date
GROUP BY 1, 2, 3, 4
)
SELECT * from trip_summary where calitp_itp_id = 273
@lauriemerrell and @Nkdiaz will sync later today
I think I might have found it!
service_id
in gtfs_schedule_stg_daily_service
appears to have leading whitespace but service_id
in gtfs_schedule_dim_trips
does not, which is probably why the join fails.
SELECT * FROM `cal-itp-data-infra.views.gtfs_schedule_stg_daily_service`
WHERE calitp_itp_id = 273 LIMIT 10
returns
{
"calitp_itp_id": "273",
"calitp_url_number": "0",
"service_id": " 7",
"service_date": "2021-10-19",
"calitp_extracted_at": "2021-07-31",
"calitp_deleted_at": "2021-12-17",
"service_indicator": "0",
"service_start_date": "2021-08-29",
"service_end_date": "2022-01-01",
"service_inclusion": null,
"service_exclusion": null,
"is_in_service": false
},
while
SELECT * FROM `cal-itp-data-infra.views.gtfs_schedule_dim_trips`
WHERE calitp_itp_id = 273 LIMIT 10
returns
{
"calitp_itp_id": "273",
"calitp_url_number": "0",
"route_id": "533",
"service_id": "7",
"trip_id": "968044",
"shape_id": "39629",
"trip_headsign": "BLUE LINE TO WATT I-80",
"trip_short_name": null,
"direction_id": "1",
"block_id": "0008",
"wheelchair_accessible": null,
"bikes_allowed": null,
"calitp_extracted_at": "2021-05-17",
"calitp_hash": "/vW4LhG2spNwQrAc6Chfag==",
"trip_key": "8782899296912744528",
"calitp_deleted_at": "2021-07-31"
},
Great work @edasmalchi , I wasn't viewing the results in Json which is why I might have missed it. I ran this query
SELECT distinct(calitp_itp_id), service_id FROM `cal-itp-data-infra.views.gtfs_schedule_stg_daily_service`
WHERE service_id like ' %'
To see if there are any other leading white spaces in the table and it returns calitp_ids 271(Roseville) and 273(SacRT). What's interesting is that Roseville is still able to have its service hours generated in the reports. I can update the table generation to trim the whitespaces
I downloaded the raw files for both Roseville and SacRT and am seeing the raw data includes the preceding whitespace in the calendar.txt files, but not in the trips.txt file. Shown below is the calendar.txt file for SacRT. Roseville has some, but not all of their service IDs with leading whitespace.
We need to make a decision on whether or not to treat leading whitespace in a strict sense or tolerate it in our pipeline.
In the strictest sense, this leading space is actually a part of the ID. The GTFS spec says in the field type docs that for an ID field:
An ID field value is an internal ID, not intended to be shown to riders, and is a sequence of any UTF-8 characters.
Also, in the File Requirements doc section it states:
Remove any extra spaces between fields or field names. Many parsers consider the spaces to be part of the value, which may cause errors.
However, in v3.0.0 of the gtfs-validator, https://github.com/MobilityData/gtfs-validator/pull/929 downgraded the notice of LeadingOrTrailingWhitespacesNotice
from ERROR to WARNING. This was due to aligning the spec with RFC 2119 as noted in https://github.com/MobilityData/gtfs-validator/issues/920. So that perhaps indicates that this desire of having no extra spaces may not be a strict requirement for ID fields. Regardless, the LeadingOrTrailingWhitespacesNotice
will point out the issue.
But as an aside, the LeadingOrTrailingWhitespacesNotice
was not outputted in our gtfs-validation pipeline as seen in the very last question of the GTFS Guidelines dashboard:
That seems to be that LeadingOrTrailingWhitespacesNotice
was added in https://github.com/MobilityData/gtfs-validator/pull/729 which appears to have made it into v2.0.0 of the gtfs-validator. So that means the gtfs-validator version our pipeline is really outdated now, especially considering that v3.0.0 was just released today. In fact it appears we are using v1.0.0 in our gtfs-validator-api Dockerfile which is nearly 2 years old now.
Thanks @evansiroky... Can we make a new separate issue to investigate/track bumping our pipeline use a more recent version of the validator? I'm happy to do that (make the ticket; I'm not sure how hard it is to actually implement) unless you'd prefer to.
Definitely agree that any adjustment here should be universal (i.e., we accept leading/trailing whitespace in general or we do not... so, we'd want to update all fields, not just service_id
)
In this case, because they have the leading whitespace in one file and not another, I'm kind of inclined to think of the data as wrong (corrupt) and flag it for the agency (if that's something that we can easily do?) rather than trimming it on our side. Who else, if anyone, should be part of making this call?
Finally, I wonder if we want to try to assess how widespread this is for other fields (leading/trailing whitespace in one file but not another leading to failed join)? Bumping the validator and then looking for LeadingOrTrailingWhitespacesNotice
might be the easiest way to do that, unless we think it's worth it to search manually. Seems plausible that this could be affecting routes, trips, etc.
Thanks for chiming in @lauriemerrell. We have https://github.com/cal-itp/data-infra/issues/685 to track updating to v3 of the validator. A quick side note about that is that Michael asserted that we're running v2 of the validator. Sure enough, I am seeing v2.0.0
being returned in the SacRT validation report JSON file, so I am really confused how that is possible given the current code setup and also how that notice wasn't outputted.
Regarding what to do about the leading/trailing whitespaces, I would appreciate it if @Nkdiaz, @edasmalchi and @lauriemerrell all paired together to figure out what to do about it. Can you all coordinate without me to come up with a recommendation?
A few extra notes before Natalya, Eric, and I meet later this afternoon.
I ran the version 2.0.0 GTFS validator locally and confirmed that it does not raise the LeadingOrTrailingWhitespacesNotice
for this feed even though the issue is present (results attached for reference).
So I think this is a bug in the validator and not an issue with our pipeline?
sacrt_local_validation_results.json.zip
Interestingly, this query:
SELECT DISTINCT filename FROM cal-itp-data-infra.views.validation_fact_daily_feed_notices WHERE code = 'leading_or_trailing_whitespaces'
Indicates that the only files for which this error have been raised are stop_times.txt
, stops.txt
, trips.txt
, and routes.txt
(i.e., we've never had this error raised for calendar.txt
or calendar_dates.txt
which is where it's appearing for SacRT).
For documentation's sake, after meeting with @edasmalchi and @lauriemerrell, we will split this issue into 3 tickets Issue: SacRT is not appearing in reports or available for analysis: Decision: Trim SacRT and Rose specifically in the validation_fact_daily_feed_notices so it is available for reports and for Analysis. We will not apply the trimming to the data uniformly.
Issue: SacRT data appears to be inherently corrupted with leading white spaces. if possible we should have them standardize the white spaces before we ingest it Decision: reach out to team members at calitp who can help advise a way to contact SacRT
Issue: We were unable to view if a calitp_id is dropped from a join like SacRT and Roseville were for validation_fact_daily_feed_notices. Decision: Research a way to log when an entry is dropped from a table join in bigquery
(ope sorry @Nkdiaz I was drafting this when you posted.... posting anyway for additional detail)
@Nkdiaz, @edasmalchi, and I met today to discuss this issue.
Conversation takeaways:
Next steps:
service_id
) for SacRT and Roseville since we know they are affected and to support analysts having access to the data. Submitted as #943.Ok one (last? 🤞) update. Per Slack discussion, it sounds like there is a preference to just trim whitespace more generally. Closing #943 above and creating #946 instead.
See #946
Describe the bug The October and November published reports and the draft December report for SacRT show 0 service activity. Service hours for all days for all three months are 0.
However this agency was active during these months.
Not sure if this is an error with the data infra side (is the data being pulled) or the report.... Submitting on infra side for now?
To Reproduce Steps to reproduce the behavior:
Expected behavior Should show some positive active hours for this agency.