cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

Bug: Amtrak GTFS Schedule zipfile consists of Word documents (.docx) #399

Closed machow closed 2 years ago

machow commented 2 years ago

Describe the bug

The Amtrak data sent to us (link to download) appears to contain correct GTFS Schedule data (e.g. a file named agency.docx), but the files are not plain text.

The pipeline cannot ingest .docx files; it requires plain-text CSV data named {filetype}.txt (e.g. routes.txt).

To Resolve

  1. Convert data to appropriate CSV format
  2. Rename files to end in .txt
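
The two steps above could be scripted roughly as follows. This is a minimal sketch, not what was actually run for this issue: it assumes the files are genuine Word documents with one CSV row per paragraph, and the names `docx_to_lines` and `convert_feed` are hypothetical. It reads the document XML with the standard library only, so rows separated by soft line breaks instead of paragraphs would need extra handling.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used by the elements inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_lines(docx_path):
    """Return the text of each non-empty paragraph in a .docx file."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    lines = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text can be split across several runs (w:t nodes)
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            lines.append(text)
    return lines


def convert_feed(src_dir, dst_dir):
    """Write {name}.txt into dst_dir for every {name}.docx found in src_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for docx_path in sorted(Path(src_dir).glob("*.docx")):
        out_path = dst / (docx_path.stem + ".txt")
        out_path.write_text("\n".join(docx_to_lines(docx_path)) + "\n", encoding="utf-8")
        written.append(out_path.name)
    return written
```

Calling `convert_feed("amtrak_docx", "amtrak_txt")` (both directory names are placeholders) would leave agency.txt, routes.txt, and so on in the output folder, ready to be zipped.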

Note that I can still add this data to the pipeline and run validation on it. The validator will likely not have much interesting to say though.

machow commented 2 years ago

Here is the response the validator will give (essentially no files in the data).

{'report': {'notices': [{'code': 'missing_required_file',
                         'severity': 'ERROR',
                         'totalNotices': 5,
                         'notices': [{'filename': 'stop_times.txt'},
                                     {'filename': 'routes.txt'},
                                     {'filename': 'trips.txt'},
                                     {'filename': 'stops.txt'},
                                     {'filename': 'agency.txt'}]}]},
 'system_errors': {'notices': []}}
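
For reference, a response shaped like the one above can be reduced to a plain list of missing files. This is a small sketch that assumes only the dictionary layout shown here; `missing_required_files` is a hypothetical helper, not part of the validator.

```python
def missing_required_files(response):
    """Collect filenames from every 'missing_required_file' notice group."""
    missing = []
    for group in response.get("report", {}).get("notices", []):
        if group.get("code") == "missing_required_file":
            missing.extend(notice["filename"] for notice in group.get("notices", []))
    return missing
```

Applied to the response above, this returns the five required .txt files, which matches the "essentially no files" reading.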

I want to be a little careful about adding a feed with no correctly named GTFS Schedule files: it should work okay, but we've never encountered this before...

hunterowens commented 2 years ago

looking at agency.docx

agency_id,agency_name,agency_url,agency_timezone,agency_lang
99,Altamont Corridor Express,http://www.amtrak.com,America/New_York,en
1207,null,http://www.amtrak.com,America/New_York,en
1206,null,http://www.amtrak.com,America/New_York,en
51,Amtrak,http://www.amtrak.com,America/New_York,en
174,Amtrak,https://www.amtrak.com/thruway-connecting-services-multiply-your-travel-destinations,America/New_York,en
155,Badger Bus,https://www.amtrak.com/thruway-connecting-services-multiply-your-travel-destinations,America/New_York,en
154,BC Ferries Connector,https://www.amtrak.com/thruway-connecting-services-multiply-your-travel-destinations,America/New_York,en
1220,null,http://www.amtrak.com,America/New_York,en
123,Cantrail,https://www.amtrak.com/thruway-connecting-services-multiply-your-travel-destinations,America/New_York,en
192,null,http://www.amtrak.com,America/New_York,en
1217,null,http://www.amtrak.com,America/New_York,en
117,Executive Transportation,https://www.amtrak.com/thruway-connecting-services-multiply-your-travel-destinations,America/New_York,en
153,Express Arrow,https://www.amtrak.com/thruway-connecting-services-multiply-your-travel-destinations,America/New_York,en
23,Indian Trails,https://www.amtrak.com/thruway-connecting-services-multiply-your-travel-destinations,America/New_York,en
108,Martz Trailways,https://www.amtrak.com/thruway-connecting-services-multiply-your-travel-destinations,America/New_York,en
136,Peoria Charter,https://www.amtrak.com/thruway-connecting-services-multiply-your-travel-destinations,America/New_York,en
147,RoadRunneR Shuttle,https://www.amtrak.com/thruway-connecting-services-multiply-your-travel-destinations,America/New_York,en
137,Smart Way Connector,https://www.amtrak.com/thruway-connecting-services-multiply-your-travel-destinations,America/New_York,en
138,Van Galder Coach USA,https://www.amtrak.com/thruway-connecting-services-multiply-your-travel-destinations,America/New_York,en

it appears that it is probably just a matter of opening each file in Word / LibreOffice and going File -> Save As txt
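
After saving, a quick way to confirm a file really is plain CSV is to parse it and check the header. The sketch below is only an illustration: `check_gtfs_table` is a hypothetical helper, and the caller decides which columns to require.

```python
import csv
import io


def check_gtfs_table(text, required_columns):
    """Return a list of problems found in CSV text; an empty list means it parsed cleanly."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = [f"missing column: {col}" for col in required_columns if col not in header]
    for row in reader:
        # DictReader stores extra fields under the key None and fills
        # missing fields with the value None
        if None in row or None in row.values():
            problems.append(f"line {reader.line_num}: expected {len(header)} fields")
    return problems
```

For the agency data above, something like `check_gtfs_table(text, ["agency_name", "agency_url", "agency_timezone"])` should come back empty if the conversion preserved the rows.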

machow commented 2 years ago

That makes sense and things look pretty well formatted. @Nkdiaz if you have time to save them as .txt before we pair Monday, let's get you set up to handle two GTFS data intake related tasks:

(no worries if you are wrapping up your analysis, we can reformat quickly when we pair..!)

machow commented 2 years ago

Let's add the corrected Amtrak feed in Google Drive to the warehouse when we pair tomorrow, and then close this.

machow commented 2 years ago

We should ingest Amtrak around midnight UTC tonight, and have results by tomorrow :)

edasmalchi commented 2 years ago

Not finding Amtrak results yet, any way to check that it ingested successfully? (I'm filtering by itp_id = 13 in gtfs_schedule.trips)

machow commented 2 years ago

I checked quickly just now, and it appears that when the pipeline unzips the Amtrak schedule data, it gets back a folder with the data inside rather than the files themselves (so it essentially thinks the Amtrak data is empty). When I open it on my computer it looks fine. Let me try quickly doing it with Python to see what's going on...
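
One way to check for this from Python is to look at the member names in the archive and see whether every file sits under a single top-level folder. This is a sketch of that check, not the pipeline's own code: `common_top_level_dir` is a hypothetical name, and skipping `__MACOSX/` entries is an assumption about how the archive might have been created.

```python
import zipfile
from pathlib import PurePosixPath


def common_top_level_dir(zip_path):
    """Return the folder name if every file is nested under one top-level folder, else None."""
    with zipfile.ZipFile(zip_path) as zf:
        files = [
            name for name in zf.namelist()
            if not name.endswith("/") and not name.startswith("__MACOSX/")
        ]
    top_dirs = set()
    for name in files:
        parts = PurePosixPath(name).parts
        if len(parts) < 2:
            return None  # at least one file sits at the archive root
        top_dirs.add(parts[0])
    return top_dirs.pop() if len(top_dirs) == 1 else None
```

If this returns a folder name, the feed files are one level down, and code that only looks for agency.txt, routes.txt, etc. at the archive root would see nothing.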

hunterowens commented 2 years ago

I think this is fixed?

edasmalchi commented 2 years ago

Yep it's in the warehouse, I'll close.