derhuerst / vbb-gtfs

GTFS data for Berlin & Brandenburg public transport (VBB).
https://vbb-gtfs.jannisr.de/

stops.txt contains three leading bytes which do not belong there #5

Closed derhuerst closed 6 years ago

derhuerst commented 6 years ago

The October dataset's stop_times.txt references a lot of stations that do not exist in stops.txt.

Parts of the build log of vbb-stations, taken every 100k lines:

Unknown station 130000080002 at stop_times.txt arrival with trip ID 77013114 sequence # 0
Unknown station 130000080011 at stop_times.txt arrival with trip ID 77013114 sequence # 1
Unknown station 130000080341 at stop_times.txt arrival with trip ID 77013114 sequence # 2
Unknown station 130000088241 at stop_times.txt arrival with trip ID 77013114 sequence # 3
Unknown station 130000080361 at stop_times.txt arrival with trip ID 77013114 sequence # 4
Unknown station 130000080351 at stop_times.txt arrival with trip ID 77013114 sequence # 5
Unknown station 130000082651 at stop_times.txt arrival with trip ID 77013114 sequence # 6
Unknown station 130000082661 at stop_times.txt arrival with trip ID 77013114 sequence # 7
Unknown station 130000082671 at stop_times.txt arrival with trip ID 77013114 sequence # 8
Unknown station 130000082681 at stop_times.txt arrival with trip ID 77013114 sequence # 9
Unknown station 250000006802 at stop_times.txt arrival with trip ID 76976464 sequence # 15
Unknown station 250000008402 at stop_times.txt arrival with trip ID 76976464 sequence # 16
Unknown station 250000000402 at stop_times.txt arrival with trip ID 76976464 sequence # 17
Unknown station 250000000601 at stop_times.txt arrival with trip ID 76976464 sequence # 18
Unknown station 250000002801 at stop_times.txt arrival with trip ID 76976464 sequence # 19
Unknown station 250000002001 at stop_times.txt arrival with trip ID 76976464 sequence # 20
Unknown station 250000000201 at stop_times.txt arrival with trip ID 76976464 sequence # 21
Unknown station 250000001401 at stop_times.txt arrival with trip ID 76976464 sequence # 22
Unknown station 250000001301 at stop_times.txt arrival with trip ID 76976464 sequence # 23
Unknown station 250000000702 at stop_times.txt arrival with trip ID 76976464 sequence # 24
Unknown station 100000621802 at stop_times.txt arrival with trip ID 76961184 sequence # 3
Unknown station 100000620702 at stop_times.txt arrival with trip ID 76961184 sequence # 4
Unknown station 100000621401 at stop_times.txt arrival with trip ID 76961184 sequence # 5
Unknown station 100000620901 at stop_times.txt arrival with trip ID 76961184 sequence # 6
Unknown station 100000622902 at stop_times.txt arrival with trip ID 76961184 sequence # 7
Unknown station 100000620601 at stop_times.txt arrival with trip ID 76961184 sequence # 8
Unknown station 100000620501 at stop_times.txt arrival with trip ID 76961184 sequence # 9
Unknown station 100000634102 at stop_times.txt arrival with trip ID 76961184 sequence # 10
Unknown station 100000634002 at stop_times.txt arrival with trip ID 76961184 sequence # 11
Unknown station 100000595602 at stop_times.txt arrival with trip ID 76961184 sequence # 12
Unknown station 270000042302 at stop_times.txt arrival with trip ID 79044265 sequence # 16
Unknown station 270000030702 at stop_times.txt arrival with trip ID 79044265 sequence # 17
Unknown station 270000000902 at stop_times.txt arrival with trip ID 79044265 sequence # 18
Unknown station 270000039202 at stop_times.txt arrival with trip ID 79044265 sequence # 19
Unknown station 270000010302 at stop_times.txt arrival with trip ID 79044265 sequence # 20
Unknown station 270000035102 at stop_times.txt arrival with trip ID 79044265 sequence # 21
Unknown station 270000040002 at stop_times.txt arrival with trip ID 79044265 sequence # 22
Unknown station 270000040902 at stop_times.txt arrival with trip ID 79044265 sequence # 23
Unknown station 270000039502 at stop_times.txt arrival with trip ID 79044265 sequence # 24
Unknown station 270000004702 at stop_times.txt arrival with trip ID 79044265 sequence # 25
Unknown station 060024102372 at stop_times.txt arrival with trip ID 78860513 sequence # 18
Unknown station 060180001834 at stop_times.txt arrival with trip ID 78860512 sequence # 0
Unknown station 060180002824 at stop_times.txt arrival with trip ID 78860512 sequence # 1
Unknown station 060162001814 at stop_times.txt arrival with trip ID 78860512 sequence # 2
Unknown station 060160002804 at stop_times.txt arrival with trip ID 78860512 sequence # 3
Unknown station 060160001001 at stop_times.txt arrival with trip ID 78860512 sequence # 4
Unknown station 060120003652 at stop_times.txt arrival with trip ID 78860512 sequence # 5
Unknown station 060120004621 at stop_times.txt arrival with trip ID 78860512 sequence # 6
Unknown station 060120005010 at stop_times.txt arrival with trip ID 78860512 sequence # 7
Unknown station 060100004704 at stop_times.txt arrival with trip ID 78860512 sequence # 8

Find the full log here: errlog.gz

derhuerst commented 6 years ago

Turns out this is an encoding issue. The stops.txt file contains three leading bytes which do not belong to the CSV:

xxd -l 32 stops.txt
00000000: efbb bf22 7374 6f70 5f69 6422 2c22 7374  ..."stop_id","st
00000010: 6f70 5f63 6f64 6522 2c22 7374 6f70 5f6e  op_code","stop_n
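
A scripted version of the same check could look like the sketch below. `has_utf8_bom` is a hypothetical helper name, not something from the vbb-stations build scripts. It assumes `head -c` is available, which it is on GNU and BSD systems.

```shell
#!/bin/sh
# Hypothetical helper (not part of any existing build script):
# exits 0 if the file named by $1 starts with the bytes EF BB BF.
has_utf8_bom() {
  # Dump the first 3 bytes as hex and drop od's spacing and newline.
  first=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$first" = "efbbbf" ]
}

# Example call, left as a comment because it needs a local stops.txt:
#   has_utf8_bom stops.txt && echo "stops.txt starts with a UTF-8 BOM"
```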

derhuerst commented 6 years ago

Removing these bytes works around the problem for now:

dd if=stops.broken.txt of=stops.txt bs=3 skip=1
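
The `dd` call above drops the first three bytes unconditionally, so it would also eat real data if a later dataset ships without them. A more defensive sketch, assuming a hypothetical helper named `strip_utf8_bom` and distinct input and output paths, checks first:

```shell
#!/bin/sh
# Hypothetical helper: copy $1 to $2, dropping a leading UTF-8 BOM
# (EF BB BF) only if the file actually starts with one.
# $1 and $2 must be different files.
strip_utf8_bom() {
  first=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
  if [ "$first" = "efbbbf" ]; then
    # tail -c +4 starts output at the 4th byte, i.e. right after the BOM.
    tail -c +4 "$1" > "$2"
  else
    cp "$1" "$2"
  fi
}

# Example call, left as a comment because it needs the local files:
#   strip_utf8_bom stops.broken.txt stops.txt
```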

dekarl commented 6 years ago

Just stumbled upon this. The three bytes (EF BB BF) are the UTF-8 byte order mark (BOM), which is acceptable in GTFS files as per https://developers.google.com/transit/gtfs/reference/#file_requirements
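
Since the BOM is allowed, one option for a consumer is to leave the published file untouched and skip the BOM while reading. The filter below is only a sketch of that idea. `skip_utf8_bom` is a made-up name, and it assumes the data is piped through `sed` before the CSV parser sees it.

```shell
#!/bin/sh
# Hypothetical stdin-to-stdout filter: removes a UTF-8 BOM from the start
# of the first line, if present, and passes everything else through.
skip_utf8_bom() {
  # Build the 3-byte BOM from octal escapes (EF BB BF).
  bom=$(printf '\357\273\277')
  # LC_ALL=C makes sed treat the pattern and the input as plain bytes.
  LC_ALL=C sed "1s/^${bom}//"
}

# Example call, left as a comment because it needs a local stops.txt:
#   skip_utf8_bom < stops.txt | head -n 1
```

With this in front of the parser, the first header column reads as `"stop_id"` whether or not the file carries a BOM.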

derhuerst commented 6 years ago

Thanks, I adapted my build scripts.