CivicTechTO / ttc_subway_times

A scraper to grab and publish TTC subway arrival times.
GNU General Public License v3.0

Malformed JSON error not the most descriptive #51

Closed radumas closed 5 years ago

radumas commented 5 years ago

Periodically we get the errors below. The number in parentheses, e.g. (275), is the line number in ttc_api_scraper.py that printed the log message.

2018-11-20 20:08:09,876 (275): Expecting value: line 1 column 1 (char 0)
2018-11-20 20:08:09,876 (276): <generator object ClientResponse.text at 0x7f851c4ecba0>
2018-11-20 20:08:09,886 (775): Attempt to decode JSON with unexpected mimetype: text/html
2018-11-20 20:08:09,886 (274): Malformed JSON for station 52 on line 2
2018-11-20 20:08:09,886 (275): Expecting value: line 1 column 1 (char 0)
2018-11-20 20:08:09,886 (276): <generator object ClientResponse.text at 0x7f851c4d7f10>
2018-11-20 20:08:09,967 (775): Attempt to decode JSON with unexpected mimetype: text/html
2018-11-20 20:08:09,968 (274): Malformed JSON for station 59 on line 2
2018-11-20 20:08:09,968 (275): Expecting value: line 1 column 1 (char 0)
2018-11-20 20:08:09,968 (276): <generator object ClientResponse.text at 0x7f851c4e72b0>
2018-11-20 20:08:10,034 (775): Attempt to decode JSON with unexpected mimetype: text/html
2018-11-20 20:08:10,034 (274): Malformed JSON for station 66 on line 4
2018-11-20 20:08:10,034 (275): Expecting value: line 1 column 1 (char 0)
2018-11-20 20:08:10,034 (276): <generator object ClientResponse.text at 0x7f851c4ec620>
2018-11-20 20:08:10,059 (775): Attempt to decode JSON with unexpected mimetype: text/html
2018-11-20 20:08:10,059 (274): Malformed JSON for station 65 on line 4
2018-11-20 20:08:10,059 (275): Expecting value: line 1 column 1 (char 0)
2018-11-20 20:08:10,059 (276): <generator object ClientResponse.text at 0x7f851c4ec360>
2018-11-20 20:08:10,075 (775): Attempt to decode JSON with unexpected mimetype: text/html
2018-11-20 20:08:10,076 (274): Malformed JSON for station 60 on line 2
2018-11-20 20:08:10,076 (275): Expecting value: line 1 column 1 (char 0)
2018-11-20 20:08:10,076 (276): <generator object ClientResponse.text at 0x7f851c4e7518>
2018-11-20 20:08:10,098 (775): Attempt to decode JSON with unexpected mimetype: text/html
2018-11-20 20:08:10,098 (274): Malformed JSON for station 58 on line 2
2018-11-20 20:08:10,098 (275): Expecting value: line 1 column 1 (char 0)
2018-11-20 20:08:10,098 (276): <generator object ClientResponse.text at 0x7f851c4dffc0>
2018-11-20 20:08:10,099 (775): Attempt to decode JSON with unexpected mimetype: text/html
2018-11-20 20:08:10,099 (274): Malformed JSON for station 7 on line 1
2018-11-20 20:08:10,100 (275): Expecting value: line 1 column 1 (char 0)
2018-11-20 20:08:10,100 (276): <generator object ClientResponse.text at 0x7f851c4ecf10>

However, the scraper then sleeps for 2 s for that station and tries up to 3 more times, and it typically succeeds.
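For readers unfamiliar with the retry behaviour described above, it could be sketched roughly as follows. This is not the code in ttc_api_scraper.py: `fetch_with_retries` and the `fetch` coroutine it wraps are hypothetical names, and the only details taken from this thread are the 2-second sleep and the 3 extra attempts.

```python
import asyncio


async def fetch_with_retries(fetch, station_id, line_id, retries=3, delay=2.0):
    """Await the hypothetical `fetch(station_id, line_id)` coroutine function,
    retrying when the response body cannot be decoded as JSON."""
    last_error = None
    for attempt in range(1 + retries):  # one initial try plus `retries` more
        try:
            return await fetch(station_id, line_id)
        except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
            last_error = err
            if attempt < retries:
                # Back off before asking the server about this station again.
                await asyncio.sleep(delay)
    # Every attempt failed: surface the last decoding error to the caller.
    raise last_error
```

Because the sleep is awaited per station, other stations being polled concurrently are not blocked while one station backs off.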

rvilim commented 5 years ago

(Posted this on Slack, but reposting here for completeness)

I've seen this before, and it is implicitly solved in #50. What happens is that the TTC server returns a 404 every now and then (it's kind of flaky), which gets rendered as HTML. #50 solved this by letting it error and then catching a ClientResponseError. We could probably be fancier about it, but to what end?
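As an illustration of what a more descriptive error could look like, here is a minimal sketch that checks the status and content type before decoding. It is not the fix from #50: `StationResponseError` and `parse_station_response` are hypothetical names, and the function takes plain values so it runs without a network. With aiohttp, those values would come from `response.status`, `response.content_type`, and `await response.text()`. The `<generator object ClientResponse.text ...>` lines in the log above suggest the text method was being logged without being awaited, which may be why the body never showed up in the message.

```python
import json


class StationResponseError(Exception):
    """Hypothetical error type that carries the station and line in its message."""


def parse_station_response(status, content_type, body, station_id, line_id):
    """Decode a station response, raising a descriptive error on failure."""
    where = f"station {station_id} on line {line_id}"
    if status != 200:
        # e.g. the intermittent 404 that the server renders as an HTML page
        raise StationResponseError(f"HTTP {status} for {where}")
    if "json" not in content_type.lower():
        raise StationResponseError(
            f"Expected JSON but got {content_type} for {where}; "
            f"body starts with {body[:80]!r}"
        )
    try:
        return json.loads(body)
    except json.JSONDecodeError as err:
        raise StationResponseError(
            f"Malformed JSON for {where}: {err}; body starts with {body[:80]!r}"
        ) from err
```

Whether this is worth the extra code is the open question in the comment above: catching `ClientResponseError` and retrying already handles the flaky 404s, so the sketch only improves what ends up in the log.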