CivicTechTO / ttc_subway_times

A scraper to grab and publish TTC subway arrival times.
GNU General Public License v3.0

serverless scraper appears to consistently produce less data #61

Closed: radumas closed this issue 5 years ago

radumas commented 5 years ago

Comparing 2019-03-25 from the old Python scraper with the same date from the serverless one.

I had to convert the folder of JSON files into a CSV using the jsons_to_csv() function from fetch_s3.py.
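Roughly how the resulting CSVs could be loaded into Postgres for the comparison below; the file names (and the assumption of one CSV per table) are placeholders, while requests_serverless and ntas_data_serverless are the tables queried next.

```sql
-- Hypothetical load of the converted CSVs into the serverless tables
-- compared below; file names are placeholders.
\copy requests_serverless FROM 'requests_2019-03-25.csv' CSV HEADER
\copy ntas_data_serverless FROM 'ntas_data_2019-03-25.csv' CSV HEADER
```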

WITH new_data AS (
    SELECT date_trunc('hour', create_date) AS request_ts, COUNT(id) AS new_ids
    FROM requests_serverless
    RIGHT OUTER JOIN ntas_data_serverless serverless
        ON serverless.requestid = requests_serverless.requestid
    WHERE create_date >= '2019-03-25' AND create_date < '2019-03-26'
    GROUP BY request_ts
),
old_data AS (
    SELECT date_trunc('hour', requests.create_date) AS request_ts, COUNT(id) AS old_ids
    FROM requests
    RIGHT OUTER JOIN ntas_data USING (requestid)
    WHERE requests.create_date >= '2019-03-25' AND requests.create_date < '2019-03-26'
    GROUP BY request_ts
)
SELECT request_ts, old_ids, new_ids
FROM new_data
RIGHT OUTER JOIN old_data USING (request_ts)
ORDER BY request_ts;

| request_ts          | old_ids | new_ids |
|---------------------|---------|---------|
| 2019-03-25 00:00:00 | 23,985  |         |
| 2019-03-25 01:00:00 | 21,499  |         |
| 2019-03-25 04:00:00 | 2       |         |
| 2019-03-25 05:00:00 | 10,433  | 6,212   |
| 2019-03-25 06:00:00 | 24,546  | 13,126  |
| 2019-03-25 07:00:00 | 25,393  | 13,575  |
| 2019-03-25 08:00:00 | 25,444  | 13,509  |
| 2019-03-25 09:00:00 | 24,943  | 13,395  |
| 2019-03-25 10:00:00 | 25,279  | 13,582  |
| 2019-03-25 11:00:00 | 25,342  | 13,489  |
| 2019-03-25 12:00:00 | 24,901  | 13,393  |
| 2019-03-25 13:00:00 | 25,218  | 13,415  |
| 2019-03-25 14:00:00 | 24,899  | 13,355  |
| 2019-03-25 15:00:00 | 25,107  | 13,355  |
| 2019-03-25 16:00:00 | 25,061  | 13,488  |
| 2019-03-25 17:00:00 | 25,301  | 13,551  |
| 2019-03-25 18:00:00 | 25,309  | 13,520  |
| 2019-03-25 19:00:00 | 24,948  | 13,534  |
| 2019-03-25 20:00:00 | 25,401  | 13,605  |
| 2019-03-25 21:00:00 | 24,932  | 13,438  |
| 2019-03-25 22:00:00 | 24,393  | 13,051  |
| 2019-03-25 23:00:00 | 23,879  | 12,854  |
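To put a number on "consistently less", the same two CTEs can be reused to express each hour's serverless count as a share of the old scraper's count (a sketch; tables and columns as in the query above):

```sql
-- Same hourly CTEs as above, but expressing the serverless count as a
-- share of the old scraper's count for each hour.
WITH new_data AS (
    SELECT date_trunc('hour', create_date) AS request_ts, COUNT(id) AS new_ids
    FROM requests_serverless
    RIGHT OUTER JOIN ntas_data_serverless serverless
        ON serverless.requestid = requests_serverless.requestid
    WHERE create_date >= '2019-03-25' AND create_date < '2019-03-26'
    GROUP BY request_ts
),
old_data AS (
    SELECT date_trunc('hour', requests.create_date) AS request_ts, COUNT(id) AS old_ids
    FROM requests
    RIGHT OUTER JOIN ntas_data USING (requestid)
    WHERE requests.create_date >= '2019-03-25' AND requests.create_date < '2019-03-26'
    GROUP BY request_ts
)
SELECT request_ts,
       round(new_ids::numeric / NULLIF(old_ids, 0), 2) AS new_share_of_old
FROM new_data
JOIN old_data USING (request_ts)
ORDER BY request_ts;
```

For the hours where both scrapers were running, the counts above put that share at roughly 0.53 to 0.60, i.e. a bit more than half.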

radumas commented 5 years ago

So I thought this might be because the new scraper was only keeping one record per request-station-line-direction, but a quick look at one request showed that wasn't the case.

I wonder if not all lines or all stations are being returned? Could the new scraper be failing on some minutes?

Comparing stations & lines

All stations and lines appear in the new data; there are just consistently about half the expected number of records.

WITH new_data AS (
    SELECT lineid, stationid, COUNT(1) AS new_cnts
    FROM new_data.requests
    NATURAL JOIN new_data.ntas_data
    WHERE create_date >= '2019-03-25 16:00' AND create_date < '2019-03-25 17:00'
    GROUP BY lineid, stationid
),
old_data AS (
    SELECT lineid, stationid, COUNT(1) AS old_cnts
    FROM public.requests
    NATURAL JOIN public.ntas_data
    WHERE create_date >= '2019-03-25 16:00' AND create_date < '2019-03-25 17:00'
    GROUP BY lineid, stationid
)
SELECT lineid, stationid, new_cnts, old_cnts
FROM new_data
NATURAL JOIN old_data;
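To double-check that nothing is missing outright, an anti-join over the same schemas and join pattern should come back empty if every line/station pair seen by the old scraper also shows up in the serverless data for that hour:

```sql
-- Line/station pairs seen by the old scraper in this hour but absent from
-- the serverless data; an empty result means no station or line is missing.
SELECT lineid, stationid
FROM public.requests
NATURAL JOIN public.ntas_data
WHERE create_date >= '2019-03-25 16:00' AND create_date < '2019-03-25 17:00'
EXCEPT
SELECT lineid, stationid
FROM new_data.requests
NATURAL JOIN new_data.ntas_data
WHERE create_date >= '2019-03-25 16:00' AND create_date < '2019-03-25 17:00';
```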

Comparing minutes of the hour

WITH new_data AS (
    SELECT date_trunc('minute', request_date) AS bin,
           COUNT(DISTINCT stationid) AS new_stations,
           COUNT(1) AS new_cnts
    FROM new_data.requests
    NATURAL JOIN new_data.ntas_data
    WHERE create_date >= '2019-03-25 16:00' AND create_date < '2019-03-25 17:00'
    GROUP BY bin
),
old_data AS (
    SELECT date_trunc('minute', request_date) AS bin,
           COUNT(DISTINCT stationid) AS old_stations,
           COUNT(1) AS old_cnts
    FROM public.requests
    NATURAL JOIN public.ntas_data
    WHERE create_date >= '2019-03-25 16:00' AND create_date < '2019-03-25 17:00'
    GROUP BY bin
)
SELECT bin, new_stations, old_stations, new_cnts, old_cnts
FROM new_data
NATURAL JOIN old_data;

For each minute of the same hour, fewer stations are getting scraped, but which stations are missed looks randomly and uniformly distributed (given the above result).

Conclusion: the serverless scraper is consistently failing to grab what looks like a random sample of stations.
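One caveat with the per-minute query: the NATURAL JOIN on bin is an inner join, so a minute with no serverless rows at all simply drops out of the comparison rather than showing up as a gap. A FULL OUTER JOIN variant over the same tables would surface those minutes with NULL counts (a sketch, reusing the columns from the query above):

```sql
-- Per-minute comparison again, but with a FULL OUTER JOIN so that minutes
-- missing entirely from either side still show up (with NULL counts).
WITH new_minutes AS (
    SELECT date_trunc('minute', request_date) AS bin,
           COUNT(DISTINCT stationid) AS new_stations,
           COUNT(1) AS new_cnts
    FROM new_data.requests
    NATURAL JOIN new_data.ntas_data
    WHERE create_date >= '2019-03-25 16:00' AND create_date < '2019-03-25 17:00'
    GROUP BY bin
),
old_minutes AS (
    SELECT date_trunc('minute', request_date) AS bin,
           COUNT(DISTINCT stationid) AS old_stations,
           COUNT(1) AS old_cnts
    FROM public.requests
    NATURAL JOIN public.ntas_data
    WHERE create_date >= '2019-03-25 16:00' AND create_date < '2019-03-25 17:00'
    GROUP BY bin
)
SELECT bin, new_stations, old_stations, new_cnts, old_cnts
FROM new_minutes
FULL OUTER JOIN old_minutes USING (bin)
ORDER BY bin;
```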

radumas commented 5 years ago

So the scraper only runs every other minute, which is why it produces less data.
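A quick way to confirm the cadence from the data itself, assuming create_date sits on requests_serverless as the first query implies: count the distinct request minutes per hour; roughly 30 rather than 60 would line up with an every-other-minute schedule.

```sql
-- Distinct request minutes per hour in the serverless data; ~30/hour
-- would confirm the every-other-minute cron schedule.
SELECT date_trunc('hour', create_date) AS hr,
       COUNT(DISTINCT date_trunc('minute', create_date)) AS minutes_with_requests
FROM requests_serverless
WHERE create_date >= '2019-03-25' AND create_date < '2019-03-26'
GROUP BY hr
ORDER BY hr;
```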

The cron entries need to be updated.