My first thought was that the new scraper might only be keeping one record per request-station-line-direction combination, but a quick look at a single request ruled that out.
I then wondered whether not all lines or stations were being returned, or whether the new scraper was failing on some minutes.
All stations and lines appear in the new data; there are just consistently half the expected number of records. The first query below compares counts per line and station over the same hour; the second compares counts per minute.
```sql
WITH new_data AS (
    SELECT lineid, stationid, COUNT(1) AS new_cnts
    FROM new_data.requests
    NATURAL JOIN new_data.ntas_data
    WHERE create_date >= '2019-03-25 16:00' AND create_date < '2019-03-25 17:00'
    GROUP BY lineid, stationid
), old_data AS (
    SELECT lineid, stationid, COUNT(1) AS old_cnts
    FROM public.requests
    NATURAL JOIN public.ntas_data
    WHERE create_date >= '2019-03-25 16:00' AND create_date < '2019-03-25 17:00'
    GROUP BY lineid, stationid
)
SELECT lineid, stationid, new_cnts, old_cnts
FROM new_data
NATURAL JOIN old_data;
```
```sql
WITH new_data AS (
    SELECT date_trunc('minute', request_date) AS bin,
           COUNT(DISTINCT stationid) AS new_stations, COUNT(1) AS new_cnts
    FROM new_data.requests
    NATURAL JOIN new_data.ntas_data
    WHERE create_date >= '2019-03-25 16:00' AND create_date < '2019-03-25 17:00'
    GROUP BY bin
), old_data AS (
    SELECT date_trunc('minute', request_date) AS bin,
           COUNT(DISTINCT stationid) AS old_stations, COUNT(1) AS old_cnts
    FROM public.requests
    NATURAL JOIN public.ntas_data
    WHERE create_date >= '2019-03-25 16:00' AND create_date < '2019-03-25 17:00'
    GROUP BY bin
)
SELECT bin, new_stations, old_stations, new_cnts, old_cnts
FROM new_data
NATURAL JOIN old_data;
```
For each minute of the same hour, fewer stations are being scraped, but (given the result above) the missing stations appear to be randomly and uniformly distributed.
Conclusion: the serverless scraper is consistently failing to reach a random sample of stations.
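One caveat: the `NATURAL JOIN` in the final `SELECT` only keeps minute bins present in both datasets, so any minute the new scraper skipped entirely would silently disappear from the comparison. A sketch of a variant that surfaces those gaps (reusing the two CTEs above; only the final `SELECT` changes):

```sql
-- Same CTEs as the per-minute query above; only the final SELECT changes.
-- A FULL OUTER JOIN keeps minute bins missing from either side, so minutes
-- the new scraper skipped show up with NULL new_* counts.
SELECT bin, new_stations, old_stations, new_cnts, old_cnts
FROM new_data
FULL OUTER JOIN old_data USING (bin)
ORDER BY bin;
```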
So the scraper runs only every other minute, which is why it produces less data.
The cron listings need to be updated.
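For illustration only (the actual schedule lives in the serverless deployment config, and `run-scraper` is a placeholder name), the difference in crontab terms:

```
# Hypothetical crontab-style illustration; names are placeholders.
*/2 * * * * run-scraper   # current: fires only every other minute
* * * * *   run-scraper   # intended: fires every minute
```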
Comparing 2019-03-25 from the old Python scraper with the same date from the serverless one. I had to convert the folder of JSON files into a CSV using the `jsons_to_csv()` function from fetch_s3.py.
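For reference, the conversion is roughly of this shape. This is a minimal sketch, assuming each .json file holds a list of flat record dicts; the actual `jsons_to_csv()` in fetch_s3.py may differ:

```python
import csv
import json
from pathlib import Path

def jsons_to_csv(json_dir, out_csv):
    """Flatten a folder of scraped .json files into one CSV.

    Assumes each file contains a list of flat dicts sharing the same
    keys; the real fetch_s3.py implementation may differ.
    """
    rows = []
    for path in sorted(Path(json_dir).glob("*.json")):
        with open(path) as f:
            rows.extend(json.load(f))
    if not rows:
        return
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```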
| request_ts          | old_ids | new_ids |
|---------------------|--------:|--------:|
| 2019-03-25 00:00:00 |  23,985 |         |
| 2019-03-25 01:00:00 |  21,499 |         |
| 2019-03-25 04:00:00 |       2 |         |
| 2019-03-25 05:00:00 |  10,433 |   6,212 |
| 2019-03-25 06:00:00 |  24,546 |  13,126 |
| 2019-03-25 07:00:00 |  25,393 |  13,575 |
| 2019-03-25 08:00:00 |  25,444 |  13,509 |
| 2019-03-25 09:00:00 |  24,943 |  13,395 |
| 2019-03-25 10:00:00 |  25,279 |  13,582 |
| 2019-03-25 11:00:00 |  25,342 |  13,489 |
| 2019-03-25 12:00:00 |  24,901 |  13,393 |
| 2019-03-25 13:00:00 |  25,218 |  13,415 |
| 2019-03-25 14:00:00 |  24,899 |  13,355 |
| 2019-03-25 15:00:00 |  25,107 |  13,355 |
| 2019-03-25 16:00:00 |  25,061 |  13,488 |
| 2019-03-25 17:00:00 |  25,301 |  13,551 |
| 2019-03-25 18:00:00 |  25,309 |  13,520 |
| 2019-03-25 19:00:00 |  24,948 |  13,534 |
| 2019-03-25 20:00:00 |  25,401 |  13,605 |
| 2019-03-25 21:00:00 |  24,932 |  13,438 |
| 2019-03-25 22:00:00 |  24,393 |  13,051 |
| 2019-03-25 23:00:00 |  23,879 |  12,854 |