codeforsanjose / city-agenda-scraper


Quality Control for AHP_parser #11

Open krammy19 opened 3 years ago

krammy19 commented 3 years ago

It would be good to verify that the URLs we're pulling in are actually valid and return no errors.

Can someone please do a simple loop on the AHP_parser to request the sites and pull the status codes? If we're getting anything besides 200 codes, then we have some problems.
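The loop described above could be sketched roughly like this, using only the standard library. This is a hypothetical sketch, not code from the repo: `status_of` fetches one URL and returns its HTTP status (0 on connection failure), and `bad_statuses` filters a list of `(url, status)` pairs down to the problem cases (anything besides 200).

```python
import urllib.error
import urllib.request

def status_of(url, timeout=10):
    """Return the HTTP status code for url, or 0 if the request fails entirely."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # Server responded with an error status (404, 500, ...); report it.
        return e.code
    except (urllib.error.URLError, ValueError):
        # DNS failure, timeout, malformed URL, etc.
        return 0

def bad_statuses(pairs):
    """Keep only (url, status) pairs whose status is not 200."""
    return [(url, code) for url, code in pairs if code != 200]

# Example driver over a list of scraped URLs:
# results = [(url, status_of(url)) for url in urls]
# for url, code in bad_statuses(results):
#     print(code, url)
```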

xconnieex commented 3 years ago

Which URLs/sites are these? Is it this one: CA_city_websites_final.csv?

dineshkumar-23 commented 3 years ago

Hello, Could you please specify which URLs to check the status for?

dineshkumar-23 commented 3 years ago

Ok cool. Could you please specify the URLs to check the status? Is it one of the columns in this file 'CA_city_websites_final.csv'?

krammy19 commented 3 years ago

Sorry about the delay in responding! I'm talking about the URLs returned by the html-request scraper.

I would encourage you to try running the scraper on your own to find any issues, but you can also find the output on this Google Sheet: https://docs.google.com/spreadsheets/d/11offSYz2irnjI-9tILkcI-ClclRUZ0pyhXtPy-G4i8g/edit?usp=sharing

All columns besides CITY and CITY_URL need to be quality-checked.
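Collecting the URLs to check from a CSV export of that sheet might look like the sketch below. This is an assumption about the layout, not the repo's code: it reads rows as dicts and gathers every non-empty value outside the CITY and CITY_URL columns (the `AGENDA_URL` column name in the example is made up for illustration).

```python
import csv
import io

def urls_to_check(rows):
    """From scraper-output rows (dicts keyed by column name), collect the
    values of every column except CITY and CITY_URL."""
    skip = {"CITY", "CITY_URL"}
    urls = []
    for row in rows:
        for col, val in row.items():
            if col not in skip and val:
                urls.append(val)
    return urls

# Usage with a CSV export of the sheet:
# with open("scraper_output.csv", newline="") as f:
#     urls = urls_to_check(csv.DictReader(f))
```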

krammy19 commented 3 years ago

Update: html-request scraper 2 has been renamed to AHP_parser