City-Bureau / city-scrapers

Scrape, standardize and share public meetings from local government websites
https://cityscrapers.org
MIT License

0847 spider corrections advisory board #994

Closed cherdeman closed 3 years ago

cherdeman commented 3 years ago

Summary

Completed spider for Illinois Department of Corrections Advisory Board meetings.

Issue: #847

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

Questions

I keep getting generator objects back from my Scrapy request callbacks rather than the actual object that I want, particularly pdf_text. This is certainly a Scrapy usage issue. Do y'all have input on how I can either get the behavior I expect or rework this to be more in line with the package?
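For context on the generator behavior described here, a minimal sketch in plain Python (no Scrapy needed) of why this happens: any function containing `yield` is a generator function, so calling it directly returns a generator object instead of running its body. The names `parse_pdf`, `parse_wrong`, and `parse_chained` are hypothetical and are not taken from this PR's spider.

```python
import types


def parse_pdf(response_text):
    # Hypothetical callback: the `yield` makes this a generator function.
    yield response_text.upper()


def parse_wrong(link_text):
    # Calling the callback directly does not run its body;
    # it only builds a generator object.
    pdf_text = parse_pdf(link_text)
    return pdf_text


def parse_chained(link_text):
    # Re-yield whatever the callback produces so the caller receives values.
    yield from parse_pdf(link_text)


wrong = parse_wrong("agenda")
print(isinstance(wrong, types.GeneratorType))  # True
print(list(parse_chained("agenda")))  # ['AGENDA']
```

In a Scrapy spider the analogous approach is usually to yield a `scrapy.Request` with a `callback` and let that callback yield the finished item, passing along already-scraped values through `cb_kwargs` or `meta`, rather than trying to read a return value from the request.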

Edit 12/31: All tests are passing, but I'm getting a timezone warning. I'm not sure how big a concern that should be.

cherdeman commented 3 years ago

Thanks for making these updates! I left some comments, but this is coming along really well.

Thanks for the review @pjsier! I think I've changed or responded to all of your comments - let me know if there are other issues to resolve.

pjsier commented 3 years ago

Thanks for the changes! Hopefully last thing, I just noticed that there's an error in the output logs for running the scraper on the current site. More info here https://github.com/City-Bureau/city-scrapers/runs/1652131797#step:10:346, but it should show up in the output if you merge the latest changes from main (to hide the pdfminer logs) and run the scraper

cherdeman commented 3 years ago

Ah yup, I see. It looks like most of these are coming from missing times. Do you have a preference for making 10:30am or 12:00am the default start? The others usually start at 10:30am, though maybe we'd rather make it clear that it's a guess by using 12:00am. I'll make the end 1.5 hours later.

pjsier commented 3 years ago

Right now any meeting with None as the end time will automatically have it set to 2 hours after the start, so that part's fine to ignore. For the start time, it's up to you! If it seems like 10:30 is a safe assumption, that works for me; otherwise midnight is probably best.
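A rough sketch of the fallback discussed above, assuming the date arrives as text like "January 14, 2021" and the time, when present, as text like "10:30am". The helper names `build_start` and `effective_end` and both constants are hypothetical; `effective_end` only illustrates the 2-hour default described in this comment and is not the project's actual code.

```python
from datetime import datetime, time, timedelta

DEFAULT_START = time(0, 0)  # assumption: midnight signals "time unknown"
FALLBACK_DURATION = timedelta(hours=2)  # illustrates the 2-hour default for a None end


def build_start(date_str, time_str=None):
    """Combine a date with a parsed time, or fall back to DEFAULT_START."""
    day = datetime.strptime(date_str, "%B %d, %Y").date()
    if time_str:
        parsed = datetime.strptime(time_str.strip().lower(), "%I:%M%p").time()
    else:
        parsed = DEFAULT_START
    return datetime.combine(day, parsed)


def effective_end(start, end=None):
    """Return the given end, or start plus the fallback duration when end is None."""
    return end if end is not None else start + FALLBACK_DURATION


start = build_start("January 14, 2021")
print(start)  # 2021-01-14 00:00:00
print(effective_end(start))  # 2021-01-14 02:00:00
```

Switching `DEFAULT_START` to `time(10, 30)` would encode the "usually 10:30" assumption instead; midnight keeps it obvious that the time was not found on the page.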