City-Bureau / city-scrapers

Scrape, standardize and share public meetings from local government websites
https://cityscrapers.org

0805 spider il pollution control #882

Closed: bendiller closed 4 years ago

bendiller commented 4 years ago

Hi Patrick -

I believe this spider is ready for your review. This one was a little tricky! I had to implement something that I didn't see in any of the other spiders, which gives me some pause. I'll be happy to discuss this with you and re-work that part if you have a more preferred way to solve the problem.

The difficulty came from the fact that the meeting minutes links live at a separate URL, one that requires descending a few directory levels to reach the PDFs. Traversing this directory structure yielded a bunch of Requests, which would naturally get scheduled and run long after the primary URL was processed and turned into Meeting objects.

Normally it'd be easy enough to combine these disparate data in an Item Pipeline, but since this project doesn't appear to use one-off pipelines, I looked for other possibilities. I eventually settled on connecting to the spider_idle signal and using it to trigger parsing of the primary URL (with the bulk of the meeting data) only after all the meeting minutes links had been discovered and stored. This let me associate the links with the Meeting objects within the spider. It works well, and I haven't found any problems with this implementation.
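For reference, here's a minimal sketch of that pattern, with stand-in URLs and a link_map keyed by filename stem - the real spider's details differ, so treat every name here as an assumption:

```python
from scrapy import Request, Spider, signals
from scrapy.exceptions import DontCloseSpider

# Hypothetical stand-in URLs for illustration only
MINUTES_URL = "https://example.com/minutes/"
MEETINGS_URL = "https://example.com/meetings"


class IlPollutionControlSpider(Spider):
    name = "il_pollution_control"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.link_map = {}  # e.g. "2020-08-04" -> [PDF URLs]
        self._main_scheduled = False

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def start_requests(self):
        # First pass: traverse only the minutes directory tree
        yield Request(MINUTES_URL, callback=self._parse_minutes)

    def _parse_minutes(self, response):
        # Record PDF links and follow sub-directories; keying by filename
        # stem is a stand-in for however minutes are matched to meetings
        for href in response.css("a::attr(href)").getall():
            if href.lower().endswith(".pdf"):
                key = href.rsplit("/", 1)[-1].rsplit(".", 1)[0]
                self.link_map.setdefault(key, []).append(response.urljoin(href))
            elif href.endswith("/") and href not in ("../", "./"):
                yield response.follow(href, callback=self._parse_minutes)

    def spider_idle(self):
        # Fires once the scheduler is empty, i.e. after every minutes
        # request has finished; only then crawl the primary meetings page
        if not self._main_scheduled:
            self._main_scheduled = True
            self.crawler.engine.crawl(
                Request(MEETINGS_URL, callback=self.parse), self
            )  # note: Scrapy 2.10+ drops the spider argument to engine.crawl
            raise DontCloseSpider  # keep the spider alive for the new request

    def parse(self, response):
        # Build Meeting items here, attaching minutes from self.link_map
        ...
```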

Anyway, sorry for the novel - I hope I explained it well enough. The code isn't as convoluted as I've made it sound. If you've encountered a problem like this before and would like me to approach the solution differently, please give me some guidance and I'll be happy to accommodate.

Thanks!

bendiller commented 4 years ago

Oh, and I almost forgot - I have a couple more considerations/notes for you:

  1. I wasn't sure how to classify "hearings". I left them as NOT_CLASSIFIED, but perhaps BOARD is appropriate (see the sketch after this list).
  2. Locations - for meetings that listed both the Chicago and Springfield offices, I selected Chicago, believing it to be the main office. Let me know if you'd like something different.
  3. Locations - some were indeterminate; for these I assigned empty strings to "name" and "address".
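On point 1, a minimal sketch of what that classification choice might look like, using the real constants from city_scrapers_core; the title-matching rule is my assumption, not the spider's actual logic:

```python
from city_scrapers_core.constants import BOARD, NOT_CLASSIFIED


def _parse_classification(self, item):
    # Assumed heuristic: anything labeled a board meeting is BOARD;
    # hearings stay NOT_CLASSIFIED pending maintainer guidance
    title = (item.get("title") or "").lower()
    if "board" in title:
        return BOARD
    return NOT_CLASSIFIED
```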

bendiller commented 4 years ago

Two things:

  1. I'm not sure why that's failing. The Travis CI details show that one of the test_links() assertions fails, but all the tests pass on my machine, and the test HTML file containing those links does still include the link that threw the error. On closer inspection, the py3.5 build even fails on a different link than the others. Strange.

  2. I haven't implemented scraping the Agenda links yet, but I wanted to push the fixes for the other comments for your review in the meantime. Of particular note: these changes removed roughly 60% of the overall Meetings, which largely removed the need for some of the logic, particularly in parse_source() and parse_location() - several of the conditions they check for no longer appear in the remaining Meeting objects. I wanted your approval before removing those pieces, though. They could still be useful in the future, just not with the current data set. Is it best to take them out for now and reconsider if that causes a problem down the road, or leave them in?

bendiller commented 4 years ago

Okay, I'm going to stop spamming attempts on this thing - sorry for that - and just wait for feedback from you. Each of these commits has passed all tests on my end. I tried to remove the dependency on FakeDate, but either something else is going on or I just don't understand it well enough; I don't know CI well enough to have any intuition about why this is failing.

Worth mentioning: I'm testing the links completely independently from the rest of the parsing to keep it as simple and atomic as possible - it's really only testing the results of the _parse_minutes method. Those links are not being associated with Meeting objects; maybe I need to do that.

Anyway, like I said, I'm worried you're getting spammed with notifications about this, so in case that's true, I'm going to drop this until I hear from you.

pjsier commented 4 years ago

Looks like it's probably the last point you mentioned, about _parse_minutes not being called. Since it isn't called, you'll probably need a line that runs spider._parse_minutes right before the call to _parse_json when setting up the fixtures. Here's an example of that (except you won't need to assign the result, since it doesn't return a value):

https://github.com/City-Bureau/city-scrapers-akr/blob/f32b12bb9791082e199e7a428e6818037b3c4754/tests/test_akr_metro_transportation_study.py#L11-L27
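In the city-scrapers test style, that setup might look roughly like the following - fixture names, URLs, and the frozen date are placeholders, and _parse_minutes/link_map follow the hypothetical sketch above:

```python
from os.path import dirname, join

from city_scrapers_core.utils import file_response
from freezegun import freeze_time

from city_scrapers.spiders.il_pollution_control import IlPollutionControlSpider

# Placeholder fixture files and URLs
minutes_response = file_response(
    join(dirname(__file__), "files", "il_pollution_control_minutes.html"),
    url="https://example.com/minutes/",
)
test_response = file_response(
    join(dirname(__file__), "files", "il_pollution_control_meetings.html"),
    url="https://example.com/meetings",
)
spider = IlPollutionControlSpider()

freezer = freeze_time("2020-08-05")
freezer.start()

# Populate spider.link_map before parsing the main page; list() just
# consumes the generator, since there's no return value to assign
list(spider._parse_minutes(minutes_response))
parsed_items = [item for item in spider.parse(test_response)]

freezer.stop()
```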

bendiller commented 4 years ago

Well, _parse_minutes does get called - but I opted to call it within test_links, since that's the only test that uses it. I wonder if that's the problem - maybe I need to move it outside. If that alone doesn't fix it, I'll run the _parse_links method to associate the links with Meeting objects.

bendiller commented 4 years ago

I think I'm out of ideas on the CI problem for the moment. I tried moving the _parse_minutes call between the freezer.start() and freezer.stop() calls, and I've also now implemented the actual stitching together of the two separate sets of results (Meeting objects and links).
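For what it's worth, the stitching step might look something like this sketch - it assumes link_map is keyed by a date string, matching the earlier hypothetical sketch rather than the actual code:

```python
def _parse_links(self, start, item_links):
    """Merge links scraped from the meeting row with any minutes found
    during the first-pass directory crawl (date-keyed by assumption)."""
    links = list(item_links)
    for href in self.link_map.get(start.strftime("%Y-%m-%d"), []):
        links.append({"href": href, "title": "Minutes"})
    return links
```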

I'm really sorry for the million commits and pushes. Each of these pushes has passed all tests on my machine (I've even deliberately broken the sample_link string to make it fail, as a sanity check). And the strangest part is that the first push of the PR passed (well, the second push - the first had f-string problems).

The only other thing I'll mention is that some test files have the spider initialization between the freezer.start() and freezer.stop() calls, though the automatic generation done by genspider puts it outside of that. I don't know offhand whether that could cause an issue with the way I'm collecting and storing links before associating them with Meetings, and I don't want to keep committing and pushing failing attempts, but I noticed it and thought it was worth a mention.

pjsier commented 4 years ago

Not a problem! My best guess is that it's related to _parse_minutes having several possible outcomes, including potential network requests. For now I think it's safe to set link_map manually in the test setup and just verify that _parse_links works from that. Once that and the other comments are resolved, this should be good to go.
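A sketch of that manual setup, reusing the hypothetical _parse_links signature and date-string keys from the earlier sketches:

```python
from datetime import datetime

from city_scrapers.spiders.il_pollution_control import IlPollutionControlSpider


def test_links():
    spider = IlPollutionControlSpider()
    # Seed link_map by hand instead of running _parse_minutes
    spider.link_map = {
        "2020-08-04": ["https://example.com/minutes/2020-08-04.pdf"]
    }
    links = spider._parse_links(datetime(2020, 8, 4, 11, 0), item_links=[])
    assert links == [
        {"href": "https://example.com/minutes/2020-08-04.pdf", "title": "Minutes"}
    ]
```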

bendiller commented 4 years ago

I believe all the comments and requests are addressed at this point. My only remaining question is whether we ought to delete il_pollution_control.html from the test files. It's no longer used - it was there to be scraped for the links test, but nothing reads it anymore.

Thanks!

pjsier commented 4 years ago

Thanks for this @bendiller! Merging now