bendiller closed this 4 years ago
Oh, and I almost forgot - I have a couple more considerations/notes for you. Two things:
I'm not sure why that's failing. The Travis CI details show that one of the `test_links()` assertions comes back false, but all the tests pass on my machine, and the test HTML file containing those links does still contain the link that threw the error. And actually, on closer inspection, the py3.5 build fails on a different link than the others. Strange.
I haven't implemented scraping the Agenda links yet, but I wanted to push the fixes for the other comments for your review in the meantime. Of particular note: these changes removed ~60% of the overall `Meeting` objects and largely removed the need for some of the logic, particularly in `parse_source()` and `parse_location()`. Several of the looked-for conditions no longer appear in the remaining `Meeting` objects. I wanted to get your approval before removing those pieces of logic, however. It's possible they could still be useful in the future, but not with the current data set. Best to get them out of there for now and reconsider if it causes a problem down the road? Or leave them in?
Okay, I'm gonna stop spamming attempts on this thing here, sorry for that - I'll just wait for feedback from you. Each one of these commits has passed all tests on my end. I tried to remove the dependency on `FakeDate`, but either something else is going on or I just don't understand it well enough. I don't understand CI well enough to have any intuition about why this is failing.
Worth mentioning - I'm testing the links completely independently from the rest of the parsing to keep it as simple and atomic as possible; it's really only testing the results of the `_parse_minutes` method. Those links are not being associated with `Meeting` objects. Maybe I need to do that.
Anyway, like I said, I'm worried you're getting spammed with notifications about this so in case that's true, I'm going to drop this until I hear from you.
Looks like it's probably the last point you mentioned about `_parse_minutes` not being called. Since that isn't called, you'll probably need to have a line that runs `spider._parse_minutes` right before it calls `_parse_json` when setting up the fixtures (you won't need to assign the result, since `_parse_minutes` doesn't return a value).
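As a rough illustration of that fixture ordering - using a minimal stand-in class and made-up field names, not the real `il_pollution_control` spider:

```python
import json


class FakeSpider:
    """Minimal stand-in for the real spider; all names here are illustrative."""

    def __init__(self):
        self.link_map = {}

    def _parse_minutes(self, response):
        # Populates link_map as a side effect; returns nothing.
        self.link_map.setdefault(response["date"], []).append(response["href"])

    def _parse_json(self, response):
        # Meeting parsing can now pick up any links gathered for each date.
        return [
            {"date": item["date"], "links": self.link_map.get(item["date"], [])}
            for item in json.loads(response["body"])
        ]


spider = FakeSpider()

# Run _parse_minutes first -- no assignment needed, since it returns no value...
spider._parse_minutes(
    {"date": "2019-01-15", "href": "https://example.com/minutes.pdf"}
)

# ...then build the fixtures from _parse_json as usual.
parsed_items = spider._parse_json(
    {"body": json.dumps([{"date": "2019-01-15"}, {"date": "2019-02-19"}])}
)
```

With that ordering, the first parsed item carries the minutes link and the second has an empty link list.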
Well, `_parse_minutes` does get called - but I opted to call it within `test_links`, since that's the only test that utilizes it. I wonder if that's the problem - maybe I need to move it outside. If that doesn't fix it alone, then I'll run the `_parse_links` method to associate the links with `Meeting` objects.
I think I'm out of ideas on the CI problem for the moment. I tried moving the `_parse_minutes` call between the `freezer.start()` and `.stop()` calls, and I've also now implemented the actual stitching together of the separate sets of results (`Meeting` objects and links).
I'm really sorry for the million commits and pushes. Each of the pushes has passed all tests on my machine (and I've even deliberately broken the `sample_link` string to make it fail as a sanity check). And the strangest part is that the first push of the PR passed (well, the second push - the first had f-string problems).
The only other thing I'll mention is that some test files have the spider initialization between the `freezer.start()` and `.stop()` calls, though the automatic generation done by `genspider` puts it outside of that. I don't know offhand whether that could cause an issue with the way I'm collecting and storing links before associating them with `Meeting` objects, and I don't want to keep committing and pushing failing attempts, but I noticed it and thought it might be worth a mention.
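The distinction matters if the spider captures anything time-dependent at construction time. A dependency-free sketch of the two layouts, using a tiny stand-in for the real freezer (the actual tests use `freezegun`; these classes are illustrative only):

```python
class FakeFreezer:
    """Stand-in for freezegun's freezer; tracks only whether it is active."""

    def __init__(self):
        self.active = False

    def start(self):
        self.active = True

    def stop(self):
        self.active = False


class StubSpider:
    """A spider that captures state in __init__, the way one computing a
    date range at construction time would capture the frozen date."""

    def __init__(self, freezer):
        self.saw_frozen_time = freezer.active


freezer = FakeFreezer()

# genspider-style layout: spider built OUTSIDE the freeze window.
outside = StubSpider(freezer)

# Alternative layout seen in some test files: spider built INSIDE it.
freezer.start()
inside = StubSpider(freezer)
freezer.stop()
```

Only the spider constructed inside the window observes the frozen clock, so any `__init__`-time date logic behaves differently between the two layouts.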
Not a problem! My best guess is that it's related to there being several possible outcomes from `_parse_minutes`, including potential network requests, but for now I think it's safe to set `link_map` manually in the test setup and just verify that `_parse_links` is working from that. Once that and the other comments are resolved, this should be good to go.
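A hypothetical sketch of that test setup - seeding `link_map` by hand instead of exercising `_parse_minutes` and its potential network requests (the `StubSpider` class and field names are assumptions, not the actual implementation):

```python
from datetime import date


class StubSpider:
    """Illustrative stand-in exposing only the pieces under test."""

    def __init__(self):
        self.link_map = {}

    def _parse_links(self, meeting_date):
        # Look up whatever minutes links were gathered for this meeting's date.
        return self.link_map.get(meeting_date, [])


spider = StubSpider()

# Seed the map manually in the test setup...
spider.link_map[date(2019, 1, 15)] = [
    {"href": "https://example.com/minutes/2019-01-15.pdf", "title": "Minutes"}
]

# ...and verify _parse_links works from it, in isolation.
links = spider._parse_links(date(2019, 1, 15))
missing = spider._parse_links(date(2019, 2, 19))
```

This keeps the link-association logic testable without any fixtures for the minutes-directory crawl itself.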
I believe all the comments and requests are complete at this point. My only remaining question is whether we ought to delete `il_pollution_control.html` from the test files. It's no longer used - it was there to be scraped for the links test, but it's no longer being read by anything.
Thanks!
Thanks for this @bendiller! Merging now
Hi Patrick -
I believe this spider is ready for your review. This one was a little tricky! I had to implement something that I didn't see in any of the other spiders, which gives me some pause. I'll be happy to discuss this with you and re-work that part if you have a more preferred way to solve the problem.
The difficulty came from the fact that the links for the meeting minutes came from a separate URL, one which required diving down a few levels to get to the PDFs. Traversing this directory structure yielded a bunch of `Request`s, which would naturally get scheduled and run long after the primary URL was processed and turned into `Meeting` objects. Normally it'd be easy enough to combine these disparate data in an Item Pipeline, but since this project doesn't appear to use one-off pipelines, I looked for other possibilities, eventually settling on connecting to the `spider_idle` signal and using that to trigger parsing of the primary URL (with the bulk of the meeting data) once all the meeting-minutes links had been discovered and stored. This allowed me to associate the links with the `Meeting` objects within the spider. It works well and I haven't found any problems with this implementation.

Anyway, sorry for the novel - I hope I explained it well enough. The code isn't as convoluted as I've made it sound. If you've encountered a problem like this before and would like me to approach the solution differently, please give me some guidance on how and I'll be happy to accommodate.
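To make the control flow concrete, here is a dependency-free toy sketch of the `spider_idle` idea: a `ToyScheduler` stands in for Scrapy's engine and signal machinery, the minutes requests drain first, and the idle callback then schedules the primary-URL parse so meetings can be joined against the collected links. All names are illustrative, not the real spider's code.

```python
from collections import deque


class ToyScheduler:
    """Toy stand-in for Scrapy's engine: drains a queue of callbacks, then
    fires a one-shot 'idle' handler that may enqueue more work."""

    def __init__(self):
        self.queue = deque()
        self.idle_handler = None
        self._idle_fired = False

    def connect_idle(self, handler):
        self.idle_handler = handler

    def run(self):
        while True:
            while self.queue:
                callback = self.queue.popleft()
                callback()
            if self.idle_handler and not self._idle_fired:
                self._idle_fired = True
                self.idle_handler()  # may enqueue more work
            else:
                break


class ToySpider:
    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.link_map = {}
        self.meetings = []
        scheduler.connect_idle(self.on_idle)

    def crawl_minutes(self, minutes_pages):
        # Schedule the deep minutes-directory "requests" first.
        for page in minutes_pages:
            self.scheduler.queue.append(lambda p=page: self._parse_minutes(p))

    def _parse_minutes(self, page):
        self.link_map.setdefault(page["date"], []).append(page["href"])

    def on_idle(self):
        # All minutes links are in; now parse the primary URL.
        self.scheduler.queue.append(self._parse_primary)

    def _parse_primary(self):
        for meeting_date in ["2019-01-15", "2019-02-19"]:
            self.meetings.append(
                {"date": meeting_date, "links": self.link_map.get(meeting_date, [])}
            )


scheduler = ToyScheduler()
spider = ToySpider(scheduler)
spider.crawl_minutes([{"date": "2019-01-15", "href": "https://example.com/a.pdf"}])
scheduler.run()
```

In real Scrapy the same shape would use `crawler.signals.connect(..., signal=signals.spider_idle)` and schedule the deferred request from the handler; the sketch above only shows the ordering guarantee that makes the link association possible.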
Thanks!