City-Bureau / city-scrapers

Scrape, standardize and share public meetings from local government websites
https://cityscrapers.org
MIT License

0672 spider chi northwest home equity #966

Open SubtleHyperbole opened 4 years ago

SubtleHyperbole commented 4 years ago

Summary

Issue: #672

Replace "ISSUE_NUMBER" with the number of your issue so that GitHub will link this pull request with the issue and make review easier.

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

Questions

I am such a newb when it comes to using GitHub. After royally screwing things up not once but twice, and having to delete the entire repository, start over, and re-add my files, I think I've got it down this time.

One odd thing, though, and it's why I included the double asterisk (**) next to "All tests are passing": for some reason, this final time that I forked and re-cloned the repository onto my machine, it started throwing errors for three spiders (unrelated to the one I'm working on), all of which failed with exceptions while trying to import a library called pdfminer. All three spiders use these two import lines:

    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

These lines threw exceptions saying the library doesn't exist, even though I have pdfminer installed both in my main Python environment and in the pipenv shell. For some reason this suddenly became an issue. I had to comment out the two lines in the three spiders just to get the setup of my fork to finish. I then restored the lines, but later, when it came time to run pytest, it threw errors for those three spiders again.
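A minimal way to check whether the imports resolve inside the pipenv shell is a standalone script like this (a sketch, assuming the pdfminer.six package, which is what provides pdfminer.high_level in recent versions):

    # Quick check that the pdfminer imports resolve; run inside `pipenv shell`.
    # Assumes pdfminer.six is installed (it provides pdfminer.high_level).
    import io

    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    def pdf_to_text(path):
        """Extract plain text from a PDF the way those spiders do."""
        out = io.StringIO()
        with open(path, "rb") as f:
            extract_text_to_fp(f, out, laparams=LAParams())
        return out.getvalue()

    print(pdf_to_text("example.pdf"))  # any local PDF file

If that script fails inside the pipenv shell but works in the main environment, the dependency is probably missing from the Pipfile rather than from the system.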

Anyhoo, that's a long-winded way of saying that what I mean by the ** in the checklist is: yes, all of MY tests for the new spider are passing. Someone else's spiders that use pdfminer, however, are suddenly having trouble. The three spiders are:

il_pollution_control, chi_human_relations, cook_emergency_telephone

(None of those three spider files or tests were added to my commit/push anyway; I'm just bringing it up here because I thought it was an odd error, especially since it didn't happen when I did this a day or so ago. It seems to have nothing to do with what I'm working on.)

SubtleHyperbole commented 4 years ago

I am confused about the "status" and "id" tests in the template -- no parse functions for these exist in the template like they do for every other field. Are they necessary? And where would they even come from? "status" sounds like the same information that's contained in "description", and this particular site doesn't give each meeting a unique ID other than the date, time, and title...

pjsier commented 4 years ago

@SubtleHyperbole to your question about ID and status: we've got more information on those in our docs, but in general they're part of the Open Civic Data Event specification that we're following. status in particular is a more structured way of tracking whether meetings are cancelled, upcoming, or in the past.
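In practice, neither field is scraped. Roughly how the template wires them up, as a sketch assuming the CityScrapersSpider base class from city_scrapers_core (the field values here are placeholders, not parsed from anything):

    from datetime import datetime

    from city_scrapers_core.constants import COMMISSION
    from city_scrapers_core.items import Meeting
    from city_scrapers_core.spiders import CityScrapersSpider

    class ChiNorthwestHomeEquitySpider(CityScrapersSpider):
        name = "chi_northwest_home_equity"
        agency = "Northwest Home Equity Assurance Program"
        timezone = "America/Chicago"
        start_urls = ["https://nwheap.com/"]  # placeholder

        def parse(self, response):
            # Placeholder values; a real spider parses these from the page
            meeting = Meeting(
                title="Governing Commissioners",
                description="",
                classification=COMMISSION,
                start=datetime(2020, 9, 17, 18, 30),
                end=None,
                all_day=False,
                time_notes="",
                location={"name": "", "address": ""},
                links=[],
                source=response.url,
            )
            # status is derived from the start time plus any cancellation
            # text passed in, not scraped from the site...
            meeting["status"] = self._get_status(meeting)
            # ...and id is built from the spider name, start time, and title
            meeting["id"] = self._get_id(meeting)
            yield meeting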

SubtleHyperbole commented 4 years ago

So the successive requests are actually only needed to get links for files posted AFTER meetings have already happened (like meeting minutes).

Following the restructuring of the website itself (which my email to their admin unintentionally caused), the only place that lists future meeting dates is the sidebar on the right, which appears on several different pages.

This sidebar contains (at least right now) the next three future meetings and 5-6 past meetings. While each meeting in the sidebar links to its own page, that page contains no information that isn't already in the sidebar.

The only information I figured would be worth scraping is the meeting minutes and other uploaded files for those 5-6 PAST meetings in the sidebar. For some reason, the sidebar link for each meeting doesn't actually lead to those files. Instead, a separate page for each meeting is listed on that nwheap meeting-minutes-and-agenda main page, along with multiple additional pages in groups of ten.

So here's the thing: since there are only 5-6 meetings listed in that right sidebar now, all of their meeting-minutes pages appear on the first page of results. As it stands, I iterate through all the existing back pages, but those contain files for past meetings that go way, way further back. I thought maybe I should use that list to generate the list of meetings, but unfortunately, other than the date, it doesn't give any information except the associated links.

So should I just ditch the iterative page scroll, given that, at least for right now, nothing except the first page holds any data that ends up in a yielded meeting?
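For reference, the iterative page scroll I'm describing is basically this (a sketch: the CSS selectors are guesses at the theme's markup, and `self.link_date_map` is assumed to be initialized in `__init__`):

    def _parse_minutes_index(self, response):
        """Collect file links from one page of the minutes/agenda archive."""
        for link in response.css("article h2 a"):  # selector is a guess
            date_text = link.css("::text").get("")
            self.link_date_map[date_text] = link.attrib["href"]
        # The iterative page scroll: follow the "older posts" link if present,
        # even though only the first page currently matches the sidebar
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self._parse_minutes_index)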

SubtleHyperbole commented 4 years ago

Also, I'm still not sure where _id and _status come from, but as long as they're not something I'm supposed to be scraping from the website, that's good by me.

SubtleHyperbole commented 4 years ago

Also, I think my first comment might be confusing:

    # Now moving onto the main parse of the meetings list

should really say something more like

    # First, create a list of all potential meeting-minutes/associated-files
    # pages. Later, in the main parse, we will cycle through this list to see
    # if there is an associated files page containing things like meeting
    # minutes. Only past meetings have these, and even then not every time.

pjsier commented 4 years ago

@SubtleHyperbole thanks for the clarification, but I think we'll still want to use Scrapy requests here, even if we're chaining together requests that aren't pulling links directly from a page. Here's an example of where we're doing this on il_elections.
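The general shape of that pattern looks something like this (a simplified sketch, not the actual il_elections spider; the class name, selectors, and URLs are illustrative only):

    import scrapy
    from city_scrapers_core.spiders import CityScrapersSpider

    class ChainedRequestsSpider(CityScrapersSpider):
        """Sketch of chaining requests that aren't links scraped off a page."""

        name = "example"
        agency = "Example Agency"
        timezone = "America/Chicago"
        start_urls = ["https://nwheap.com/meeting-minutes-and-agenda/"]

        def parse(self, response):
            # First request: map each archive entry's text to its detail URL
            link_map = {
                link.css("::text").get(""): link.attrib["href"]
                for link in response.css("article h2 a")
            }
            # Second request: a hardcoded URL for the page with the sidebar,
            # carrying along the data collected so far via cb_kwargs
            yield scrapy.Request(
                "https://nwheap.com/",
                callback=self._parse_meetings,
                cb_kwargs={"link_map": link_map},
                dont_filter=True,
            )

        def _parse_meetings(self, response, link_map):
            # Match sidebar meetings against link_map and yield Meeting items
            ...

Keeping everything inside Scrapy requests (rather than fetching pages some other way) means the scheduler, caching, and error handling still apply to every page we touch.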