City-Bureau / city-scrapers

Scrape, standardize and share public meetings from local government websites
https://cityscrapers.org
MIT License

Spider: Chicago Northwest Home Equity Assurance Program #672

Open pjsier opened 5 years ago

pjsier commented 5 years ago

URL: https://nwheap.com/category/meet-minutes-and-agendas/
Spider Name: chi_northwest_home_equity
Agency Name: Chicago Northwest Home Equity Assurance Program

See the contribution guide for information on how to get started

GeorgeDubuque commented 5 years ago

I would like to take this one!

pjsier commented 5 years ago

@GeorgeDubuque sorry, I missed this initially. Right now our policy is for people to take on one at a time, so feel free to start this or the O'Hare scraper and move on to the other once you're done

mingchan96 commented 4 years ago

Hi. For a class, my partner and I are looking for an issue to contribute to. Is this issue still up for grabs? If it isn't available, is there another open issue we could look at?

pjsier commented 4 years ago

@mingchan96 this is open, all yours if you're interested!

erikkristoferanderson commented 4 years ago

I'd like to claim this one, please.

mingchan96 commented 4 years ago

@ekand you can have it. My partner and I are currently busy with other projects.

pjsier commented 4 years ago

@ekand all yours!

erikkristoferanderson commented 4 years ago

@pjsier Thanks! I'll start by studying the contributors guide and try to have something in a pull request in two weeks.

erikkristoferanderson commented 4 years ago

@pjsier Well, I'm sorry to do this again, but I'm going to bow out and release this task. I just got a job (yay!) and I'm going to prioritize that for now.

pjsier commented 4 years ago

@ekand no problem, and congrats on the job!

SubtleHyperbole commented 4 years ago

Hey PJ, so I am working on this one (because the Illinois Department of Corrections seems not to have been posting info about its public meetings for the last couple of years, as it's supposed to), and I have a question.

It looks like, in general, the response variable used in the test .py file comes from a method called file_response, which pulls a saved offline copy of the webpage that was created (I think?) when the spider was generated on the command line, leaving no way of pulling the additional pages that might be needed to completely parse all meetings.

For the site in this issue (chi_northwest_home_equity, I think), the meetings are listed in pages of 10, with each additional page at /page/2/, /page/3/, and so on. Normally when scraping a site like this, I would use requests to fetch each page, checking its status code for a 4xx and stopping the scraper once I hit one.
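Roughly, what I'd normally do looks something like this (just a sketch using this site's URL pattern, not actual spider code):

```python
import requests

# Sketch of my usual approach: walk /page/N/ until the first 4xx response
base = "https://nwheap.com/category/meet-minutes-and-agendas/"
page = 1
while True:
    url = base if page == 1 else f"{base}page/{page}/"
    res = requests.get(url)
    if res.status_code >= 400:
        break  # no more pages
    # ... parse res.text for meetings here ...
    page += 1
```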

However, because the parser seems to pull from offline files captured when the spider was generated, I'm not sure what to do. I figure that when I create the spider on the command line I could probably pass in a list of URLs, but on the command line I can't (or at least don't know how to) check a URL's response code to know how many /page/#/ entries to include in that list.

There are a few methods in the CityScrapersSpider class that sound promising, like .make_requests_from_url(), but from what little documentation I can find, that specific one is deprecated. Besides, I imagine there must be a general best practice for how this should be accomplished. I've looked at the contribution guidelines page and couldn't find it, though if I missed it, I apologize in advance.

pjsier commented 4 years ago

Hi @SubtleHyperbole, I commented on the other issue but we are still interested in agencies that aren't updating as often as they should be. If you'd like to do this one instead let me know.

For your question on file_response, we've generally been saving the HTML files for other pages manually with something like wget or curl, since it's handled on a case-by-case basis and the template is focused on the most common cases. You can see an example of a spider with multiple pages in the tests for chi_ssa_42.
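As a rough illustration, a multi-page test setup could look like the sketch below (the second file name and URL are hypothetical, and the spider class name assumes the usual generated naming convention):

```python
from os.path import dirname, join

from city_scrapers_core.utils import file_response  # the helper mentioned above

from city_scrapers.spiders.chi_northwest_home_equity import (
    ChiNorthwestHomeEquitySpider,  # assumes the standard generated class name
)

spider = ChiNorthwestHomeEquitySpider()

# First page of results, saved to tests/files with wget or curl
test_response = file_response(
    join(dirname(__file__), "files", "chi_northwest_home_equity.html"),
    url="https://nwheap.com/category/meet-minutes-and-agendas/",
)

# A later page, saved separately (hypothetical file name)
test_response_page_2 = file_response(
    join(dirname(__file__), "files", "chi_northwest_home_equity_page_2.html"),
    url="https://nwheap.com/category/meet-minutes-and-agendas/page/2/",
)

parsed_items = [item for item in spider.parse(test_response)]
parsed_items += [item for item in spider.parse(test_response_page_2)]
# Note: if parse() also yields pagination Requests, filter those out of parsed_items
```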

For this spider, the simplest way to handle pagination is to scrape the "Older posts" link whenever it appears on the page, rather than listing all of the pages up front. That said, because the first page already goes well back into 2019, it may be fine to just pull the first page of results.
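A sketch of that pattern, with made-up selectors since I haven't dug into the page markup:

```python
import scrapy


class NwheapPaginationSketch(scrapy.Spider):
    """Rough sketch of following "Older posts" links; names and selectors are hypothetical."""

    name = "nwheap_pagination_sketch"
    start_urls = ["https://nwheap.com/category/meet-minutes-and-agendas/"]

    def parse(self, response):
        for post in response.css("article"):  # hypothetical selector
            yield {"title": " ".join(post.css("a::text").getall()).strip()}

        # Follow "Older posts" if it's on the page instead of listing pages up front
        next_url = response.css("a.next::attr(href)").get()  # hypothetical selector
        if next_url:
            yield response.follow(next_url, callback=self.parse)
```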

SubtleHyperbole commented 4 years ago

great thank you!

SubtleHyperbole commented 4 years ago

Actually, are you sure it's chi_ssa_42? I'm looking at their website, both https://ssa42.org/ssa-42-meeting-dates/ and https://ssa42.org/minutes-of-meetings/, but I don't see any additional pages of meeting info.

pjsier commented 4 years ago

That scraper is just one example of including an additional page. il_commerce is another example that might be more similar, but either one follows the same overall idea of downloading separate pages to HTML for tests.

SubtleHyperbole commented 4 years ago

Okay, so I think I have the spider for this finished and (at least from what I can see) the test file finished as well.

Unfortunately, because I bounced around on a couple of other issues before finally landing on this one, there are files in my working directory that aren't correct (they have default spiders and test pages for il_corrections, chi_housing, and cook_human_rights), so I don't want to submit a pull request because I'm pretty sure it will try to submit those as well.

Should I just start a whole new clone of the project (fork? not sure of the nomenclature), start a new branch for this issue, copy the spider and test file over, and then submit the pull? Er... why isn't it called a push? It seems like I'm requesting that the changes I've made locally on my laptop get PUSHED to the main project directory. Why is this called a pull request?

pjsier commented 4 years ago

@SubtleHyperbole glad to hear it! You should be able to only stage the files that are relevant and then commit those. So it could be something like this:

```
git add city_scrapers/spiders/chi_northwest_home_equity.py
git add tests/test_chi_northwest_home_equity.py
git commit -m "Add chi_northwest_home_equity"
```

And "pull request" is a GitHub-specific term (GitLab uses "merge request"), but my understanding has been because it's requesting the project maintainer to "pull" in your changes

SubtleHyperbole commented 4 years ago

Oh, duh. Lol, that makes sense. I have a tendency to only think about things from my own perspective sometimes, hah!

SubtleHyperbole commented 4 years ago

Crap. I just submitted the request and then realized that I never ran those code cleaners the FAQ says to run beforehand. Lint, I think?

pjsier commented 4 years ago

@SubtleHyperbole No problem! I'm not seeing the request, but it's fine to make commits to a branch after you've opened a pull request, and that's usually the case when we review them. You can run the style checks with the commands in the docs.

SubtleHyperbole commented 4 years ago

Hmmm, I ran those three lines of code you listed in your last post in my terminal, inside the pipenv shell, while sitting in the main city-scrapers directory (so that the relative file paths in the three lines would resolve).

SubtleHyperbole commented 4 years ago

```
(git) bash-3.2$ git add city_scrapers/spiders/chi_northwest_home_equity.py
(git) bash-3.2$ git add tests/test_chi_northwest_home_equity.py
(git) bash-3.2$ git commit -m "Add chi_northwest_home_equity"
[0672-spider-chi_northwest_home_equity 64dfa3d] Add chi_northwest_home_equity
 2 files changed, 190 insertions(+)
 create mode 100644 city_scrapers/spiders/chi_northwest_home_equity.py
 create mode 100644 tests/test_chi_northwest_home_equity.py
(git) bash-3.2$
```

pjsier commented 4 years ago

Gotcha, those commands created a commit, but you'll need to push it and submit a pull request separately. The workflow is usually called the "GitHub Flow," and there's more information on it here.

SubtleHyperbole commented 4 years ago

Just as an update: I literally had the spider completed, but in an effort at completeness I emailed the admin of the site to ask about what seemed like a small discrepancy between the lists of events (yes, the page seems to have multiple sources of meeting list data), and to my chagrin I got a reply that they've decided to revamp how the site provides info on meetings.

In other words, my spider is now entirely broken, LMAO. Right now I'm waiting for their new system to work out one last kink before I get back to reworking the spider. Just wanted to note that I haven't given up on this or anything.

Oh, also, the main events page (nwheap.com/events/) is now a 404. It might come back, though; that's what I'm waiting to find out.

pjsier commented 4 years ago

Thanks for the update! I think it's fine to submit as is for now if it's still working.

KevivJaknap commented 1 year ago

Hey, I would like to tackle this issue.

haileyhoyat commented 1 year ago

@KevivJaknap Hello! Thanks so much for checking out our project. Go for it.

KevivJaknap commented 1 year ago

@haileyhoyat Just wanted to let you know that I've submitted a pull request.