City-Bureau / city-scrapers

Scrape, standardize and share public meetings from local government websites
https://cityscrapers.org
MIT License

WIP 0880 spider chi ssa 20 #989

Open guytet opened 3 years ago

guytet commented 3 years ago

Questions

I'm running into an issue while trying to iterate over <h3>2019 SSA Meetings</h3>, since this tag and its siblings are where the meeting information is presented. I was able to move around the tree using: h2 = response.xpath("//h2[contains(text(), 'SPECIAL SERVICE AREAS')]/following-sibling::p/text()"), but as I understand it, contains() may not be the best route in this case because the whitespace can change.

In its current state, the spider parses every h3 on the page and picks out the one I'm interested in (2019 SSA Meetings), but from that point on I cannot use XPath to traverse the tree, because the h3 is an iterated item with no XPath siblings.

I've tried relative XPath expressions such as (./) within the iterated item, but it seems that when the regex matches I'm holding a single XPath element, and possibly nothing more.

I'd be happy to receive some guidelines as to how I can keep on solving this issue. Thank you.

guytet commented 3 years ago

Changed the logic and pushed. To normalize whitespace and lower-case the text, I added:

base = [re.sub(r"\s+", " ", item).lower() for item in base]

When the spider is run, the result is now:

2019 ssa meetings
ssa 20:

wednesday, june 5, 9 a.m.
beverly bank & trust, 10258 s. western ave.
wednesday, july 10, 9 a.m.
beverly bank & trust, 10258 s. western ave.

I hope that from this point onward I can move on to processing the start times using the parse_start method. Does this look decent enough to move on to parse_start? :)
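As a rough illustration of that next step, here is a minimal sketch of turning one of the normalized lines above into a naive datetime. The function names, the regex, and the idea of taking the year from the section heading are assumptions made for this example, not the code in this pull request.

```python
import re
from datetime import datetime

# English month names, hard-coded so the parsing does not depend on the locale.
MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]

# Matches e.g. "june 5, 9 am" or "october 3, 1:30 pm" inside a longer line.
DATE_TIME = re.compile(r"([a-z]+) (\d{1,2}), (\d{1,2})(?::(\d{2}))? ?(am|pm)")


def normalize(text):
    """Collapse whitespace runs, trim the ends, and lower-case the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def parse_start(line, year):
    """Return a naive datetime for a meeting line, or None if it has no date."""
    cleaned = normalize(line).replace("a.m.", "am").replace("p.m.", "pm")
    match = DATE_TIME.search(cleaned)
    if not match:
        return None
    month, day, hour, minute, meridiem = match.groups()
    if month not in MONTHS:
        return None
    # Convert the 12-hour clock to 24-hour: 12 am -> 0, 12 pm -> 12.
    hour_24 = int(hour) % 12 + (12 if meridiem == "pm" else 0)
    try:
        return datetime(year, MONTHS.index(month) + 1, int(day), hour_24, int(minute or 0))
    except ValueError:
        return None  # e.g. an impossible day such as "june 31"


if __name__ == "__main__":
    print(parse_start("wednesday, june 5, 9 a.m.", 2019))
```

Address lines such as "beverly bank & trust, 10258 s. western ave." do not match the pattern and come back as None, so they can be routed to location parsing instead.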

pjsier commented 3 years ago

@guytet sorry for the delay, I think that's a good next step!

guytet commented 3 years ago

> @guytet sorry for the delay, I think that's a good next step!

No problem @pjsier :). Thank you for the feedback.

guytet commented 3 years ago

In its current state, I believe (hope) the naive datetime object is being returned correctly by _parse_start(). If this seems satisfactory I can move forward; if not, please note what I should improve or change and I'll be happy to keep working on it.

guytet commented 3 years ago

> In its current state, I believe (hope) the naive datetime object is being returned correctly by _parse_start(). If this seems satisfactory I can move forward; if not, please note what I should improve or change and I'll be happy to keep working on it.

Hmm, something is broken in the last commit. Please ignore the above; I'm back to working on it :)

guytet commented 3 years ago

Updates for the last commit:

I've looked at other spiders under `city_scrapers/spiders`. Please review and let me know which items require more work. Thank you.

pjsier commented 3 years ago

@guytet Thanks for the updates! It looks like the test file isn't in version control though?

guytet commented 3 years ago

> @guytet Thanks for the updates! It looks like the test file isn't in version control though?

@pjsier You're right. I will look into that. (Sorry for the delayed response; for some reason I didn't see an email from GitHub about your comment.)

guytet commented 3 years ago

Another comment I'd like to make: I'm working on another version of this spider, but it could take more time. The main challenge I see is that this page handles two SSAs (20 and 64), so not all of the meetings and dates are data we'd like to process or deliver.

I'm sure the next item is not unique, but it is challenging nonetheless: the meeting info is posted using separate HTML tags, which may not keep the same structure when newer meetings are added. Only 2019 meetings are posted for now, so I'm trying to look ahead and make sure the spider is still valid when meetings for future years are posted.
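One way both concerns might be handled, sketched under assumptions: walk the normalized lines in order, track the current year from any "NNNN ssa meetings" heading rather than hard-coding 2019, and keep only the lines that follow the label of the wanted SSA. The "ssa 20:" label appears in the output earlier in this thread; an equivalent "ssa 64:" label and the 2021 lines in the demo are hypothetical, as is the lines_for_ssa helper.

```python
import re

# Both patterns expect lines that are already whitespace-normalized and lower-cased.
YEAR_HEADING = re.compile(r"^(\d{4}) ssa meetings$")
SSA_LABEL = re.compile(r"^ssa (\d+):$")


def lines_for_ssa(lines, wanted_ssa):
    """Return (year, line) pairs that belong to the wanted SSA number."""
    year = None
    current_ssa = None
    results = []
    for line in lines:
        heading = YEAR_HEADING.match(line)
        if heading:
            # A new year section starts; forget the SSA label from the old one.
            year = int(heading.group(1))
            current_ssa = None
            continue
        label = SSA_LABEL.match(line)
        if label:
            current_ssa = label.group(1)
            continue
        if year is not None and current_ssa == wanted_ssa:
            results.append((year, line))
    return results


if __name__ == "__main__":
    demo = [
        "2019 ssa meetings",
        "ssa 20:",
        "wednesday, june 5, 9 a.m.",
        "beverly bank & trust, 10258 s. western ave.",
        "ssa 64:",  # hypothetical label for the second SSA
        "tuesday, june 11, 10 a.m.",
        "2021 ssa meetings",  # hypothetical future section
        "ssa 20:",
        "wednesday, january 6, 9 a.m.",
    ]
    for year, line in lines_for_ssa(demo, "20"):
        print(year, line)
```

Because the year comes from whichever heading precedes a line, sections for new years would be picked up without code changes, provided the headings keep the same wording.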

guytet commented 3 years ago

Added the test file. Please let me know if anything needs improving.

guytet commented 3 years ago

> Thanks for your work on this! Let me know if any of the comments aren't clear

Thank you @pjsier, I will review your insightful input and adjust accordingly, and will of course ask if anything's unclear.

guytet commented 3 years ago

Well, this just in: SSA 20/64 has meetings scheduled for 2021! (And things have changed a bit on the page.) I guess it's back to the drawing board :) https://www.mpbhba.org/business-resources/