Thanks so much for the detailed write-up here!
I'll try to answer some of the questions you've raised below.
I've repeated some of your code comments below so that @lcaswell and @PaulR-Docs from the Documenters can pick it up more easily.
I think for our use case, we would probably continue to split out the Planning Commission and Design Review meetings as separate organizations (meaning the Planning Commission is listed separately at https://cleveland.documenters.org/agencies/cleveland-city-planning-commission-179/ and the Design Review Committees at https://cleveland.documenters.org/agencies/cleveland-design-review-advisory-committees-157/ ).
I'd like to hear @PaulR-Docs's thoughts on that ^^
Regarding Point 3 at https://github.com/City-Bureau/city-scrapers-cle/blob/8d9b5d85cff60223bb0d15e0e3f369853155a4fe/city_scrapers/spiders/cle_design_review.py#L52
Putting the WebEx meeting information in the time_notes field is probably a good short-term solution.
The schema that we use to parse out the information about a meeting (start time, end time, day of the meeting, physical location, etc.) and assign it to a corresponding field does not have a field for information about how to access a meeting online.
Other information that doesn't fall neatly into one of the meeting fields in the schema has been included in the time_notes field for other scrapers.
(For example, the Cuyahoga County Library scraper - code at https://github.com/City-Bureau/city-scrapers-cle/blob/00f5db64b64e1ea0ddfa3b7f53d1f5db26c1042d/city_scrapers/spiders/cuya_library.py#L64 - also shows it visibly at https://cleveland.documenters.org/meetings/finance-committee-43120/)
@lcaswell, @PaulR-Docs What this means is that the information to join the meeting online would appear just below the meeting time on a meeting's page (similar to how "Details may change, confirm with staff before attending" appears at https://cleveland.documenters.org/meetings/finance-committee-43120/).
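For anyone who hasn't poked at the scraper code, here is a minimal sketch of what that looks like, assuming the field names from the shared Meeting item in city_scrapers_core; the title, date, and note text below are made up for illustration and aren't copied from the PR:

```python
# Minimal sketch (not the actual cle_design_review code) of carrying the WebEx
# info in time_notes, using the shared Meeting item from city_scrapers_core.
from datetime import datetime

from city_scrapers_core.constants import ADVISORY_COMMITTEE
from city_scrapers_core.items import Meeting

meeting = Meeting(
    title="Downtown/Flats Design Review Committee",  # made-up example values
    description="",
    classification=ADVISORY_COMMITTEE,
    start=datetime(2022, 1, 7, 9, 0),
    end=None,
    all_day=False,
    time_notes="Meeting held via WebEx; see the agenda for access details.",
    location={"name": "Remote", "address": ""},
    links=[],
    source="https://planning.clevelandohio.gov/designreview/schedule.php",
)
```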
Regarding https://github.com/City-Bureau/city-scrapers-cle/blob/8d9b5d85cff60223bb0d15e0e3f369853155a4fe/city_scrapers/spiders/cle_design_review.py#L37 - to sum up what @zabrahams has written:
The Design Review Committees meet on fixed days (e.g. "the 1st and 3rd Friday of the month"), and the specific meeting dates (e.g. January 1, 2022) are not explicitly listed on the website to scrape.
From a technical perspective, a best practice hasn't yet emerged for how to automatically scrape these meetings, even across other cities' scrapers: https://github.com/City-Bureau/city-scrapers-cle/issues/5
As I understand it (please correct me if I'm wrong!), @zabrahams' approach is to calculate which meeting dates and times would fall in the next 60 days based on those date rules (e.g. the 1st and 3rd Friday of the month), create those meetings, and populate them into the Documenters calendar (since they aren't listed on the https://planning.clevelandohio.gov/designreview/schedule.php webpage).
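(For anyone following along, here is a rough standalone sketch of what that kind of projection can look like - this isn't the PR's actual _calculate_meeting_days_per_month, and the 1st/3rd-Friday rule is hard-coded just for illustration:)

```python
# Standalone sketch of projecting "1st and 3rd Friday" dates over the next
# 60 days; not the spider's actual implementation.
from datetime import date, timedelta


def nth_weekdays(year, month, weekday, ordinals):
    """Return e.g. the 1st and 3rd Fridays (weekday=4) of the given month."""
    day = date(year, month, 1)
    # Advance to the first occurrence of the target weekday in the month
    day += timedelta(days=(weekday - day.weekday()) % 7)
    dates, ordinal = [], 1
    while day.month == month:
        if ordinal in ordinals:
            dates.append(day)
        day += timedelta(days=7)
        ordinal += 1
    return dates


def upcoming_meetings(start=None, horizon_days=60):
    start = start or date.today()
    end = start + timedelta(days=horizon_days)
    results, year, month = [], start.year, start.month
    while date(year, month, 1) <= end:
        for day in nth_weekdays(year, month, weekday=4, ordinals={1, 3}):
            if start <= day <= end:
                results.append(day)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return results
```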
The risk with this approach (assuming the code works exactly as intended), as @zabrahams mentioned, is that meeting dates would be listed at https://cleveland.documenters.org/meetings/ even though those meetings might be rescheduled (e.g. for holidays) or cancelled; we would be assuming that those meetings are always happening.
cc'ing @PaulR-Docs and @lcaswell - I would like your thoughts on this ^^^ Would this be problematic for the Documenters?
> Lastly, I know my Python is kind of rough, so I'm totally open to any stylistic comments! Like with setting the tmp var on line 121 in the scraper - I'm not sure that it's needed, but I did it to be safe! Just let me know any changes that would help!
(@raghavrao, interested in your thoughts from a technical perspective on this ^^^ )
@zabrahams @skorasaurus Thanks for working on this! I think it is perfectly fine to keep the Design Review Board meetings and the Planning Commission as separate agencies. We also keep the Board of Zoning Appeals as a separate agency even though it is part of the Planning Commission too.
As far as the Web-ex meeting links go, they don't really need to be scraped, since our assignment-making process includes manually entering instructions and meeting links for the documenters to access the meetings. Also, these are instructions that only assigned documenters can see, so there is less of a risk of zoombombing by bots that could potentially scrape our site for meeting links.
For example, some GCRTA meetings, like the CAC meetings, are held via web-ex only and are not streamed on FB live. For those we will add the web-ex links when we turn the scraped meeting into an assignment. The RTA/transit scraper never scrapes the meeting links directly.
I would suggest making the meeting description for these Design Review Board meetings say something simple like "Meetings will be held via Web-ex." Then when our team creates those assignments, we can just go to the source site, copy the meeting webex or zoom link and put that directly into the assignment instructions.
Let me know if that works.
@zabrahams @skorasaurus We also check that meetings are happening before we make them into assignments, so having the scraper just stick with the published schedule (1st and 3rd Friday) is fine for now as far as assigning work to documenters is concerned.
Having a line in the description about checking with staff is great too, in case any member of the public looks at our list of public meetings and wants to attend one.
@skorasaurus @PaulR-Docs Thanks for the comments! The discussion is super helpful! I updated the PR to move what had been in the time notes into the meeting description - and I added a sentence to the upcoming meeting description saying to check with staff!
I also removed that tmp var since it wasn't doing anything 😀
So I don't forget: I think this is ready to go, but I proposed to @zabrahams in Slack abstracting out the _calculate_meeting_days_per_month and _parse_meeting_schedule_info functions and a couple of related ones; I defer to @zabrahams on whether we should abstract them out so they can be reused by other scrapers (would the most appropriate place be https://github.com/City-Bureau/city-scrapers-core/blob/main/city_scrapers_core/spiders/spider.py ?).
I totally defer to you and don't want to hold this up much.
Hi! Sorry, I've had a super busy week, but I'm actually working on this now! I think it totally makes sense to abstract them out. I think there are 4 candidates I can see for where to put them:
1. https://github.com/City-Bureau/city-scrapers-core/blob/main/city_scrapers_core/spiders/spider.py
2. city_scrapers/mixins/cuya_county.py
3. city_scrapers/utils.py (which doesn't exist here but does in other city scraper projects)
4. a city_scrapers/utils folder

I think my preference would be to go with (4)? (I put my reasoning below.) Let me know what you think? And either way it wouldn't be too hard to move it around if someone changes their mind?
(1) feels to me like maybe we'd be building functionality that only applies to a small number of scrapers into all our scrapers - it seems better to me to separate the functionality so that only the scrapers that need it pull it in? That way we keep the core API lean and flexible and treat this more like an extension?
(2) makes a lot of sense, since we already have mixins, but it would be kind of inconsistent with the way the other mixins are used - they're treated more like partial scrapers that just need some gaps filled in, as opposed to small utilities? It might cause confusion about what should count as a mixin in the future?
(3) and (4) are similar, but the current utils.py file doesn't seem to be getting used at the moment, and with util files there's always the danger that they just become a grab bag of different weird stuff, so I guess I'd prefer to create a folder so we can give it its own file?
Though maybe it would make sense to make it a utility in core if we think other cities are going to need it too?
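(Roughly what option (4) could look like - the file and helper names below are placeholders, not necessarily what would land in the PR:)

```python
# Placeholder sketch of option (4); file and function names are hypothetical.
#
#   city_scrapers/
#       utils/
#           __init__.py
#           meeting_schedule.py   # home for the date-rule calculator helpers
#
# A spider that needs the calculator would pull it in explicitly, so scrapers
# that don't need it never import it:
#
#   from city_scrapers.utils.meeting_schedule import calculate_meeting_days_per_month
```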
@skorasaurus I think this is ready for re-review with the calculator abstracted out! Let me know what you think!
Thanks; on initial thought, I like option 4 since it doesn't shove everything into one file. I'll try to take a look at it more in depth tomorrow, or more likely Friday.
Thanks so much for all of this work! We should keep an eye on this to ensure that it's working as intended in 2022.
Summary
Issue: #15
Checklist
All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.
Questions
I had a couple of questions! The first is that the page changed in such a way that it's not trivial to figure out upcoming meetings. There are lists of existing agendas, and then it just says the meeting is on the 1st and 3rd Tuesday of the month, or something like that. The strategy I adopted was to treat existing links as authoritative and then to calculate meeting dates for the 60 days following the last agenda item for each subcommittee. I'm not sure if that's reliable enough - since there's no way to know if meetings are cancelled until after the next agenda pops up? But maybe it's better than not including future meetings? I don't know?
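(For context, here is a rough illustration of that kind of 60-day projection anchored on the newest agenda date. This is not the code in the PR - it uses python-dateutil's recurrence rules, made-up variable names, and a hard-coded 1st/3rd-Friday rule just as an example:)

```python
# Illustrative only, not the PR's implementation: project meetings on the
# 1st and 3rd Friday of each month for 60 days after the newest agenda date.
from datetime import datetime, timedelta

from dateutil.rrule import FR, MONTHLY, rrule

last_agenda_date = datetime(2021, 12, 3)  # placeholder for the newest scraped agenda
projected_starts = list(
    rrule(
        MONTHLY,
        byweekday=(FR(1), FR(3)),  # 1st and 3rd Friday of the month
        dtstart=last_agenda_date + timedelta(days=1),
        until=last_agenda_date + timedelta(days=60),
    )
)
```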
I think this problem comes up in a few places - the city planning scraper (#12) will have a similar issue, and I saw something similar discussed in #30 and #5. If we want to do something like the calculator I wrote, it might make sense to extract it into a mixin or utility package so that we can reuse it in multiple scrapers? But again, I'm not super sure!
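(If we did pull it out as a mixin, it might look something like this - the class and attribute names are made up for illustration and aren't from the actual codebase:)

```python
# Made-up sketch of a reusable mixin for "nth weekday of the month" schedules.
from datetime import timedelta


class RecurringScheduleMixin:
    """Mix into a spider whose source only publishes a recurrence rule."""

    schedule_weekday = 4        # Friday (Monday=0)
    schedule_ordinals = {1, 3}  # 1st and 3rd occurrence in each month
    schedule_horizon_days = 60

    def projected_dates(self, start):
        """Project meeting dates for schedule_horizon_days after `start`."""
        end = start + timedelta(days=self.schedule_horizon_days)
        current = start + timedelta(days=1)
        results = []
        while current <= end:
            ordinal = (current.day - 1) // 7 + 1
            if current.weekday() == self.schedule_weekday and ordinal in self.schedule_ordinals:
                results.append(current)
            current += timedelta(days=1)
        return results
```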
I also wasn't sure about how to handle the WebEx and COVID information. I put it in the time notes since that seemed to be the only place for generic notes about how the meeting would occur? But I'm not sure if that will work, especially with how the meeting gets displayed to end users in the calendar?
Lastly, I know my Python is kind of rough, so I'm totally open to any stylistic comments! Like with setting the tmp var on line 121 in the scraper - I'm not sure that it's needed, but I did it to be safe! Just let me know any changes that would help!
Thanks!!!!