City-Bureau / city-scrapers

Scrape, standardize and share public meetings from local government websites
https://cityscrapers.org
MIT License
332 stars, 310 forks

WIP Spider for issue #672 #929

Closed. JKReynolds closed 4 years ago

JKReynolds commented 4 years ago

Spider for #672. This is the first spider from me and my partner mingchan96. We think it looks good: we wrote some tests, everything passed, and the parser results look good.

JKReynolds commented 4 years ago

Okay, so we’ll go ahead and work on simplifying the code and grabbing those PDF links. I must say that the logic for parsing the upcoming and past meetings would still need to be different enough that switching to parsing the individual pages probably won't simplify the parsing that much. Also, the pages linked to in the sidebar are very different from the ones on the main page.

Additionally, the past meetings sometimes have posts both in the sidebar and on the main page. In these cases there would be extra complexity in avoiding duplicates while maximizing the information gathered, because the sidebar posts contain different information than the main page posts.
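To make the duplicate concern concrete, here is a minimal sketch of one way to merge a sidebar post and a main page post for the same meeting. It is not the spider's actual code: the record shape (plain dicts with a `date` key) and the function names `merge_records` and `dedupe_by_date` are assumptions made for illustration.

```python
def merge_records(primary, secondary):
    """Combine two records for the same meeting.

    A field from the primary record wins unless it is empty,
    in which case the secondary record's value is used.
    """
    merged = dict(primary)
    for key, value in secondary.items():
        if not merged.get(key):
            merged[key] = value
    return merged


def dedupe_by_date(main_page_posts, sidebar_posts):
    """Return one record per meeting date, merging fields from both sources.

    Assumes (hypothetically) that each post is a dict with a "date" key
    and that at most one meeting happens per date.
    """
    by_date = {}
    for post in list(main_page_posts) + list(sidebar_posts):
        key = post["date"]
        if key in by_date:
            by_date[key] = merge_records(by_date[key], post)
        else:
            by_date[key] = dict(post)
    return [by_date[key] for key in sorted(by_date)]
```

Keying on the date keeps the merge simple, but it would collapse two distinct meetings held on the same day, so a real spider might need a stricter key.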

Either way, we will simplify, get the PDF links, and then show you what we’ve got. Thanks for the feedback.

pjsier commented 4 years ago

I'm not sure why it would be different. Here's an upcoming meeting page: https://nwheap.com/events/board-meeting-13/ and here's a past meeting page: https://nwheap.com/events/board-meeting-16/. Is there a page I'm not looking at?

Once the PDF links are parsed, a handler for the spider_idle signal could have the spider parse https://nwheap.com/events/ instead of looking at the sidebar on the minutes page, since it looks like you'll have all the meetings there.

mingchan96 commented 4 years ago

We thought we needed to parse the Meet Minutes page (https://nwheap.com/category/meet-minutes-and-agendas/) to get past meetings, since the past meetings sidebar only contains a certain number of meetings. The past meetings sidebar not only contains fewer meetings than the Meet Minutes listing, but also doesn't contain links to the Meet Minutes PDFs. For example, the Past Meeting link (https://nwheap.com/events/board-meeting-16/) doesn't contain a Meet Minutes PDF, while the "Meet Minutes" link (https://nwheap.com/2019/10/09/october-10-2019-meeting-minutes-and-agenda/) does contain the PDF link but doesn't contain the meeting location or time.

pjsier commented 4 years ago

Right, that's why you start by parsing the links and associating them by date, which seems to be there, then move on separately to the meetings. You can then use the meeting details to get the related links.
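As a rough sketch of the "associate links by date" step, the snippet below reads the meeting date out of a minutes post URL and builds a date-to-links mapping. It assumes that every minutes post slug starts with the meeting date in the form `october-10-2019-...`, as the one URL quoted in this thread does; the names `SLUG_DATE_RE`, `date_from_post_url`, and `build_link_map` are hypothetical, and the real posts may need a different date source.

```python
import re
from datetime import datetime

# Matches a slug such as "/october-10-2019-meeting-minutes-and-agenda/".
# Assumption: the slug carries the meeting date; the /2019/10/09/ path
# segment in the same URL looks like the publish date, so it is ignored.
SLUG_DATE_RE = re.compile(r"/([a-z]+)-(\d{1,2})-(\d{4})-")


def date_from_post_url(url):
    """Return the meeting date encoded in a minutes post URL, or None."""
    match = SLUG_DATE_RE.search(url.lower())
    if not match:
        return None
    month, day, year = match.groups()
    try:
        parsed = datetime.strptime(f"{month.title()} {day} {year}", "%B %d %Y")
    except ValueError:
        # The first word of the slug was not a month name.
        return None
    return parsed.date()


def build_link_map(pdf_links_by_post):
    """Map meeting date -> list of PDF URLs.

    Expects a dict of {post_url: [pdf_url, ...]}; posts whose URL has
    no recognizable date are skipped.
    """
    link_map = {}
    for post_url, pdf_urls in pdf_links_by_post.items():
        meeting_date = date_from_post_url(post_url)
        if meeting_date is not None:
            link_map.setdefault(meeting_date, []).extend(pdf_urls)
    return link_map
```

With a mapping like this, the later pass over the meetings only needs `link_map.get(meeting_date, [])` to attach the related links.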

mingchan96 commented 4 years ago

So you want us to start with the All Events link (https://nwheap.com/events/) and then use the Meet Minutes link (https://nwheap.com/category/meet-minutes-and-agendas/) to associate the PDF links? Also, there are "Special meetings" that are not included in the "All Events" page. Do you want us to create meetings for those as well, as best we can, since there is no direct address associated with them?

pjsier commented 4 years ago

Yes, that's it. It looks like all meetings are at the same location, so it's fine to default there if it's not available. We should try to create a meeting for the special meetings if they aren't listed on the events page.
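A minimal sketch of that approach, under stated assumptions: events and links have already been parsed into plain Python structures, any date that has links but no event is treated as a special meeting, and `DEFAULT_LOCATION` is a placeholder because this thread does not give the actual venue. The function name `build_meetings` and the titles are hypothetical as well.

```python
# Placeholder only: replace with the venue shown on the events pages.
DEFAULT_LOCATION = {"name": "TBD", "address": ""}


def build_meetings(events, link_map, default_location=DEFAULT_LOCATION):
    """Combine parsed events with a {date: [links]} mapping.

    Events without a location fall back to the default, and dates that
    only appear in link_map become extra meetings (assumed here to be
    the special meetings missing from the events page).
    """
    meetings = []
    seen_dates = set()
    for event in events:
        seen_dates.add(event["date"])
        meetings.append({
            "title": event.get("title") or "Board Meeting",
            "date": event["date"],
            "location": event.get("location") or default_location,
            "links": link_map.get(event["date"], []),
        })
    for meeting_date, links in link_map.items():
        if meeting_date not in seen_dates:
            meetings.append({
                "title": "Special Meeting",  # assumed title
                "date": meeting_date,
                "location": default_location,
                "links": links,
            })
    return sorted(meetings, key=lambda meeting: meeting["date"])
```

Treating every unmatched date as a special meeting is a simplification; a regular meeting that has dropped off the events page would be labeled the same way, so the post title may be a better signal if it is available.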

pjsier commented 4 years ago

Closing since this hasn't been active in a while, but feel free to reopen!