City-Bureau / city-scrapers-fresno

City Scrapers for Fresno
MIT License

0046 spider san joaquin river conservancy #94

Closed RenTrieu closed 2 years ago

RenTrieu commented 2 years ago

Summary

Issue: #46

Hello, here is my pull request with my spider and corresponding tests for the San Joaquin River Conservancy. This site was a little difficult to parse because all of the meeting links and labels are siblings with inconsistent formatting. The solution I came up with was to walk the elements sequentially and build a dictionary whose keys are meeting titles and whose values are their corresponding links.
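A rough sketch of that sequential approach, using only the standard library rather than the spider's actual Scrapy selectors. The sample markup, the choice of `<strong>` for labels, and the `SiblingLinkParser` name are assumptions for illustration, not the Conservancy's real page structure.

```python
from html.parser import HTMLParser


class SiblingLinkParser(HTMLParser):
    """Walk flat sibling elements in document order and group each link
    under the most recent meeting label seen before it."""

    def __init__(self):
        super().__init__()
        self.meetings = {}    # meeting title -> list of {"title", "href"}
        self._current = None  # most recent meeting title
        self._tag = None      # tag currently being read ("strong" or "a")
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # Assumption: labels are <strong> elements and links are <a> elements.
        if tag in ("strong", "a"):
            self._tag = tag
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._tag:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        # Collapse inconsistent whitespace in labels and link text.
        text = " ".join("".join(self._text).split())
        if tag == "strong" and text:
            self._current = text
            self.meetings.setdefault(text, [])
        elif tag == "a" and self._current and self._href:
            self.meetings[self._current].append(
                {"title": text, "href": self._href}
            )
        self._tag = None


# Hypothetical markup: labels and links are all siblings inside one <p>.
SAMPLE = """
<p>
<strong>January 13, 2022 Board Meeting</strong>
<a href="/agenda-jan.pdf">Agenda</a>
<a href="/minutes-jan.pdf"> Minutes </a>
<strong>February 2, 2022  Board Meeting</strong>
<a href="/agenda-feb.pdf">Agenda</a>
</p>
"""

parser = SiblingLinkParser()
parser.feed(SAMPLE)
meetings = parser.meetings
```

Because Python dictionaries preserve insertion order, `meetings` keeps the meetings in the order they appear on the page.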

On the website, the meeting times are stated to be 10:00 AM between March and October, and 10:30 AM otherwise. I found this to be mostly consistent for the 2021 meetings, but the 2022 meetings seem to deviate more often. The only way I see to get more accurate meeting times would be to parse the agenda PDFs.
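For reference, the stated rule can be sketched as a small helper. The function name `default_start` is hypothetical, and treating both March and October as inclusive is an assumption, since the site's wording does not say which way the boundaries fall.

```python
from datetime import date, datetime, time


def default_start(meeting_date: date) -> datetime:
    """Return the start time implied by the site's stated schedule."""
    # Assumption: "between March and October" includes both end months.
    if 3 <= meeting_date.month <= 10:
        start = time(10, 0)
    else:
        start = time(10, 30)
    return datetime.combine(meeting_date, start)


spring = default_start(date(2022, 3, 16))
winter = default_start(date(2022, 11, 2))
```

This only gives a fallback value; it does not catch the 2022 meetings that deviate from the stated schedule.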

This is my first pull request on this project, so please let me know if there are any revisions I should make or if there are any suggestions I should consider.

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

Questions

I didn't see any packages like PyPDF2 in the Pipfile, but I was wondering: is that something we might add in the future to support parsing information from PDFs?

Also, should our spiders focus on parsing information from the most current meetings? Or should they be able to parse any past information that might be available on the website?

ghost commented 2 years ago

Thank you for contributing!

To answer your questions: right now some of the other scrapers are using pdfminer.six, so that would be preferred, but I'm open to including other packages as needed.
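A minimal sketch of how an agenda's start time might be pulled out with pdfminer.six, assuming the agenda states the time in a form like "10:30 a.m." somewhere in its text. The function names and the regular expression are hypothetical and not taken from any existing scraper; `extract_text` is the pdfminer.six high-level helper.

```python
import re
from datetime import time
from typing import Optional

# Matches times such as "10:00 AM", "10:30 a.m." or "1:15pm".
TIME_PATTERN = re.compile(r"\b(\d{1,2}):(\d{2})\s*([AaPp])\.?\s*[Mm]\.?")


def find_start_time(text: str) -> Optional[time]:
    """Return the first 12-hour clock time found in text, or None."""
    match = TIME_PATTERN.search(text)
    if match is None:
        return None
    hour, minute = int(match.group(1)), int(match.group(2))
    if not (1 <= hour <= 12 and minute < 60):
        return None
    # Convert to 24-hour time: 12 AM -> 0, 12 PM -> 12, 1 PM -> 13.
    hour %= 12
    if match.group(3).lower() == "p":
        hour += 12
    return time(hour, minute)


def start_time_from_pdf(pdf_path: str) -> Optional[time]:
    """Extract text from an agenda PDF and look for a start time."""
    # Imported here so the text helper above works without pdfminer.six.
    from pdfminer.high_level import extract_text

    return find_start_time(extract_text(pdf_path))
```

Taking the first match is a simplification: an agenda that mentions another time before the start time would need a more specific pattern.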

Focusing on current information makes sense. There are far too many variations to worry about all historical information.