City-Bureau / city-scrapers

Scrape, standardize and share public meetings from local government websites
https://cityscrapers.org
MIT License
330 stars, 310 forks

New/add spider chi_ssa_35 and its test case #991

Open sosolidkk opened 3 years ago

sosolidkk commented 3 years ago

Summary

Issue: #568

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

Questions

I have some doubts about this spider. The website itself lists the time, date, and place of every meeting, but the details of what will happen at the meeting on a given day are inside a .pdf document. What I did was put the content of the .pdf document into the description field in the spider. I don't know whether that was the correct approach, or whether the right way would be to iterate over the .pdf documents and parse the data inside them as meetings.

sosolidkk commented 3 years ago

Hello @pjsier, I was updating some stale code and made the corrections you suggested. I also updated the code so that each item gets the correct year, since it was previously fixed to datetime.today().year. The only problem I still have is your suggestion to use Minutes and Agenda in the title. I can't think of a way to do this dynamically, since all I have on the page is an <h4> followed by several <p> tags that contain the links. I would more or less have to count elements and hard-code the process. Do you have any better suggestions?

pjsier commented 3 years ago

@sosolidkk thanks for the changes! As I mentioned in the comment, the href attribute usually contains "Agenda" or "Minutes", which is one way to do it. You could also loop through a selector that iterates over the immediate children of .content and update the document name any time it runs into an h4.
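A minimal sketch of the first suggestion (reading the document type from the href), assuming the word appears somewhere in the link's path. The function name classify_link and the "Document" fallback are hypothetical and not taken from the spider:

```python
from urllib.parse import unquote, urlparse


def classify_link(href, default="Document"):
    """Return "Agenda" or "Minutes" if that word appears in the link's path.

    Hypothetical helper: falls back to `default` when neither word is found.
    """
    # Decode percent-escapes and compare case-insensitively.
    path = unquote(urlparse(href).path).lower()
    for label in ("Agenda", "Minutes"):
        if label.lower() in path:
            return label
    return default
```

For example, classify_link("/uploads/SSA35_minutes_March.pdf") would give "Minutes". Links whose file names carry neither word would still need the heading-based approach.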

sosolidkk commented 3 years ago

Hey @pjsier, sorry for the delay. I've updated this PR with the changes you requested. Now I'm iterating over all the inner elements of the body and separating the items into groups based on their <h4> title value, which can be Agenda, Schedule, or Minutes.
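The grouping described here could look roughly like the sketch below. It is not the spider's actual code: it uses the standard library's html.parser so it runs without Scrapy, it walks the whole fragment rather than only the children of .content, and the names HeadingGrouper and group_links_by_heading are hypothetical.

```python
from html.parser import HTMLParser


class HeadingGrouper(HTMLParser):
    """Collect link hrefs under the most recent <h4> heading."""

    def __init__(self):
        super().__init__()
        self.groups = {}      # heading text -> list of hrefs
        self.current = None   # heading the parser is currently under
        self._in_h4 = False
        self._h4_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h4":
            # Start buffering the heading text.
            self._in_h4 = True
            self._h4_text = []
        elif tag == "a" and self.current is not None:
            href = dict(attrs).get("href")
            if href:
                self.groups[self.current].append(href)

    def handle_data(self, data):
        if self._in_h4:
            self._h4_text.append(data)

    def handle_endtag(self, tag):
        if tag == "h4":
            # Heading finished: later links belong to this group.
            self._in_h4 = False
            self.current = "".join(self._h4_text).strip()
            self.groups.setdefault(self.current, [])


def group_links_by_heading(html):
    """Return {heading: [href, ...]} for the given HTML fragment."""
    parser = HeadingGrouper()
    parser.feed(html)
    parser.close()
    return parser.groups
```

Links that appear before the first <h4> are skipped, which avoids counting <p> tags or hard-coding positions.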