City-Bureau / city-scrapers

Scrape, standardize and share public meetings from local government websites
https://cityscrapers.org
MIT License

1020 spider chi design #1023

Open hamma95 opened 2 years ago


Summary

Issue: #1020

Replace "ISSUE_NUMBER" with the number of your issue so that GitHub will link this pull request with the issue and make review easier.

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

Questions

1) I used a third-party library (w3lib) to conveniently remove HTML tags. I don't know if that's okay, which is why I haven't added it to the requirements files yet. I could implement the functionality without the library, but the remove_tags function is much more convenient and could be useful for other spiders too.

2) The test for the links is set to xfail because of a Unicode character in the result: in the actual result it is represented with ASCII characters. Should I change the ASCII to Unicode, or keep the ASCII?

3) In the links field, there might be some links with the same href but different titles, like in this example. Some titles are more descriptive than others, or contain more info, like the Zoom password. Should I keep the duplicate hrefs or remove them?

Include any questions you have about what you're working on.
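
To make questions 1 and 3 more concrete, here is a minimal sketch of what the two alternatives might look like. It is not the code in this PR. `strip_tags` and `dedupe_links` are hypothetical helper names, the tag stripper uses only the standard library (so its behaviour may differ from w3lib's `remove_tags` on edge cases such as HTML entities), and "keep the longest title" is only one possible rule for choosing between duplicate hrefs. It assumes each link is a dict with `href` and `title` keys.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True also decodes entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html_text):
    """Hypothetical stdlib-only helper: return html_text without tags."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def dedupe_links(links):
    """Hypothetical helper: keep one link per href.

    When several links share an href, the one with the longest title
    wins (an assumed stand-in for "most descriptive"). The order of
    first appearance of each href is preserved.
    """
    best = {}
    for link in links:
        current = best.get(link["href"])
        if current is None or len(link["title"]) > len(current["title"]):
            best[link["href"]] = link
    return list(best.values())
```

With a rule like this, a short "Zoom" link and a longer title for the same href that also carries the password would collapse into the longer one, so the extra information is not lost when duplicates are removed.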