stucka closed this issue 2 weeks ago
I am getting the same error scraping other sites including Palo Alto, CA. Here is the command string I used:
pipenv run civic-scraper scrape --download --start-date=2024-07-01 --url https://www.cityofpaloalto.org/Departments/City-Clerk/City-Meeting-Groups/Meeting-Agendas-and-Minutes
@joffemd Unfortunately we don't currently support that particular version of CivicPlus. If you'd like to find sites that civic-scraper is capable of scraping, one good place to start is the CivicPlus (or Legistar) spreadsheets linked from our Find a Site to Scrape docs. Note that we've tested most heavily on CivicPlus, but your mileage may vary -- those lists were compiled a few years ago and sites may have moved or updated since we first checked them. The sites flagged with "yes" in the "scrape" column of the CivicPlus lists are the ones most likely to be up-to-date and operational, as those are the ones we're actively scraping during this early stage of the project.
@stucka Just noticing this ticket now. The CLI error you mentioned appears to stem from a bug in the dynamic loading of platform code based on the URL. The strategy is a bit naive, and Roanoke's URL doesn't quite meet the expectations of the existing code and therefore fails to identify CivicPlus as the appropriate platform, leading to the error you mention.
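To illustrate the kind of failure described above, here is a minimal sketch of URL-based platform detection. It is not civic-scraper's actual code: the pattern table, the `guess_platform` helper, and the `ignore_case` flag are all hypothetical, chosen only to show how a strict pattern match can miss a URL such as Roanoke's.

```python
import re
from typing import Optional

# Hypothetical pattern table (an assumption for illustration only,
# not taken from civic-scraper's source).
PLATFORM_PATTERNS = {
    "civicplus": r"civicplus|AgendaCenter",
    "legistar": r"legistar",
}


def guess_platform(url: str, ignore_case: bool = False) -> Optional[str]:
    """Return the first platform whose pattern matches the URL, else None."""
    flags = re.IGNORECASE if ignore_case else 0
    for name, pattern in PLATFORM_PATTERNS.items():
        if re.search(pattern, url, flags):
            return name
    return None


url = "https://www.roanokeva.gov/agendacenter"
print(guess_platform(url))                    # strict match finds no platform
print(guess_platform(url, ignore_case=True))  # relaxed match finds one
```

In a design like this, a caller that treats the lookup result as a class and tries to instantiate it without checking for `None` would hit a Python `TypeError`, which is one plausible way an error of that type could surface.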
We've been planning to deprecate the CLI since there's such wide variation in platforms across geography and time (i.e. many different versions of each platform type), so unfortunately we don't plan to fix the CLI bug at this point.
But the good news is that v0.2.10 of civic-scraper does handle the Python-layer bug you mention, so it's possible to whip up a simple Python script to work with that particular site (and others affected by the same bug). Here's a snippet of code that should work:
from civic_scraper.platforms import CivicPlusSite

url = "https://www.roanokeva.gov/agendacenter"
cp = CivicPlusSite(url)
results = cp.scrape(
    start_date="2024-11-01",
    end_date="2024-11-30",
)
for result in results:
    print(result)
One last note worth mentioning in case you're actively working on Roanoke. It appears the agency now posts agendas on yet another version of CivicPlus which we don't yet support:
So I suspect the above code snippet may not ultimately prove helpful. But it could at least be used to scrape older documents if that's useful to your project. HTH!
@stucka Whoops. I read your note too quickly. It appears you encountered the same bug both at the CLI layer and by invoking the Runner (part of the Python-layer plumbing behind the CLI). That Runner will likely be removed as part of the CLI deprecation, and the bug I was referring to was yet another issue that you would have discovered by directly using the CivicPlusSite class. Just to be clear -- it's that secondary bug which is now fixed in v0.2.10; the Runner bug is likely still present, but again, we're planning to remove the Runner, so unfortunately there aren't any plans to address that one.
On both Windows and Linux, whether using the regular command line, the command line through pipenv run civic-scraper, or Jupyter using Runner, I get the same TypeError when trying to scrape a site of an unknown platform. As far as I know, there is no other way to scrape a site of an unknown platform. This works for at least CivicPlus, but I understand other platforms are supported. The error is something like: