Errors using command line, library to scrape without knowing the platform

stucka commented 10 months ago

On both Windows and Linux, whether I use the regular command line, the command line through pipenv run civic-scraper, or the Runner in Jupyter, I get the same TypeError when trying to scrape a site of an unknown platform. As far as I know, there is no other way to scrape a site of an unknown platform. This works for at least CivicPlus, but I understand other platforms are supported.

The error is something like:

C:\data\agenda-watch-speedrun\scrapes>civic-scraper scrape --url https://www.roanokeva.gov/agendacenter
01-20 19:33 - civic_scraper.runner - Scraping 1 site(s) from 2024-01-20 to 2024-01-20...
Traceback (most recent call last):
  File "C:\Python\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Python\Scripts\civic-scraper.exe\__main__.py", line 7, in <module>
  File "C:\Python\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Python\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Python\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Python\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Python\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "C:\Python\lib\site-packages\civic_scraper\cli.py", line 90, in scrape
    runner.scrape(**kwargs)
  File "C:\Python\lib\site-packages\civic_scraper\runner.py", line 67, in scrape
    SiteClass = self._get_site_class(url)
  File "C:\Python\lib\site-packages\civic_scraper\runner.py", line 97, in _get_site_class
    return getattr(mod, class_name)
TypeError: getattr(): attribute name must be string

joffemd commented 1 month ago

I am getting the same error scraping other sites including Palo Alto, CA. Here is the command string I used:

pipenv run civic-scraper scrape --download --start-date=2024-07-01 --url https://www.cityofpaloalto.org/Departments/City-Clerk/City-Meeting-Groups/Meeting-Agendas-and-Minutes

zstumgoren commented 2 weeks ago

@joffemd Unfortunately we don't currently support that particular version of CivicPlus. If you'd like to find sites that civic-scraper is capable of scraping, one good place to start is the CivicPlus (or Legistar) spreadsheets linked to from our Find a Site to Scrape docs. Note that we've most heavily tested on CivicPlus, but that your mileage may vary -- those lists were compiled a few years ago and sites may have moved or updated since we first checked them. The sites flagged with yes in the scrape column of the CivicPlus lists are the ones most likely to be up-to-date and operational, as those are the ones we're actively scraping during this early stage of the project.

zstumgoren commented 2 weeks ago

@stucka Just noticing this ticket now. The CLI error you mentioned appears to stem from a bug in the dynamic loading of platform code based on the URL. The strategy is a bit naive, and Roanoke's URL doesn't quite meet the expectations of the existing code and therefore fails to identify CivicPlus as the appropriate platform, leading to the error you mention.
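
To illustrate the failure mode, here is a minimal sketch. It is not civic-scraper's actual code: the URL patterns, the function names, and the stand-in platforms namespace are all hypothetical. It only shows how a URL-based lookup that finds no match can hand None to getattr() and produce the TypeError in the traceback above, and how an explicit check would give a clearer message.

```python
import re
import types

# Hypothetical URL patterns for illustration only; these are NOT the rules
# civic-scraper actually uses to detect a platform.
PLATFORM_PATTERNS = {
    "CivicPlusSite": re.compile(r"civicplus\.com", re.IGNORECASE),
    "LegistarSite": re.compile(r"legistar\.com", re.IGNORECASE),
}

# Stand-in for a module that exposes one class per platform.
platforms = types.SimpleNamespace(CivicPlusSite=object, LegistarSite=object)


def guess_class_name(url):
    """Return the first class name whose pattern matches the URL, else None."""
    for class_name, pattern in PLATFORM_PATTERNS.items():
        if pattern.search(url):
            return class_name
    return None


def get_site_class(url):
    """Naive lookup: passes None to getattr() when no pattern matches."""
    class_name = guess_class_name(url)
    # getattr() requires a string attribute name, so None raises TypeError.
    return getattr(platforms, class_name)


def get_site_class_checked(url):
    """Same lookup, but fails with an explicit message for unknown platforms."""
    class_name = guess_class_name(url)
    if class_name is None:
        raise ValueError(f"Could not determine the platform for {url}")
    return getattr(platforms, class_name)
```

With these made-up patterns, a URL on an agency's own domain such as https://www.roanokeva.gov/agendacenter matches nothing, so get_site_class raises a TypeError while get_site_class_checked raises a ValueError naming the URL.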

We've been planning to deprecate the CLI since there's such wide variation in platforms across geography and time (i.e. many different versions of each platform type), so unfortunately we don't plan to fix the CLI bug at this point.

But the good news is that v0.2.10 of civic-scraper does handle the Python-layer bug you mention, so it's possible to whip up a simple Python script to work with that particular site (and others affected by the same bug). Here's a snippet of code that should work:

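from civic_scraper.platforms import CivicPlusSite

# Use the CivicPlus class directly instead of going through the CLI
url = "https://www.roanokeva.gov/agendacenter"
cp = CivicPlusSite(url)
# Limit the scrape to November 2024
results = cp.scrape(
    start_date="2024-11-01",
    end_date="2024-11-30",
)
# Print each scraped result
for result in results:
    print(result)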

One last note, in case you're actively working on Roanoke: it appears the agency now posts agendas on yet another version of CivicPlus, which we don't yet support.

So I suspect the above code snippet may not ultimately prove helpful. But it could at least be used to scrape older documents if that's useful to your project. HTH!

zstumgoren commented 2 weeks ago

@stucka Whoops. I read your note too quickly. It appears you encountered the same bug both at the CLI layer and when invoking the Runner (part of the Python-layer plumbing behind the CLI). That Runner will likely be removed as part of the CLI deprecation, and the bug I was referring to was yet another issue, one you would have discovered by using the CivicPlusSite class directly. Just to be clear -- it's that secondary bug which is now fixed in v0.2.10; the Runner bug is likely still present, but again, we're planning to remove the Runner, so unfortunately there aren't any plans to address that one.

biglocalnews / civic-scraper

Errors using command line, library to scrape without knowing the platform #175