biglocalnews / civic-scraper

Tools for downloading agendas, minutes and other documents produced by local government
https://civic-scraper.readthedocs.io

Scraper breaking on Alameda County Water District #176

Closed zstumgoren closed 2 days ago

zstumgoren commented 6 months ago

Alameda County Water District scrape is failing:

https://www.acwd.org/AgendaCenter

Removing the agency from our GDoc scraping list until we can debug.

Stacktrace from Prefect:

ERROR ON SCRAPER TASK for https://www.acwd.org/AgendaCenter. Here's the stack trace:
Traceback (most recent call last):
  File "/etl/utils/scrape.py", line 59, in scrape_agency
    assets_meta = site.scrape(
                  ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/civic_scraper/platforms/civic_plus/site.py", line 68, in scrape
    file_metadata = self.parser_kls(raw_html).parse()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/civic_scraper/platforms/civic_plus/parser.py", line 20, in parse
    metadata = self._extract_asset_data(divs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/civic_scraper/platforms/civic_plus/parser.py", line 42, in _extract_asset_data
    cmte_name = self._committee_name(div)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/civic_scraper/platforms/civic_plus/parser.py", line 71, in _committee_name
    div.h2.span.extract()
    ^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'span'
taz77 commented 2 weeks ago

This is the same as #91

It is caused by calling extract(), which removes the element from the bs4 tree. The next line, which grabs the H2 text, then fails because the h2 has already been removed by the extract.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#extract

It seems the arrow element is not present:


```
        # Remove span that contains
        # arrow ▼ for toggling meeting list
```
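
For anyone hitting the same traceback, here is a minimal sketch of a defensive version of the lookup. It is not the patch that shipped, and it is not the library's actual `_committee_name` method: the function name `committee_name` and the HTML fragments are made up for illustration. It only assumes what the traceback shows, namely that `div.h2` can be `None`, plus the possibility that the arrow span is missing.

```python
from bs4 import BeautifulSoup


def committee_name(div):
    """Return the heading text of a listing div, or None if it has no <h2>.

    Hypothetical helper for illustration; not civic-scraper's own code.
    """
    header = div.h2
    if header is None:
        # No <h2> at all: this is the case behind
        # "'NoneType' object has no attribute 'span'".
        return None
    arrow = header.span
    if arrow is not None:
        # Detach only the toggle-arrow span; the <h2> itself stays in the tree.
        arrow.extract()
    return header.get_text(strip=True)


# Invented fragments covering the three cases discussed in this thread.
with_arrow = BeautifulSoup(
    "<div><h2><span>▼</span>Board of Directors</h2></div>", "html.parser"
).div
without_arrow = BeautifulSoup(
    "<div><h2>Finance Committee</h2></div>", "html.parser"
).div
without_h2 = BeautifulSoup("<div><p>No heading here</p></div>", "html.parser").div

print(committee_name(with_arrow))     # Board of Directors
print(committee_name(without_arrow))  # Finance Committee
print(committee_name(without_h2))     # None
```

Returning `None` for a div without a heading is just one option; a caller could equally skip such divs or fall back to a placeholder name.
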
zstumgoren commented 2 days ago

@taz77 apologies for the delay and many thanks for the contribution! I'm going to review and add some test coverage for this issue today. Hopefully I'll have it merged and pushed to PyPI today or early next week.

zstumgoren commented 2 days ago

@taz77 Fix is deployed to PyPI as version 0.2.10. Closing this but feel free to re-open or file a new issue if you still have problems.