biglocalnews / civic-scraper

Tools for downloading agendas, minutes and other documents produced by local government
https://civic-scraper.readthedocs.io

East Brandywine failing on content-length #177

Closed by zstumgoren 7 months ago

zstumgoren commented 7 months ago

East Brandywine, PA publishes HTML versions of its agendas (in addition to PDFs), and at least one of them is missing the Content-Length header in its HTTP response.

https://www.ebrandywine.org/AgendaCenter

That's causing the stack trace below.

A simple fix is to loosen the screws slightly by falling back to a value of -1 when Content-Length is missing. That metadata doesn't appear to be required by downstream processes (it's stored as a JSON blob in Postgres), but just in case, I used a number rather than None...
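As a rough illustration of the fallback described above, here is a minimal sketch. The helper name `content_length_from` and the constant `MISSING_CONTENT_LENGTH` are hypothetical and not part of civic-scraper; the real change may simply be a `.get()` with a default at the point where the header is read.

```python
from typing import Mapping

# Hypothetical sentinel for responses that carry no Content-Length header.
MISSING_CONTENT_LENGTH = -1


def content_length_from(headers: Mapping[str, str]) -> int:
    """Return Content-Length as an int, or -1 when the header is absent.

    Assumes the mapping is keyed by lowercase header names. The headers
    object returned by requests is case-insensitive, so the same .get()
    call works there; a plain dict (used below) is case-sensitive.
    """
    raw = headers.get("content-length")
    if raw is None:
        return MISSING_CONTENT_LENGTH
    return int(raw)


# A PDF response that reports its size, and an HTML response that does not.
pdf_headers = {"content-type": "application/pdf", "content-length": "2048"}
html_headers = {"content-type": "text/html"}

pdf_length = content_length_from(pdf_headers)
html_length = content_length_from(html_headers)
```

Using `.get()` instead of indexing avoids the `KeyError`, and returning a number keeps the field's type consistent for anything downstream that expects one.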

Stacktrace

From Prefect Cloud metadata flow run

ERROR ON SCRAPER TASK for https://www.ebrandywine.org/AgendaCenter. Here's the stack trace:
Traceback (most recent call last):
  File "/etl/utils/scrape.py", line 59, in scrape_agency
    assets_meta = site.scrape(
                  ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/civic_scraper/platforms/civic_plus/site.py", line 69, in scrape
    assets = self._build_asset_collection(file_metadata)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/civic_scraper/platforms/civic_plus/site.py", line 152, in _build_asset_collection
    "content_length": headers["content-length"],
                      ~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/requests/structures.py", line 52, in __getitem__
    return self._store[key.lower()][1]
           ~~~~~~~~~~~^^^^^^^^^^^^^
KeyError: 'content-length'
zstumgoren commented 7 months ago

Using None as the fallback causes the stack trace below in civic-prefect-flow-dev:

Encountered exception during execution:
Traceback (most recent call last):
  File "/etl/flows/civic_plus/metadata.py", line 75, in save_metadata
    aw_doc = AgendaWatchDocument.objects.get(meta__document_id=doc.id)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/django/db/models/query.py", line 650, in get
    raise self.model.DoesNotExist(
documents.models.AgendaWatchDocument.DoesNotExist: AgendaWatchDocument matching query does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/engine.py", line 2107, in orchestrate_task_run
    result = await call.aresult()
             ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/_internal/concurrency/calls.py", line 326, in aresult
    return await asyncio.wrap_future(self.future)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/_internal/concurrency/calls.py", line 351, in _run_sync
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/etl/flows/civic_plus/metadata.py", line 85, in save_metadata
    meta=doc.to_dict()
         ^^^^^^^^^^^^^
  File "/etl/utils/document.py", line 78, in to_dict
    "content_length": int(self.asset.content_length),
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
zstumgoren commented 7 months ago

Setting content-length to -1 resolved the issue and should allow us to pinpoint these records for fixes downstream.
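For context, a small sketch of why -1 works where None did not, and how the sentinel could be used to find the affected records later. The record layout, example URLs, and the `needs_backfill` helper are assumptions for illustration only; they are not taken from civic-scraper or the ETL code.

```python
def coerce_length(value):
    """Mirror the int(...) coercion seen in the to_dict traceback."""
    return int(value)


def none_is_rejected():
    """Return True if coercing None raises TypeError, as in the traceback."""
    try:
        coerce_length(None)
    except TypeError:
        return True
    return False


def needs_backfill(records):
    """Return the records whose content_length is the -1 sentinel."""
    return [r for r in records if coerce_length(r["content_length"]) == -1]


# Hypothetical metadata records; the URLs are placeholders.
records = [
    {"url": "https://example.com/agenda.pdf", "content_length": "2048"},
    {"url": "https://example.com/agenda.html", "content_length": -1},
]

flagged = needs_backfill(records)
```

Because -1 can never be a real byte count, it survives the `int()` coercion and still stands out as a marker for records that need a follow-up fix.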