Metro-Records / la-metro-dashboard

An Airflow-based dashboard for LA Metro

Scrape containers are killed on staging #75

Closed hancush closed 3 years ago

hancush commented 3 years ago

Our scrape container runs pupa dbinit as the entry point. This command runs migrations, then loads divisions.

Loading divisions is actually a fairly heavy operation. It looks like there is a bulk option, which could reduce the overhead. But, it won't work for databases containing data, because it deletes all divisions before loading, and jurisdictions (and probably also other models) reference divisions as protected foreign keys.
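
To make the failure mode concrete, here is a minimal sketch of the protected-foreign-key problem (simplified, assumed model names, not the actual opencivicdata schema):

```python
# Minimal sketch: why bulk-reloading divisions fails on a populated database.
# Model names are simplified stand-ins for the real opencivicdata schema.
from django.db import models


class Division(models.Model):
    id = models.CharField(max_length=300, primary_key=True)


class Jurisdiction(models.Model):
    # PROTECT means deleting a Division that any Jurisdiction references
    # raises django.db.models.ProtectedError instead of cascading.
    division = models.ForeignKey(Division, on_delete=models.PROTECT)


# The bulk option boils down to something like:
#
#     Division.objects.all().delete()    # ProtectedError once jurisdictions exist
#     Division.objects.bulk_create(rows)
#
# so it can only run against a database with no protected references yet.
```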

We either need to:

- make division loading cheaper (e.g., load only the divisions Metro actually uses), or
- skip loading divisions when the database already contains them.

Need to put this down for now. Any thoughts, @fgregg?

Example log, copied here in full since these are cleaned from the server at regular intervals:

```
*** Reading remote log from s3://la-metro-dashboard/staging/windowed_bill_scrape_sun_thru_thu/scrape/2021-05-13T14:20:00+00:00/1.log.
[2021-05-13 14:35:07,759] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: windowed_bill_scrape_sun_thru_thu.scrape 2021-05-13T14:20:00+00:00 [queued]>
[2021-05-13 14:35:07,842] {taskinstance.py:669} INFO - Dependencies all met for <TaskInstance: windowed_bill_scrape_sun_thru_thu.scrape 2021-05-13T14:20:00+00:00 [queued]>
[2021-05-13 14:35:07,842] {taskinstance.py:879} INFO - 
--------------------------------------------------------------------------------
[2021-05-13 14:35:07,842] {taskinstance.py:880} INFO - Starting attempt 1 of 1
[2021-05-13 14:35:07,842] {taskinstance.py:881} INFO - 
--------------------------------------------------------------------------------
[2021-05-13 14:35:07,865] {taskinstance.py:900} INFO - Executing <Task(BlackboxDockerOperator): scrape> on 2021-05-13T14:20:00+00:00
[2021-05-13 14:35:07,871] {standard_task_runner.py:53} INFO - Started process 4950 to run task
[2021-05-13 14:35:08,307] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: windowed_bill_scrape_sun_thru_thu.scrape 2021-05-13T14:20:00+00:00 [running]> ip-10-0-0-244.ec2.internal
[2021-05-13 14:35:08,450] {docker.py:289} INFO - Pulling docker image datamade/scrapers-us-municipal:staging
[2021-05-13 14:35:08,620] {docker.py:303} INFO - staging: Pulling from datamade/scrapers-us-municipal
[2021-05-13 14:35:08,623] {docker.py:303} INFO - fa1690ae9228: Already exists
[2021-05-13 14:35:08,623] {docker.py:303} INFO - 3db8d401e53a: Already exists
[2021-05-13 14:35:08,623] {docker.py:303} INFO - 38693bd6abb8: Already exists
[2021-05-13 14:35:08,624] {docker.py:303} INFO - ae36bfa465c0: Already exists
[2021-05-13 14:35:08,629] {docker.py:303} INFO - 9d8d6d4d2a2f: Already exists
[2021-05-13 14:35:08,636] {docker.py:303} INFO - 79aa5350b4cc: Already exists
[2021-05-13 14:35:08,642] {docker.py:303} INFO - e46a78295aa2: Already exists
[2021-05-13 14:35:08,647] {docker.py:303} INFO - e68197e8edf7: Already exists
[2021-05-13 14:35:08,651] {docker.py:303} INFO - fdac76b4ec9f: Pulling fs layer
[2021-05-13 14:35:08,652] {docker.py:303} INFO - afabce70ec4d: Pulling fs layer
[2021-05-13 14:35:08,652] {docker.py:303} INFO - f9a8849b5b83: Pulling fs layer
[2021-05-13 14:35:08,652] {docker.py:303} INFO - 9dc1b5f247ea: Pulling fs layer
[2021-05-13 14:35:08,652] {docker.py:303} INFO - 8b1ca0c8212d: Pulling fs layer
[2021-05-13 14:35:08,652] {docker.py:303} INFO - 8b1ca0c8212d: Download complete
[2021-05-13 14:35:08,652] {docker.py:303} INFO - fdac76b4ec9f: Extracting
[2021-05-13 14:35:08,653] {docker.py:303} INFO - afabce70ec4d: Download complete
[2021-05-13 14:35:08,653] {docker.py:303} INFO - f9a8849b5b83: Download complete
[2021-05-13 14:35:08,653] {docker.py:303} INFO - 9dc1b5f247ea: Download complete
[2021-05-13 14:35:08,735] {docker.py:303} INFO - fdac76b4ec9f: Pull complete
[2021-05-13 14:35:08,743] {docker.py:303} INFO - afabce70ec4d: Extracting
[2021-05-13 14:35:09,216] {docker.py:303} INFO - afabce70ec4d: Pull complete
[2021-05-13 14:35:09,225] {docker.py:303} INFO - f9a8849b5b83: Extracting
[2021-05-13 14:35:09,423] {docker.py:303} INFO - f9a8849b5b83: Pull complete
[2021-05-13 14:35:09,430] {docker.py:303} INFO - 9dc1b5f247ea: Extracting
[2021-05-13 14:35:09,561] {docker.py:303} INFO - 9dc1b5f247ea: Pull complete
[2021-05-13 14:35:09,567] {docker.py:303} INFO - 8b1ca0c8212d: Extracting
[2021-05-13 14:35:09,759] {docker.py:303} INFO - 8b1ca0c8212d: Pull complete
[2021-05-13 14:35:09,769] {docker.py:298} INFO - Digest: sha256:29a795c741aade8c7aca36e7d5fd003978f77ced5d1ba80a00f352f1493b5ae9
[2021-05-13 14:35:09,773] {docker.py:298} INFO - Status: Downloaded newer image for datamade/scrapers-us-municipal:staging
[2021-05-13 14:35:09,773] {docker.py:224} INFO - Starting docker container from image datamade/scrapers-us-municipal:staging
[2021-05-13 14:35:12,391] {docker.py:263} INFO - 05/13/2021 09:35:12 INFO pupa_settings.OCDClient: Raven is not configured (logging is disabled). Please see the documentation for more information.
[2021-05-13 14:35:13,385] {docker.py:263} INFO - Operations to perform:
  Apply all migrations: contenttypes, core, legislative, pupa
[2021-05-13 14:35:13,549] {docker.py:263} INFO - Running migrations:
  No migrations to apply.
[2021-05-13 14:35:20,048] {docker.py:263} INFO - Killed 
[2021-05-13 14:35:21,230] {taskinstance.py:1145} ERROR - 404 Client Error: Not Found ("No such container: cbb0925580a5258370019084dca4c01adb4fb9062d5cf0ca7710c82f78ef23f6")
Traceback (most recent call last):
  File "/home/datamade/la-metro-dashboard/plugins/../operators/blackbox_docker_operator.py", line 47, in _run_image
    return super()._run_image()
  File "/home/datamade/.virtualenvs/la-metro-dashboard/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 269, in _run_image
    raise AirflowException('docker container failed: ' + repr(result))
airflow.exceptions.AirflowException: docker container failed: {'Error': None, 'StatusCode': 137}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/datamade/.virtualenvs/la-metro-dashboard/lib/python3.6/site-packages/docker/api/client.py", line 261, in _raise_for_status
    response.raise_for_status()
  File "/home/datamade/.virtualenvs/la-metro-dashboard/lib/python3.6/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.35/containers/cbb0925580a5258370019084dca4c01adb4fb9062d5cf0ca7710c82f78ef23f6?v=False&link=False&force=False

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/datamade/.virtualenvs/la-metro-dashboard/lib/python3.6/site-packages/airflow/models/taskinstance.py", line 978, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/home/datamade/.virtualenvs/la-metro-dashboard/lib/python3.6/site-packages/airflow/providers/docker/operators/docker.py", line 307, in execute
    return self._run_image()
  File "/home/datamade/la-metro-dashboard/plugins/../operators/blackbox_docker_operator.py", line 51, in _run_image
    self.cli.remove_container(self.container['Id'])
  File "/home/datamade/.virtualenvs/la-metro-dashboard/lib/python3.6/site-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/home/datamade/.virtualenvs/la-metro-dashboard/lib/python3.6/site-packages/docker/api/container.py", line 992, in remove_container
    self._raise_for_status(res)
  File "/home/datamade/.virtualenvs/la-metro-dashboard/lib/python3.6/site-packages/docker/api/client.py", line 263, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/home/datamade/.virtualenvs/la-metro-dashboard/lib/python3.6/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.NotFound: 404 Client Error: Not Found ("No such container: cbb0925580a5258370019084dca4c01adb4fb9062d5cf0ca7710c82f78ef23f6")
[2021-05-13 14:35:21,249] {taskinstance.py:1202} INFO - Marking task as FAILED.dag_id=windowed_bill_scrape_sun_thru_thu, task_id=scrape, execution_date=20210513T142000, start_date=20210513T143507, end_date=20210513T143521
[2021-05-13 14:35:23,245] {logging_mixin.py:112} INFO - [2021-05-13 14:35:23,244] {local_task_job.py:103} INFO - Task exited with return code 1
```
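
For anyone triaging a similar failure: the `Killed` line followed by `StatusCode: 137` is 128 + SIGKILL, which on a memory-constrained host usually means the kernel OOM killer ended the process mid-load. If the container hasn't been cleaned up yet (here it had been, hence the 404 during removal), docker-py can confirm that. A hypothetical check, with an illustrative container id:

```python
# Hypothetical triage snippet using docker-py, the same client library the
# operator uses. Docker records whether the kernel OOM killer was responsible
# in the container's State.
import docker

client = docker.from_env()
state = client.api.inspect_container('cbb0925580a5')['State']  # id is illustrative
print('exit code:', state['ExitCode'])    # 137 == 128 + SIGKILL
print('OOM killed:', state['OOMKilled'])  # True if the kernel OOM killer fired
```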
fgregg commented 3 years ago

`dbinit` calls `loaddivisions`, which comes from python-opencivicdata. If you look at the guts of that management command, much of the work is ultimately done by this class: https://github.com/opencivicdata/python-opencivicdata/blob/master/opencivicdata/divisions.py#L16

Digging deeper, you can set an environment variable pointing to a file path from which to load the divisions, instead of loading all known opencivicdata divisions from GitHub.

https://github.com/opencivicdata/python-opencivicdata/blob/master/opencivicdata/divisions.py#L24

We could use that to create a custom CSV that has just the divisions we need for LA Metro.
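
A rough sketch of that approach (the variable name and exactly how it's consumed are assumptions read off the linked line; verify against divisions.py before relying on this):

```python
# Sketch: point the division loader at a small local CSV instead of the full
# country file on GitHub. OCD_DIVISION_CSV is the variable name assumed from
# the linked divisions.py line; check the source for the exact name and
# whether the value needs a country placeholder.
import os

# Hypothetical hand-maintained CSV containing only the divisions Metro needs.
os.environ['OCD_DIVISION_CSV'] = '/app/la-metro-divisions.csv'

# With the variable exported before `pupa dbinit us` runs, loaddivisions
# would read this file rather than downloading every known US division.
```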

hancush commented 3 years ago

Thanks for the tip, @fgregg! python-opencivicdata will only skip loading if the incoming divisions exactly match the database, so I submitted a PR to skip loading if the incoming set is a subset of the database contents: https://github.com/opencivicdata/python-opencivicdata/pull/139

Care to have a look?
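
For reference, the behavior the PR aims for could look roughly like this (a sketch, not the PR's actual code; `opencivicdata.core.models` is the import path the package's core app exposes):

```python
# Rough sketch of skip-if-subset: only reload divisions when the incoming
# set contains ids the database doesn't already have.
from opencivicdata.core.models import Division


def divisions_already_loaded(incoming_ids):
    """Return True when every incoming division id already exists in the
    database, in which case the delete-and-reload can be skipped."""
    existing_ids = set(Division.objects.values_list('id', flat=True))
    return set(incoming_ids) <= existing_ids
```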