azavea / pfb-network-connectivity

PFB Bicycle Network Connectivity

Analysis job crashed during score import and caused API errors #907

Closed KlaasH closed 1 year ago

KlaasH commented 1 year ago

An analysis job for the city of Milton Keynes in the UK crashed somewhat late, when its status was IMPORTING, and left its record in the database in a state where any attempt to load it would cause the API to crash.

The error from Django looked like this:

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/django/db/models/query.py", line 256, in __repr__
    data = list(self[:REPR_OUTPUT_SIZE + 1])
  File "/usr/local/lib/python3.10/site-packages/django/db/models/query.py", line 280, in __iter__
    self._fetch_all()
  File "/usr/local/lib/python3.10/site-packages/django/db/models/query.py", line 1324, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/usr/local/lib/python3.10/site-packages/django/db/models/query.py", line 51, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/usr/local/lib/python3.10/site-packages/django/db/models/sql/compiler.py", line 1175, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/lib/python3.10/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/usr/local/lib/python3.10/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/usr/local/lib/python3.10/site-packages/django/db/backends/utils.py", line 79, in _execute
    with self.db.wrap_database_errors:
  File "/usr/local/lib/python3.10/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/usr/local/lib/python3.10/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.DataError: cannot cast jsonb string to type integer
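For reference, that DataError is Postgres refusing to cast a jsonb string value (such as "") to an integer. A minimal sketch of the kind of queryset that can trigger it, assuming the job model is AnalysisJob with a JSON field named overall_scores (the actual query the API runs may differ):

from django.db.models import IntegerField
from django.db.models.fields.json import KeyTransform
from django.db.models.functions import Cast

from pfb_analysis.models import AnalysisJob  # assumed import path

# Casting overall_scores -> 'population_total' -> 'score_original' to integer
# is fine when the stored value is a jsonb number, but raises
# "cannot cast jsonb string to type integer" when it is a jsonb string like "".
jobs = AnalysisJob.objects.annotate(
    population=Cast(
        KeyTransform(
            "score_original",
            KeyTransform("population_total", "overall_scores"),
        ),
        IntegerField(),
    )
)
list(jobs)  # evaluating the queryset runs the SQL and surfaces the DataError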

The CloudWatch logs are here and show a similar error (the call stack is different but the ultimate error message is the same).

I've deleted the job to resolve the API errors, but I pulled the overall_scores value first. It looked like this:

{
    "people": { "score_original": 0, "score_normalized": 0 },
    "retail": { "score_original": 0, "score_normalized": 0 },
    "transit": { "score_original": 0, "score_normalized": 0 },
    "recreation": { "score_original": 0, "score_normalized": 0 },
    "opportunity": { "score_original": 0, "score_normalized": 0 },
    "core_services": { "score_original": 0, "score_normalized": 0 },
    "overall_score": { "score_original": 0, "score_normalized": 0 },
    "population_total": { "score_original": "", "score_normalized": "" },
    "recreation_parks": { "score_original": 0, "score_normalized": 0 },
    "recreation_trails": { "score_original": 0, "score_normalized": 0 },
    "core_services_doctors": { "score_original": 0, "score_normalized": 0 },
    "core_services_grocery": { "score_original": 0, "score_normalized": 0 },
    "core_services_dentists": { "score_original": 0, "score_normalized": 0 },
    "opportunity_employment": { "score_original": 0, "score_normalized": 0 },
    "total_miles_low_stress": { "score_original": 1719.368, "score_normalized": 1719.4 },
    "core_services_hospitals": { "score_original": 0, "score_normalized": 0 },
    "total_miles_high_stress": { "score_original": 661.978, "score_normalized": 662 },
    "core_services_pharmacies": { "score_original": 0, "score_normalized": 0 },
    "opportunity_k12_education": { "score_original": 0, "score_normalized": 0 },
    "opportunity_higher_education": { "score_original": 0, "score_normalized": 0 },
    "recreation_community_centers": { "score_original": 0, "score_normalized": 0 },
    "core_services_social_services": { "score_original": 0, "score_normalized": 0 },
    "opportunity_technical_vocational_college": { "score_original": 0, "score_normalized": 0 }
}

So this job clearly had problems, since nearly every score is zero or empty, though it's interesting that a few values (the two mileage totals) did come through. I think the issue was probably the "population_total": { "score_original": "" } entry: score_normalized for population is apparently always an empty string, but score_original is supposed to be an integer.
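As a quick check on that diagnosis, a throwaway snippet like this (purely ad hoc, reading the JSON above from a local file) confirms that population_total is the only entry whose score_original is not numeric:

import json

# Hypothetical local dump of the overall_scores value shown above
with open("overall_scores.json") as f:
    overall_scores = json.load(f)

non_numeric = {
    key: entry["score_original"]
    for key, entry in overall_scores.items()
    if not isinstance(entry.get("score_original"), (int, float))
}
print(non_numeric)  # expected: {'population_total': ''}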

The first thing to do is probably to try running the same job locally and see whether the scores come out similarly broken. The neighborhood can be pulled from the production site/S3 bucket, and I copied out the other parameters of the job as well:

- OSM extract: https://download.geofabrik.de/europe/great-britain/england-latest.osm.pbf (but we should use https://pfb-public-documents.s3.amazonaws.com/osm_data/england-2022-09-12.osm.pbf, which is the same file copied to S3)
- Population file: https://pfb-public-documents.s3.amazonaws.com/population/tabblock2010_99_pophu.zip
- Omit jobs data in analysis: checked

If it does fail in the same way, we should also look into why this particular type of failure caused the API to crash, rather than the job simply going into a failed state and getting ignored. It might make sense to add some sort of failsafe that can catch a crash and clean up reliably, but it's hard to know what's possible unless we manage to reproduce it. If that does seem feasible and worth doing, it should probably be spun off into another issue.
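As a rough illustration of what that failsafe might look like (a sketch only; the import_results method, status value, and field names here are assumptions, not necessarily what the job model actually exposes):

import logging

logger = logging.getLogger(__name__)

def import_scores_safely(job):
    """Run the score import for a job, marking it failed instead of
    leaving a half-imported record that breaks later API queries."""
    try:
        job.import_results()  # assumed name for the import step
    except Exception:
        logger.exception("Score import failed for job %s", job.pk)
        job.overall_scores = {}  # drop any partially written scores
        job.status = "ERROR"  # assumed status value
        job.save(update_fields=["overall_scores", "status"])

Whether clearing overall_scores is the right cleanup would depend on where in the import the crash happens, which is another reason to try to reproduce it first.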

KlaasH commented 1 year ago

Resolved by #913