An analysis job for the city of Milton Keynes in the UK crashed somewhat late, when its status was IMPORTING, and left its record in the database in a state where any attempt to load it would cause the API to crash.
The error from Django looked like this:
```
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/django/db/models/query.py", line 256, in __repr__
    data = list(self[:REPR_OUTPUT_SIZE + 1])
  File "/usr/local/lib/python3.10/site-packages/django/db/models/query.py", line 280, in __iter__
    self._fetch_all()
  File "/usr/local/lib/python3.10/site-packages/django/db/models/query.py", line 1324, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/usr/local/lib/python3.10/site-packages/django/db/models/query.py", line 51, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
  File "/usr/local/lib/python3.10/site-packages/django/db/models/sql/compiler.py", line 1175, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/lib/python3.10/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/usr/local/lib/python3.10/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/usr/local/lib/python3.10/site-packages/django/db/backends/utils.py", line 79, in _execute
    with self.db.wrap_database_errors:
  File "/usr/local/lib/python3.10/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/usr/local/lib/python3.10/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.DataError: cannot cast jsonb string to type integer
```
The CloudWatch logs are here and show a similar error (the call stack is different but the ultimate error message is the same).
I've deleted the job to resolve the API errors, but I pulled the `overall_scores` value first. It looked like this:

```json
{
  "people": { "score_original": 0, "score_normalized": 0 },
  "retail": { "score_original": 0, "score_normalized": 0 },
  "transit": { "score_original": 0, "score_normalized": 0 },
  "recreation": { "score_original": 0, "score_normalized": 0 },
  "opportunity": { "score_original": 0, "score_normalized": 0 },
  "core_services": { "score_original": 0, "score_normalized": 0 },
  "overall_score": { "score_original": 0, "score_normalized": 0 },
  "population_total": { "score_original": "", "score_normalized": "" },
  "recreation_parks": { "score_original": 0, "score_normalized": 0 },
  "recreation_trails": { "score_original": 0, "score_normalized": 0 },
  "core_services_doctors": { "score_original": 0, "score_normalized": 0 },
  "core_services_grocery": { "score_original": 0, "score_normalized": 0 },
  "core_services_dentists": { "score_original": 0, "score_normalized": 0 },
  "opportunity_employment": { "score_original": 0, "score_normalized": 0 },
  "total_miles_low_stress": { "score_original": 1719.368, "score_normalized": 1719.4 },
  "core_services_hospitals": { "score_original": 0, "score_normalized": 0 },
  "total_miles_high_stress": { "score_original": 661.978, "score_normalized": 662 },
  "core_services_pharmacies": { "score_original": 0, "score_normalized": 0 },
  "opportunity_k12_education": { "score_original": 0, "score_normalized": 0 },
  "opportunity_higher_education": { "score_original": 0, "score_normalized": 0 },
  "recreation_community_centers": { "score_original": 0, "score_normalized": 0 },
  "core_services_social_services": { "score_original": 0, "score_normalized": 0 },
  "opportunity_technical_vocational_college": { "score_original": 0, "score_normalized": 0 }
}
```

So clearly this job had problems, since there were so many things missing, though it's interesting that a few things worked. I think the issue was probably with the `"population_total": { "score_original": "" }` part. `score_normalized` for population is apparently always an empty string, but `score_original` is supposed to be an integer.

The first thing to do is probably to try to run the same job locally and see if the scores come out similarly broken. The neighborhood can be pulled from the production site/S3 bucket, and I copied out the other parameters of the job as well:

- OSM extract: https://download.geofabrik.de/europe/great-britain/england-latest.osm.pbf, but we should use https://pfb-public-documents.s3.amazonaws.com/osm_data/england-2022-09-12.osm.pbf, which is the same file copied to S3
- Population file: https://pfb-public-documents.s3.amazonaws.com/population/tabblock2010_99_pophu.zip
- Omit jobs data in analysis: checked
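Separately from re-running the job, the bad value is easy to detect mechanically. A small validator along these lines (a hypothetical helper, not part of the codebase — `find_bad_scores` is made up for illustration) could flag entries whose `score_original` is not numeric before the record ever hits a query that casts it:

```python
# Hypothetical pre-import check: flag metrics in overall_scores whose
# "score_original" is not a number. An empty string here is exactly the
# kind of value that makes Postgres fail with
# "cannot cast jsonb string to type integer" at query time.

def find_bad_scores(overall_scores):
    """Return the metric names whose score_original is not numeric."""
    return [
        metric
        for metric, scores in overall_scores.items()
        if not isinstance(scores.get("score_original"), (int, float))
    ]

# Trimmed-down version of the value pulled from the broken job:
example = {
    "people": {"score_original": 0, "score_normalized": 0},
    "population_total": {"score_original": "", "score_normalized": ""},
    "total_miles_low_stress": {"score_original": 1719.368, "score_normalized": 1719.4},
}
print(find_bad_scores(example))  # ['population_total']
```

Something like this could run at the end of the analysis (or at import time) and fail the job loudly instead of storing an unqueryable record.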
If it does fail in the same way, we should also look into why this particular type of failure caused the API to crash, rather than going into a failed state and getting ignored. It might make sense to have some sort of failsafe that can catch a crash and clean up in a reliable way, but it's hard to know what's possible unless we manage to reproduce it. If this does seem feasible and worth doing, it should probably be spun off into another issue.
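If that does get spun off, the failsafe could be as simple as wrapping the import step. A minimal sketch, assuming the import is an ordinary callable — the `Job` class and status strings below are stand-ins, not the real models:

```python
# Sketch of the failsafe idea: any crash during import marks the job
# FAILED instead of leaving a half-imported record that breaks the API
# on load. Job and its statuses are stand-ins for illustration.

class Job:
    IMPORTING = "IMPORTING"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"

    def __init__(self):
        self.status = None

def run_import(job, import_fn):
    """Run import_fn(job), marking the job FAILED on any exception."""
    job.status = Job.IMPORTING
    try:
        import_fn(job)
    except Exception:
        # A real version would also log the error and clean up any
        # partially written rows before re-raising.
        job.status = Job.FAILED
        raise
    job.status = Job.COMPLETE

def broken_import(job):
    raise ValueError("cannot cast jsonb string to type integer")

job = Job()
try:
    run_import(job, broken_import)
except ValueError:
    pass
print(job.status)  # FAILED
```

The open question from reproduction is whether the crash happens somewhere this kind of wrapper can actually catch it, which is why it's worth confirming locally first.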