ebmdatalab / openprescribing

A Django app providing a REST API and dashboards for the HSCIC's GP prescribing data
https://openprescribing.net
MIT License
97 stars 26 forks source link

Data import problem record #571

Open inglesp opened 7 years ago

inglesp commented 7 years ago

This issue is for tracking problems with importing data, and recording the manual steps taken to resolve them.

inglesp commented 7 years ago

Prescribing date: 2017_06

Problem: Check at top of bigquery_upload.py failed, because there are no PracticeStatistics objects from 2017_06.

Investigation:

Steps to resolve:

inglesp commented 7 years ago

Prescribing date: 2017_07

Problem: Running the July import jumped straight to updating the smoke tests.

Investigation:

Steps to resolve:

inglesp commented 7 years ago

Prescribing date: 2017_07

Problem: hscic_list_sizes failed with Could not find any data for 2017_07.

Investigation:

Steps to resolve:

inglesp commented 7 years ago

Prescribing date: 2017_07

Problem: import_ppu_savings failed with a couple of typos

Steps to resolve:

inglesp commented 7 years ago

Prescribing date: 2017_08

Problem: hscic_prescribing failed.

Investigation:

The command searches for the most recent data at http://content.digital.nhs.uk/searchcatalogue?q=title%3a%22presentation+level+data%22&sort=Most+recent, but this now redirects to https://digital.nhs.uk/article/191/Find-data-and-publications so nothing is found.

Steps to resolve:

Seb has found https://digital.nhs.uk/pubsearch?q=presentation+level+data&s=s which lists data up to July. As the August data has not yet been published, I've made the task a no-op for now, and will fix it to retrieve data from the new list later.

inglesp commented 7 years ago

Prescribing date: 2017_08

Problem: convert_hscic_prescribing failed with a TimeoutError.

Investigation:

The timeout was while waiting for this query-and-insert in append_aggregated_data_to_prescribing_table to finish. The timeout was very low (60s) from when I was testing the new BQ stuff, and I hadn't revisited it to increase it to something sensible.

Steps to resolve:

inglesp commented 7 years ago

Prescribing date: 2017_08

Problem: import_dmd_snomed failed with an ImportError

Steps to resolve:

Fixed in 269c10aba0df9159028e1581a30928ad5b9ac00c.

inglesp commented 7 years ago

Prescribing date: 2017_08

Problem: bigquery_upload failed with a couple of noddy errors caused by careless refactoring

Steps to resolve:

inglesp commented 6 years ago

Prescribing date: 2017_09

Problem: Data imported for 2017_09 was identical to data for 2017_08.

See #644 for discussion.

inglesp commented 6 years ago

Prescribing date: 2017_09

Problem: fetch_prescribing_metadata could not find any data about practices.

Steps to resolve:

I'd misspelt a filename. Fixed in #653.

inglesp commented 6 years ago

Prescribing date: 2017_09

Problem: backup_prescribing_table failed.

Investigation:

This is the second month that the task has run. It failed when looking at the backups created in the first month.

Steps to resolve:

Fixed in #654.

inglesp commented 6 years ago

Prescribing date: 2017_09

Problem: bigquery_upload failed when updating the BNF data.

Steps to resolve:

Fixed in #655.

inglesp commented 6 years ago

Prescribing date: 2017_09

Problem: import_measures failed with a 500 error

Investigation:

Nothing is logged at INFO level, so it was hard to work out how far import_measures had got. But nothing had been written to the database, so it was safe to re-run the command.

>>> MeasureValue.objects.filter(month='2017-09-01').count()
0
>>> MeasureGlobal.objects.filter(month='2017-09-01').count()
0

Steps to resolve:

Re-ran the pipeline.

inglesp commented 6 years ago

Prescribing date: 2017_09

Problem: import_measures failed when running create_omnibus_lp_view().

Investigation:

This was because the SQL used to create the view used to include the project name, which I recently removed. All the other SQL files in the measure_sql directory are used for normal queries and so don't require the project name, but this SQL is used to create a table with a view, and so does require the project name. I had missed this, as create_omnibus_lp_view didn't use gcutils.bigquery.Client.

Steps to resolve:

Fixed in #658. Hacked import_measures to skip measures before lpzomnibus, and re-ran the pipeline.

sebbacon commented 6 years ago

Prescribing date: 2017_10

Problem: import_measures failed when making lpzomnibus measure.

Investigation:

Basically bugs introduced in recent refactoring.

Steps to resolve:

Patch (to be applied) https://gist.github.com/sebbacon/2faed130e5eb5612fd4563a7408de55e

sebbacon commented 6 years ago

Prescribing date: 2017_10

Problem: BigQuery availability issues


 File "/webapps/openprescribing/openprescribing/frontend/management/commands/import_measures.py", line 365, in insert_rows_from_query
    legacy=legacy
  File "/webapps/openprescribing/openprescribing/gcutils/bigquery.py", line 215, in insert_rows_from_query
    job.begin()
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/google/cloud/bigquery/job.py", line 380, in begin
    method='POST', path=path, data=self._build_resource())
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/google/cloud/_http.py", line 290, in api_request
    headers=headers, target_object=_target_object)
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/google/cloud/_http.py", line 183, in _make_request
    return self._do_request(method, url, headers, data, target_object)
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/google/cloud/_http.py", line 212, in _do_request
    url=url, method=method, headers=headers, data=data)
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/google/auth/transport/requests.py", line 179, in request
    method, url, data=data, headers=request_headers, **kwargs)
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/requests/adapters.py", line 508, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.googleapis.com', port=443): Max retries exceeded with url: /bigquery/v2/projects/ebmdatalab/jobs (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f4d0923f450>: Failed to establish a new connection: [Errno 101] Network is unreachable',))```
inglesp commented 6 years ago

Prescribing date: 2017_10

Problem: Datasets too large to sort in BQ

See #699 and #722.

inglesp commented 6 years ago

Prescribing date: 2017_11

Problem: Hit IntegrityError in import_dmd

Investigation:

Here's the traceback:

Traceback (most recent call last):
  File "manage.py", line 12, in <module>
    execute_from_command_line(sys.argv)
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 353, in execute_from_command_line
    utility.execute()
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 345, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/django/core/management/base.py", line 348, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/django/core/management/base.py", line 399, in execute
    output = self.handle(*args, **options)
  File "/webapps/openprescribing/openprescribing/pipeline/management/commands/fetch_and_import.py", line 12, in handle
    run_all(kwargs['year'], kwargs['month'])
  File "/webapps/openprescribing/openprescribing/pipeline/runner.py", line 432, in run_all
    run_task(task, year, month)
  File "/webapps/openprescribing/openprescribing/pipeline/runner.py", line 403, in run_task
    task.run(year, month, **kwargs)
  File "/webapps/openprescribing/openprescribing/pipeline/runner.py", line 245, in run
    call_command(*tokens)
  File "/webapps/openprescribing/openprescribing/pipeline/runner.py", line 383, in call_command
    return django_call_command(*args)
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 119, in call_command
    return command.execute(*args, **defaults)
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/django/core/management/base.py", line 399, in execute
    output = self.handle(*args, **options)
  File "/webapps/openprescribing/openprescribing/dmd/management/commands/import_dmd.py", line 328, in handle
    create_dmd_product()
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/django/db/transaction.py", line 223, in __exit__
    connection.commit()
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 242, in commit
    self._commit()
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 211, in _commit
    return self.connection.commit()
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/django/db/utils.py", line 95, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/webapps/openprescribing/.venv/local/lib/python2.7/site-packages/django/db/backends/base/base.py", line 211, in _commit
    return self.connection.commit()
django.db.utils.IntegrityError: update or delete on table "dmd_product" violates foreign key constraint "dmd_tariffprice_product_id_39bb1b2e_fk_dmd_product_dmdid" on table "dmd_tariffprice"
DETAIL:  Key (dmdid)=(329298009) is still referenced from table "dmd_tariffprice".

Steps to resolve:

inglesp commented 6 years ago

Prescribing date: 2018_01

Problem: import_dmd_snomed failed because manually-downloaded zip file hadn't been extracted.

Steps to resolve: Unzipped zip file manually and tried again.

inglesp commented 6 years ago

Prescribing date: 2018_01

Problem: import_dmd_snomed failed because column headers in spreadsheet had changed.

Steps to resolve: Loosened column header assertions. See https://github.com/ebmdatalab/openprescribing/commit/9b3390bb0d69f9f2cf887863668924117d85635c.

inglesp commented 6 years ago

Prescribing date: 2018_01

Problem: import_dmd_snomed failed with django.db.utils.DataError: invalid input syntax for integer: "'7264811000001106"

Investigation:

inglesp commented 6 years ago

Prescribing date: 2018_02

Problem: import_ppu_savings failed. This get_or_create raised:

django.db.utils.IntegrityError: duplicate key value violates unique constraint "dmd_product_temp_pkey"
DETAIL:  Key (dmdid)=(10000000000) already exists.

Investigation:

inglesp commented 6 years ago

Prescribing date: 2018_03

Problem: Fetchers whose data comes from NHS Digital failed because their URLs had changed. See https://digital.nhs.uk/services/organisation-data-service/data-downloads for more.

Steps to fix: See https://github.com/ebmdatalab/openprescribing/pull/855.

inglesp commented 6 years ago

Prescribing date: 2018_04

Problem: URL for patient list size data was not as expected.

One of these things is not like the others:

Steps to fix:

evansd commented 6 years ago

Prescribing date: 2018_05

Problem: The BNF code download had already been run and this resulted in confusing error when we tried to run it again (see #941)

Steps to fix: Do nothing and move on

evansd commented 6 years ago

Prescribing date: 2018_07

Problem: The practice list size file was misnamed in the July dataset. It omits the month and year in the name, whereas filenames for previous and subsequent months all contain it.

Steps to fix:

inglesp commented 6 years ago

Prescribing date: 2018_08

Problem: import_dmd failed when trying to delete a record from dmd_product because that record was referenced in dmd_tariffprice.

Investigation: The product in question is "Glycerol 4g suppositories" which has VPID 317609008. In the latest data, the VPID has changed to 35368211000001106. See #1126.

Steps to fix:

inglesp commented 6 years ago

Prescribing date: 2018_08

Problem: import_adqs failed with an exception that was caused by a bug that was fixed in #1083... except that #1083 hadn't been merged.

Steps to fix:

Note: This would have been caught had the end-to-end tests been running nightly. The weren't, so it wasn't. My fault, sorry.

inglesp commented 5 years ago

Prescribing date: 2018_09

Problem: import_dmd failed in the same way as it did last month. This is no surprise as I haven't fixed #1126.

inglesp commented 5 years ago

Prescribing date: 2018_10

Problem: fetch_prescribing_metadata_addresses and fetch_prescribing_metadata_chemicals failed because there was no data to fetch.

Investigation:

In fact, they've not fetched anything since 2018_06, but only fail if they've not been able to fetch data for the past six months.

We expect the data to be at a URL like http://datagov.ic.nhs.uk/presentation/2017_08_August/T201708ADDR+BNFT.CSV, and as far as I'm aware, we don't know where links to the data. It doesn't appear in the search at data.gov.uk.

The data is not essential for the main parts of the site. I believe it contains details of some practices before it appears in epraccur.csv, and details of names of chemicals by BNF prefix.

For now, I've marked the two tasks as successful, so we can continue with the rest of the import.

See https://github.com/ebmdatalab/openprescribing/issues/1291.

inglesp commented 5 years ago

Prescribing date: 2018_11

Problem: https://github.com/ebmdatalab/openprescribing/issues/571#issuecomment-448749032 again.

inglesp commented 5 years ago

Prescribing date: 2018_12

Problem: fetch_patient_list_size failed because the file it was expecting to get had an unexpected name. From https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice/december-2018 we expect to find a link to a file called gp-reg-pat-prac-quin-age-dec-18.csv, but instead it was called gp-reg-pat-prac-quin-age.csv.

Steps to fix:

inglesp commented 5 years ago

Prescribing date: 2018_12

Problem: fetch_prescribing_metadata failed because of a bug.

Steps to fix:

Notes:

inglesp commented 5 years ago

Prescribing date: 2018_12

Problem: import_dmd failed because it tried to import some old data that it had previously failed to import. See this import problem record and #1126. After fixing #1126 we've imported the latest dm+d data. The old data included VMPs whose IDs have been updated in the most recent data, and so did not include VMPs with the new ID, which are now referenced by TariffPrice records. This caused the import of old data to raise an IntegrityError.

Steps to fix:

Notes:

inglesp commented 5 years ago

Prescribing date: 2019_01

Problem: fetch_patient_list_size failed for same reason as it did last month

Steps to fix: As last month, downloaded file manually.

Notes: See #1577.

inglesp commented 5 years ago

Prescribing date: 2019_01

Problem: fetch_prescribing_metadata_addresses failed because the fetcher expected a .CSV extension, but the file had a .csv extension.

Steps to fix: Patched fetch_prescribing_metadata_addresses to ignore the case of the filename in a regex. See #1578.

inglesp commented 5 years ago

Prescribing date: 2019_01

Problem: import_measures failed after getting a 500 response from BQ when importing the vitb measure.

Steps to fix: Since vitb is the last measure, I imported it by itself, marked the task as having succeeded, and restarted the pipeline.

inglesp commented 5 years ago

Prescribing date: 2019_04

Problem: create_bq_views failed because of a missing project name in vw__ghost_generic_measure.sql

Steps to fix: Fixed in a4cd657.

inglesp commented 5 years ago

Prescribing date: 2019_04

Problem: import_measures failed for bdzper1000 and fungal with DataError: invalid input syntax for type json

Investigation:

Steps to fix: Fixed in 3c136a1.

inglesp commented 5 years ago

Prescribing date: 2019_04

Problem: import_measures failed for ghost_generic_measure with DataError: invalid input syntax for type json

Investigation: See #1764

Steps to fix: Hacked import_measures so that it would ignore this measure and restarted the pipeline. We should probably fix #1764 by ignoring prescribing with zero items, and will need to reimport this measure once that's done.

inglesp commented 5 years ago

Prescribing date: 2019_04

Problem: matrixstore_build is going to take a few days to run. (Why? It didn't last time.) create_views won't run until matrixstore_build is finished. The concessions pages are broken because create_views hasn't run. (#1763)

Steps to fix:

inglesp commented 5 years ago

Prescribing date: 2019_05

Problem: Deadlock, twice, during drop_partition() in import_hscic_prescribing. In both cases, the other process was a user request for a practice homepage in ncso_spending_for_entity().

Steps to fix: Restarted the pipeline, and it worked the third time. This should be safe, as drop_partition() is the first function to do any work.

inglesp commented 5 years ago

Prescribing date: 2019_05

Problem: import_measures ran, but there are no measures in the database!

Investigation: See #1889.

inglesp commented 5 years ago

Prescribing date: 2019_06

Problem: The pipeline runner crashed with:

OSError: [Errno 13] Permission denied: '/tmp/numbers-checker/20190821140722/numbers.json'

Steps to fix: Deleted /tmp/numbers-checker/*/numbers.json manually, and created last task manually with TaskLog.objects.create(year=2019, month=6, task_name="fetch_and_import", status=TaskLog.SUCCESSFUL).

See #2011.

inglesp commented 4 years ago

Prescribing data: 2019_10

Problem: Could not download PCN data. https://files.digital.nhs.uk/assets/ods/current/ePCN.zip returns a 403.

Steps to fix: It's half four on the Friday before Christmas. I marked the task as successful and restarted the pipeline.

Update: it's now hosted at https://digital.nhs.uk/binaries/content/assets/website-assets/services/ods/data-downloads-other-nhs-organisations/epcn.zip. Import it, then reimport measures for PCNs.

Update 2: I didn't reimport measures for PCNs, but have fixed this in #2389 for the 2019_11 import.

inglesp commented 4 years ago

Prescribing date: 2019_12

Problem: The format of the PCN spreadsheet has changed slightly.

Steps to fix:

Patch import_pcns.py on the server and run again.

diff --git a/openprescribing/pipeline/management/commands/import_pcns.py b/openprescribing/pipeline/management/commands/import_pcns.py
index 5f10dddf..72bf274a 100644
--- a/openprescribing/pipeline/management/commands/import_pcns.py
+++ b/openprescribing/pipeline/management/commands/import_pcns.py
@@ -14,7 +14,7 @@ class Command(BaseCommand):
     def handle(self, *args, **kwargs):
         workbook = load_workbook(kwargs["filename"])

-        details_sheet = workbook.get_sheet_by_name("PCN Details")
+        details_sheet = workbook.get_sheet_by_name("PCNDetails")
         members_sheet = workbook.get_sheet_by_name("PCN Core Partner Details")
         pcn_details = {}
         for code, name in self.get_pcn_details_from_sheet(details_sheet):
@@ -50,8 +50,8 @@ class Command(BaseCommand):
         rows = ([cell.value for cell in row] for row in sheet.rows)
         headers = next(rows)

-        PRACTICE_COL = headers.index("Partner Organisation Code")
-        PCN_COL = headers.index("PCN Code")
+        PRACTICE_COL = 0
+        PCN_COL = 4

         for n, row in enumerate(rows, start=2):
             practice_code = row[PRACTICE_COL]

Have also emailed the ODS helpdesk:

Hi,

I'm a programmer working on OpenPrescribing, and we import ODS data monthly to help power various tools on our site.

This month our automatic importer broke because the format of the epcn.xls spreadsheet (linked to from the bottom of the GP and GP practice related data) has changed slightly, deviating from the specification in "ePCN.xlsx Specification.pdf".

  • Firstly, the tab which was previously called "PCN Details" is now called "PCNDetails".
  • Secondly, on the "PCN Core Partner Details" tab the text in the column headings is now separated by new lines rather than by spaces.

Both these changes are insignificant for humans reading the data, but confuse machines. If data is not machine readable, then it can't be easily reused in other tools.

Please could you try to ensure that the format of epcn.xls stays consistent?

Thanks,

Peter

inglesp commented 4 years ago

Prescribing date: 2020_07

Problem: import_measures failed with:

  File "/webapps/openprescribing/openprescribing/frontend/management/commands/import_measures.py", line 142, in handle
    self.build_measures(measure_defs, start_date, end_date, verbose, options)
  File "/webapps/openprescribing/openprescribing/frontend/management/commands/import_measures.py", line 119, in build_measures
    calcuation.calculate(options["bigquery_only"])
  File "/webapps/openprescribing/openprescribing/frontend/management/commands/import_measures.py", line 613, in calculate
    self.calculate_practices(bigquery_only=bigquery_only)
  File "/webapps/openprescribing/openprescribing/frontend/management/commands/import_measures.py", line 631, in calculate_practices
    self.write_practice_ratios_to_database()
  File "/webapps/openprescribing/openprescribing/frontend/management/commands/import_measures.py", line 738, in write_practice_ratios_to_database
    cursor.copy_expert(copy_str % ", ".join(MEASURE_FIELDNAMES), f)
psycopg2.errors.NumericValueOutOfRange: integer out of range
CONTEXT:  COPY frontend_measurevalue, line 412988: "nimodipine,Y60,E54000016,05L,U45156,M88646,2016-04-01,0.0,2.075,0.0,0.0,"

Steps to solve: See #2847.

inglesp commented 2 years ago

Prescribing date: 2020_09

Problem: hscic_list_sizes failed with:

Traceback (most recent call last):
  File "manage.py", line 26, in <module>
    execute_from_command_line(sys.argv)
  File "/webapps/openprescribing/.venv38/lib/python3.8/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/webapps/openprescribing/.venv38/lib/python3.8/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/webapps/openprescribing/.venv38/lib/python3.8/site-packages/django/core/management/base.py", line 323, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/webapps/openprescribing/.venv38/lib/python3.8/site-packages/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
  File "/webapps/openprescribing/openprescribing/pipeline/management/commands/fetch_and_import.py", line 12, in handle
    run_all(kwargs["year"], kwargs["month"])
  File "/webapps/openprescribing/openprescribing/pipeline/runner.py", line 413, in run_all
    run_task(task, year, month)
  File "/webapps/openprescribing/openprescribing/pipeline/runner.py", line 389, in run_task
    task.run(year, month, **kwargs)
  File "/webapps/openprescribing/openprescribing/pipeline/runner.py", line 215, in run
    call_command(*tokens)
  File "/webapps/openprescribing/openprescribing/pipeline/runner.py", line 376, in call_command
    return django_call_command(*args)
  File "/webapps/openprescribing/.venv38/lib/python3.8/site-packages/django/core/management/__init__.py", line 148, in call_command
    return command.execute(*args, **defaults)
  File "/webapps/openprescribing/.venv38/lib/python3.8/site-packages/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
  File "/webapps/openprescribing/openprescribing/pipeline/management/commands/hscic_list_sizes.py", line 40, in handle
    source_url = tree.xpath("//a[contains(@href, '{}')]/@href".format(filename))[0]

I can reproduce locally by running ./manage.py hscic_list_sizes 2021 9.

Steps to fix: Practice list data has been corrected and republished, so the file we're looking for wasn't listed on the site. I hacked the fetcher:

-        filename = "gp-reg-pat-prac-quin-age.csv"
+        filename = "gp-reg-pat-prac-quin-age-v2.csv"

and restarted the pipeline.