datamade / nyc-council-councilmatic

NYC Council version of Councilmatic
MIT License

Slowness of data pipeline [META] #155

Closed: reginafcompton closed this issue 5 years ago

reginafcompton commented 5 years ago

Related issues:


Preliminary observations

On January 22, we deployed NYC Councilmatic to the production server, which cleared out the downloads directory. As a result, the data pipeline cycle completed as expected: once every four hours.

Shortly thereafter, the process began to slow, with import_data executing once every one or two hours. It is not yet obvious where the cron backup occurs, nor why, though it certainly seems to be in either the updating of the Solr index or the conversion of RTF.
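
For orientation, here is a minimal sketch of the pipeline as I understand it: the three management commands run back to back on a cron schedule, so a slow step delays everything behind it. The chaining shown here is an assumption; only the command names come from our setup and the logs below.

# Hypothetical sketch of one cron cycle (the real cron entry may differ).
from django.core.management import call_command

def run_pipeline():
    call_command('import_data')   # load fresh OCD data into the raw_* tables
    call_command('update_index')  # push changed bills to the Solr index
    call_command('convert_rtf')   # convert bill full text from RTF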

reginafcompton commented 5 years ago

Looking at the logs

import_data

import_data downloads 1781 bills on 2019-01-22 at 9:35 am. After that, update_index and convert_rtf do not run, which is very strange. import_data runs again on 2019-01-22 at 12:05 pm and downloads only 84 bills.

update_index

update_index runs on 2019-01-22 at 12:06:51 pm and indexes 1598 NYC bills. The process ends at 12:08:52 pm. The next update occurs on 2019-01-22 at 13:37:14 and indexes 1607 NYC bills.

convert_rtf

convert_rtf runs on 2019-01-22 12:08:52. It finishes at 13:22:38 – so, here, the conversion script clearly causes the "backup" in the data pipeline.
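
For reference, the elapsed times implied by the timestamps quoted above (a quick check with Python's datetime):

# Elapsed times computed from the log timestamps quoted above.
from datetime import datetime

fmt = '%Y-%m-%d %H:%M:%S'

def elapsed(start, end):
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)

print(elapsed('2019-01-22 12:06:51', '2019-01-22 12:08:52'))  # update_index: ~2 minutes
print(elapsed('2019-01-22 12:08:52', '2019-01-22 13:22:38'))  # convert_rtf: ~1 hour 14 minutes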


Some questions:

reginafcompton commented 5 years ago

I found the source of the issue: import_data. Here's what's going on.

import_data determines which entities have been newly updated by querying the raw tables, i.e., the tables populated with data from the downloads directory. For example, in the case of bills, import_data executes the following queries:

Put data into change_bill table

INSERT INTO change_bill
SELECT raw.ocd_id
FROM raw_bill AS raw
JOIN councilmatic_core_bill AS dat
ON raw.ocd_id = dat.ocd_id
....where clause to handle null values....

Update existing entries in councilmatic_core_bill, if they are in the change_bill table

UPDATE councilmatic_core_bill SET
"ocd_id"=s."ocd_id", "ocd_created_at"=s."ocd_created_at", "ocd_updated_at"=s."ocd_updated_at", "description"=s."description", "identifier"=s."identifier", "bill_type"=s."bill_type", "classification"=s."classification", "source_url"=s."source_url", "source_note"=s."source_note", "from_organization_id"=s."from_organization_id", "full_text"=s."full_text", "ocr_full_text"=s."ocr_full_text", "html_text"=s."html_text", "abstract"=s."abstract", "last_action_date"=s."last_action_date", "legislative_session_id"=s."legislative_session_id", "slug"=s."slug", "updated_at"=s."updated_at"
        FROM (
                SELECT
                raw."ocd_id", raw."ocd_created_at", raw."ocd_updated_at", raw."description", raw."identifier", raw."bill_type", raw."classification", raw."source_url", raw."source_note", raw."from_organization_id", raw."full_text", raw."ocr_full_text", raw."html_text", raw."abstract", raw."last_action_date", raw."legislative_session_id", raw."slug", raw."updated_at"
              FROM raw_bill AS raw
              JOIN change_bill AS change
              ON raw.ocd_id = change.ocd_id
        ) AS s
        WHERE councilmatic_core_bill.ocd_id = s.ocd_id

I noticed this problem when I found that the number of bills updated far exceeded the number downloaded, e.g.:

----------------------------------------------------------------------
  2019-01-24 15:50:56 Downloaded 133 bills  
----------------------------------------------------------------------

 ...fun ASCII art...

2019-01-24 15:50:56 Importing bills ...               
2019-01-24 15:51:35 Inserted a total of 1967 raw bills

2019-01-24 15:51:43 Inserted 7143 actions

2019-01-24 15:51:49 Found 1915 changed bill

Marking bills with a fresh updated_at timestamp causes issues further down the data pipeline, mainly in update_index and convert_rtf. Both depend on the updated_at timestamp: if thousands of bills are marked as updated, then the Solr update and the conversion script have to handle thousands of entries. That takes considerable time.
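
To make that concrete, here is a rough sketch (not the actual django-councilmatic code) of the kind of query both steps effectively run; the helper name and the since cutoff are hypothetical:

# Both update_index and convert_rtf effectively select bills by updated_at,
# so stamping thousands of bills with a fresh timestamp means thousands of
# entries to re-index and re-convert. Helper and 'since' are hypothetical.
from councilmatic_core.models import Bill

def bills_touched_since(since):
    return Bill.objects.filter(updated_at__gt=since)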

Proposed solution

I think we need to implement https://github.com/datamade/django-councilmatic/issues/226. Specifically, I'd like to see the downloads directory get cleared at the end of every import. I'll put together a PR for this.
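
For what it's worth, a minimal sketch of that cleanup, assuming the import command knows the path to its downloads directory; the path handling and the hook point are assumptions, not the eventual PR:

# Hypothetical cleanup at the end of import_data: empty the downloads
# directory so the next run's raw_* tables only see newly downloaded files.
import os
import shutil

def clear_downloads(downloads_dir):
    for name in os.listdir(downloads_dir):
        path = os.path.join(downloads_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)
        else:
            os.remove(path)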

reginafcompton commented 5 years ago

The RTF conversion script typically runs within 15 minutes. However, it sometimes requires additional time.

This can happen if the import adds a considerable volume of bills – that's fine! – but it also seems to happen because the conversion script determines which bills need converting by finding the max ocd_updated_at timestamp and comparing it to each bill's updated_at timestamp.

The OCD timestamp occurs earlier than the updated_at timestamp in our Councilmatic database, and so the RTF script potentially runs against bills that have already been converted (unnecessary and time consuming). The simple fix? Do this: max_updated = Bill.objects.all().aggregate(Max('updated_at'))['updated_at__max']
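
For context, the same line spelled out with its imports, plus my reading of the current cutoff for contrast (the ocd_updated_at aggregate is an assumption about the existing script, not a verbatim quote, and I'm assuming the Bill model lives in councilmatic_core.models):

from django.db.models import Max
from councilmatic_core.models import Bill

# Current behavior, as I understand it: the cutoff comes from the OCD
# timestamp, which lags our own updated_at, so already-converted bills can
# fall after the cutoff and get re-converted.
ocd_cutoff = Bill.objects.all().aggregate(Max('ocd_updated_at'))['ocd_updated_at__max']

# Proposed: take the cutoff from our own updated_at instead.
max_updated = Bill.objects.all().aggregate(Max('updated_at'))['updated_at__max']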

reginafcompton commented 5 years ago

RTF conversion script updated via https://github.com/datamade/django-councilmatic/pull/230. Ready to close!