Closed reginafcompton closed 5 years ago
import_data
import_data
downloads 1781 bills on 2019-01-22 at 9:35 am. After which, update_index
and convert_rtf
do not run. That's very strange. import_data
runs again at 2019-01-22 at 12:05 pm, and downloads only 84 bills.
update_index
update_index
runs on 2019-01-22 at 12:06:51 pm. It indexes 1598 nyc bills. The process requires ends at 12:08:52 pm. The next update occurs on 2019-01-22 at 13:37:14. It indexes 1607 nyc bills.
convert_rtf
convert_rtf
runs on 2019-01-22 12:08:52. It finishes at 13:22:38 – so, here, the conversion script clearly causes the "backup" in the data pipeline.
Some questions:
update_index
update so many bills, when the previous import only downloads 84 bills?age
argument do? can it be more precise?I found the source of the issue: import_data
. Here's what's going on.
import_data
determines which entities have been newly updated by querying the raw tables, i.e., the tables populated with data in the downloads
directory. For example, in the case of bills, import_data
executes the following queries:
Put data into change_bill table
INSERT INTO change_bill
SELECT raw.ocd_id
FROM raw_bill AS raw
JOIN councilmatic_core_bill AS dat
ON raw.ocd_id = dat.ocd_id
....where clause to handle null values....
Update existing entries in councilmatic_core_bill
, if they are in the change_bill
table
UPDATE councilmatic_core_bill SET
"ocd_id"=s."ocd_id", "ocd_created_at"=s."ocd_created_at", "ocd_updated_at"=s."ocd_updated_at", "description"=s."description", "identifier"=s."identifier", "bill_type"=s."bill_type", "classification"=s."classification", "source_url"=s."source_url", "source_note"=s."source_note", "from_organization_id"=s."from_organization_id", "full_text"=s."full_text", "ocr_full_text"=s."ocr_full_text", "html_text"=s."html_text", "abstract"=s."abstract", "last_action_date"=s."last_action_date", "legislative_session_id"=s."legislative_session_id", "slug"=s."slug", "updated_at"=s."updated_at"
FROM (
SELECT
raw."ocd_id", raw."ocd_created_at", raw."ocd_updated_at", raw."description", raw."identifier", raw."bill_type", raw."classification", raw."source_url", raw."source_note", raw."from_organization_id", raw."full_text", raw."ocr_full_text", raw."html_text", raw."abstract", raw."last_action_date", raw."legislative_session_id", raw."slug", raw."updated_at"
FROM raw_bill AS raw
JOIN change_bill AS change
ON raw.ocd_id = change.ocd_id
) AS s
WHERE councilmatic_core_bill.ocd_id = s.ocd_id
I noticed this problem when I found that the number of bills updated far exceeded the number downloaded, e.g.:
----------------------------------------------------------------------
2019-01-24 15:50:56 Downloaded 133 bills
----------------------------------------------------------------------
...fun ASCI art...
2019-01-24 15:50:56 Importing bills ...
2019-01-24 15:51:35 Inserted a total of 1967 raw bills
2019-01-24 15:51:43 Inserted 7143 actions
2019-01-24 15:51:49 Found 1915 changed bill
Marking bills with a fresh updated_at
timestamp causes issues further down the data pipeline, mainly, in update_index
and convert_rtf
. Both depend on the updated_timestamp
: so, if thousands of bills are marked as being updated, the Solr update and conversion script have to handle thousands of entries. That takes considerable time.
I think we need to implement https://github.com/datamade/django-councilmatic/issues/226. Specifically, I'd like to see the downloads directory get cleared at the end of every import. I'll put together a PR for this.
The RTF conversion script, typically, runs within 15 minutes. However, sometimes it requires additional time.
This can happen if the import adds a considerable volume of bills – that's fine! – but it also seems to happen because the conversion script determines which bills need converting by finding the max ocd_updated_at
timestamp, and comparing that to the bills' updated_at
timestamp.
The ocd timestamp occurs earlier than the updated_at
timestamp in our Councilmatic database, and so, the rtf script potentially runs against bills that have already been converted (unnecessary and time consuming). The simple fix? Do this: max_updated = Bill.objects.all().aggregate(Max('updated_at'))['updated_at__max']
RTF conversion script updated via https://github.com/datamade/django-councilmatic/pull/230. Ready to close!
Related issues:
140
135
Preliminary observations
On January 22, we deployed NYC Councilmatic to the production server, in turn clearing out the
downloads
directory. Resultantly, the data pipeline cycle completed as expected: once every four hours.Shortly thereafter, the process began to slow, with import_data executing once every one or two hours. It's not yet obvious where the cron backup occurs, nor why, though it certainly seems to be in either the updating of the solr index or the conversion of RTF.