Metro-Records / scrapers-lametro

Open Civic Data scrapers for Los Angeles Metro transportation agency.
https://metro-records.github.io/scrapers-lametro/
MIT License

DataImportError when scraping restricted bills #30

Open antidipyramid opened 1 month ago

antidipyramid commented 1 month ago

Board reports 2024-0556 and 2024-0549 were both restricted bills that raised DataImportErrors during scraping (both when restricted and not).

[2024-10-21, 15:35:25 CDT] {docker.py:391} INFO - duplicate key value violates unique constraint "councilmatic_core_bill_slug_ecb9ca6b_uniq" DETAIL: Key (slug)=(2024-0556) already exists. while importing {'identifier': '2024-0556', 'title': 'AUTHORIZE the Chief ...

antidipyramid commented 3 weeks ago

Deleting the offending bills and re-scraping fixed the issue.

This has come up again with another restricted bill with an identifier of 2024-1033:

raise DataImportError(
pupa.exceptions.DataImportError: duplicate key value violates unique constraint "councilmatic_core_bill_slug_ecb9ca6b_uniq"
DETAIL:  Key (slug)=(2024-1033) already exists.
 while importing {'identifier': '2024-1033', 'title': 'Restricted View', 'classification': ['bill'], 'subject': [], 'extras': {'restrict_view': True, 'plain_text': '', 'rtf_text': ''}, 'legislative_session_id': UUID('d5353c5e-efed-43b7-9c08-54751ed323a8'), 'from_organization_id': 'ocd-organization/f659e65f-0e12-46f2-9610-c3f1456540a2'} as <class 'opencivicdata.legislative.models.bill.Bill'>; 2002774)

There's already a bill with the same identifier in the database. One difference between it and the scraped bill seems to be the legislative_session_id: the one in the database has a legislative_session_id of UUID('997eda68-3c01-4378-adeb-2a009842a7b4'), while the scraped bill has UUID('d5353c5e-efed-43b7-9c08-54751ed323a8').

The from_organization_ids are identical. Pupa uses these two attributes along with the bill's identifier to check if it needs to create a new object or update an existing one.

Since Pupa thinks it's scraping a new bill, it tries to create it but the identifier/slug clashes with the existing bill's, raising the import error.
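Here is a rough sketch of that failure mode, to make the sequence concrete. It is a toy in-memory stand-in, not Pupa's actual importer: `ToyImporter`, `Bill`, and the local `DataImportError` class are hypothetical names, and the slug being equal to the identifier is an assumption based on the error message above.

```python
from dataclasses import dataclass


class DataImportError(Exception):
    """Local stand-in for pupa.exceptions.DataImportError (not the real class)."""


@dataclass
class Bill:
    identifier: str
    legislative_session_id: str
    from_organization_id: str

    @property
    def slug(self) -> str:
        # Assumption: the slug is derived directly from the identifier.
        return self.identifier


class ToyImporter:
    """Hypothetical importer that matches on identifier + session + organization."""

    def __init__(self):
        self.rows = []

    def find_existing(self, scraped):
        for row in self.rows:
            if (
                row.identifier == scraped.identifier
                and row.legislative_session_id == scraped.legislative_session_id
                and row.from_organization_id == scraped.from_organization_id
            ):
                return row
        return None

    def import_bill(self, scraped):
        if self.find_existing(scraped) is not None:
            return "update"
        # No match, so the importer treats the bill as new and tries to insert it.
        # The unique slug constraint then rejects the insert.
        if any(row.slug == scraped.slug for row in self.rows):
            raise DataImportError(f"Key (slug)=({scraped.slug}) already exists.")
        self.rows.append(scraped)
        return "insert"


ORG = "ocd-organization/f659e65f-0e12-46f2-9610-c3f1456540a2"

importer = ToyImporter()
# The bill already in the database, with the older session id.
importer.rows.append(Bill("2024-1033", "997eda68-3c01-4378-adeb-2a009842a7b4", ORG))

# The freshly scraped bill: same identifier and organization, different session.
scraped = Bill("2024-1033", "d5353c5e-efed-43b7-9c08-54751ed323a8", ORG)

try:
    importer.import_bill(scraped)
    outcome = "imported"
except DataImportError as error:
    outcome = f"DataImportError: {error}"

print(outcome)
```

The lookup misses only because of the session id, so the importer falls through to the insert path and hits the slug constraint.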

To your knowledge, has this come up in the past, @hancush? It seems like the common thread is that all of these bills were at one time restricted.

hancush commented 3 weeks ago

@antidipyramid The conflict in legislative session is definitely to blame here. Does the way we determine the legislative session vary between private and public bills?

antidipyramid commented 3 weeks ago

@hancush No, it looks like it's the same for all bills.

hancush commented 3 weeks ago

It looks like we pass the matter's intro date to self.session – can that change? https://github.com/Metro-Records/scrapers-lametro/blob/b44bebba1ee10303769493fbd19dfb543f2cbbc4/lametro/bills.py#L210
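To illustrate why a changed intro date would matter, here is a minimal sketch. It assumes sessions follow a fiscal year starting July 1 and are named for their start year; `session_for` is a hypothetical helper, not the scraper's `self.session`, and the two dates are made up. Nothing here confirms that the intro date actually changed for these bills.

```python
from datetime import date


def session_for(intro_date: date) -> str:
    """Map an intro date to a session name.

    Assumption: a session runs July 1 through June 30 and is named
    for the calendar year in which it starts.
    """
    start_year = intro_date.year if intro_date.month >= 7 else intro_date.year - 1
    return str(start_year)


# Hypothetical dates: the same matter scraped twice with different intro dates.
first_scrape = session_for(date(2024, 6, 28))
later_scrape = session_for(date(2024, 7, 2))

print(first_scrape, later_scrape)
print("session changed:", first_scrape != later_scrape)
```

Under that assumption, a shift of a few days across the session boundary is enough to put the same identifier in a different session.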

antidipyramid commented 2 weeks ago

@hancush One way of dealing with this is to simply remove the legislative session from the object spec during import:

https://github.com/opencivicdata/pupa/blob/2f7847cb87ed467f7afeec3f51cc704471b679c1/pupa/importers/bills.py#L53-L65

Bill slugs (i.e. identifiers) must already be unique, so this would allow the importer to match the scraped bill to the existing object and update its session.
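A minimal sketch of the idea, using plain dicts rather than the real importer. `build_spec` and `find` are hypothetical helpers written for illustration; the actual change would go in the linked Pupa code, and its spec may contain other keys.

```python
def build_spec(bill, include_session=True):
    """Build the lookup spec used to find an existing bill (hypothetical helper)."""
    spec = {
        "identifier": bill["identifier"],
        "from_organization_id": bill["from_organization_id"],
    }
    if include_session:
        spec["legislative_session_id"] = bill["legislative_session_id"]
    return spec


def find(rows, spec):
    """Return the first stored row matching every key in the spec, or None."""
    for row in rows:
        if all(row.get(key) == value for key, value in spec.items()):
            return row
    return None


ORG = "ocd-organization/f659e65f-0e12-46f2-9610-c3f1456540a2"

database = [
    {
        "identifier": "2024-1033",
        "legislative_session_id": "997eda68-3c01-4378-adeb-2a009842a7b4",
        "from_organization_id": ORG,
    }
]

scraped = {
    "identifier": "2024-1033",
    "legislative_session_id": "d5353c5e-efed-43b7-9c08-54751ed323a8",
    "from_organization_id": ORG,
}

# With the session in the spec, the lookup misses and an insert would be attempted.
with_session = find(database, build_spec(scraped, include_session=True))

# Without it, the lookup hits the existing row, which can then be updated in place.
without_session = find(database, build_spec(scraped, include_session=False))
if without_session is not None:
    without_session.update(scraped)

print(with_session is None, database[0]["legislative_session_id"])
```

The trade-off is that two bills sharing an identifier across sessions could no longer coexist, which the unique slug constraint already rules out.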

hancush commented 2 weeks ago

Could work, @antidipyramid! I do want to understand why this is happening now, though.

antidipyramid commented 2 weeks ago

@xmedr, if you have a chance in the next two weeks, it might be a good idea to take a look at the most recent scraper updates to see if those have anything to do with this behavior re: restricted bills. I don't see an obvious link but it'd be nice to get a second opinion.

I searched for similar errors in past issues across multiple repos and didn't see anything that looked like this.