caciviclab / disclosure-backend-static

Campaign finance data powering Open Disclosure California
https://caciviclab.org/odca-jekyll/
15 stars 13 forks source link

Select the most recent committee name when there are more than one #352

Closed ChenglimEar closed 5 months ago

ChenglimEar commented 5 months ago

This problem was found under the following circumstance:

There are some differences in the build output that are due to DISTINCT in SQL picking out different rows when there are multiple committees for the same Filer_ID. As an example, take a look at the difference for build/_data/candidates/oakland/2014-11-04/bryan-parker.json. The value for Filer_NamL is different and it comes from this query in the independent_candidate_expenditures view:

SELECT DISTINCT ON ("Filer_ID") "Filer_ID", "Filer_NamL" FROM committees WHERE "Filer_ID" = '983545'; The following query shows that there are two rows for the same Filer_ID and so the DISTINCT portion of the query picks a different one before and after the csvkit upgrade:

disclosure-backend=# SELECT "Filer_ID", "Filer_NamL" FROM committees WHERE "Filer_ID" = '983545'; Filer_ID | Filer_NamL ----------+--------------------------------------------------- 983545 | OAKPAC, Oakland Metropolitan Chamber of Commerce 983545 | OAK PAC, Oakland Metropolitan Chamber of Commerce (2 rows) If we force it to always pick the same one (like always pick the most recent one), it will help reduce the noise so that we can verify that there aren't any other differences that are more serious.