ChenglimEar closed this pull request 4 months ago.
@mikeubell I've added the remaining code to ensure that the ETL runs through successfully under csvkit 1.3.0. It appeared that Travis was not correctly installing packages under Python 3 when pip install was run with sudo. I've corrected that, but noticed that it was running under Python 3.7 instead of the configured Python 3.9 on Travis. Even though it appears to be working, it would be better if we could understand why Travis is not running with the exact configured version of Python.
@mikeubell There are some differences in the build output that are due to DISTINCT in SQL picking out different rows when there are multiple committees for the same Filer_ID. As an example, take a look at the difference for build/_data/candidates/oakland/2014-11-04/bryan-parker.json. The value for Filer_NamL is different, and it comes from this query in the independent_candidate_expenditures view:
SELECT DISTINCT ON ("Filer_ID") "Filer_ID", "Filer_NamL"
FROM committees WHERE "Filer_ID" = '983545';
The following query shows that there are two rows for the same Filer_ID, and so the DISTINCT portion of the query picks a different one before and after the csvkit upgrade:
disclosure-backend=# SELECT "Filer_ID", "Filer_NamL"
FROM committees WHERE "Filer_ID" = '983545';
Filer_ID | Filer_NamL
----------+---------------------------------------------------
983545 | OAKPAC, Oakland Metropolitan Chamber of Commerce
983545 | OAK PAC, Oakland Metropolitan Chamber of Commerce
(2 rows)
If we force it to always pick the same one (e.g. always pick the most recent one), it will help reduce the noise so that we can verify that there aren't any other differences that are more serious. How do you feel about that? Otherwise, if you think the current set of build output differences is all legit, we can maybe merge them as is. If you're ok with the current differences as is, please approve this PR and I can finally merge it. Thanks.
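One way to make the choice deterministic is to add an ORDER BY, since DISTINCT ON keeps the first row per Filer_ID in the sort order. This is only a sketch; the recency column name ("Start_Date") is hypothetical and would need to be matched to whatever the committees table actually has:

```sql
-- Sketch: DISTINCT ON keeps the first row per "Filer_ID" in ORDER BY order,
-- so sorting by a recency column (hypothetically "Start_Date") makes the
-- selected row deterministic instead of depending on physical row order.
SELECT DISTINCT ON ("Filer_ID") "Filer_ID", "Filer_NamL"
FROM committees
WHERE "Filer_ID" = '983545'
ORDER BY "Filer_ID", "Start_Date" DESC;
```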
It would be better to join the committees table on Ballot_Measure_Election to election_name since committees change names to reflect what they are supporting or opposing for an election. If that does not work then using the newest one would be ok since we are mostly interested in the current election.
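A sketch of what that join might look like, assuming an elections table with an election_name column; the actual table and column names would need to be confirmed against the real schema:

```sql
-- Sketch only: match each committee to the election it was filed for,
-- so the name reflects what it supported or opposed in that election.
-- Table name "elections" is an assumption.
SELECT c."Filer_ID", c."Filer_NamL"
FROM committees c
JOIN elections e
  ON c."Ballot_Measure_Election" = e.election_name
WHERE c."Filer_ID" = '983545';
```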
I'll create a different issue for joining committees on Ballot_Measure_Election since that might open another can of worms and do an interim fix to pick the newest one instead of using DISTINCT to randomly pick one. Hopefully that will remove noise and we can close up this PR.
@mikeubell Looking at changes to the build directory, it's looking pretty good. Can you take a final look and approve the changes in this branch?
@mikeubell After the latest auto-update when I merged from master, I'm seeing more changes to the build directory that need explanation. Almost there, but not quite yet.
@mikeubell After taking a look at the changes to the build directory, it looks like they're due to an ordering difference for things like top lists where there's a tie between multiple choices. I think this is ok for moving forward with the csvkit upgrade. Please let me know if you'd like any other changes before approving this PR.
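If we ever want to eliminate that noise too, ties in a top list can be broken deterministically with a secondary sort key. A sketch only; the table and column names here are hypothetical, not taken from the actual views:

```sql
-- Sketch: when two rows tie on "total", adding a unique secondary sort
-- key (here hypothetically "Filer_ID") makes the top-N ordering stable
-- across runs instead of depending on physical row order.
SELECT "Filer_NamL", total
FROM contribution_totals
ORDER BY total DESC, "Filer_ID"
LIMIT 10;
```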
This was not a normal change, as is evident from how long it has taken to get it to an approvable place. The messy history reflects how upgrading csvkit was not a simple matter, due to lots of hidden and intertwined problems that got in the way. The whole thing broke when it was upgraded, and each fix uncovered another complex problem that needed to be fixed, so I would not interpret the commit history as representative of any kind of practice.
There will be no future view of these commits because, at the end of the day, they will all be squashed when merged into master, since the changes are tied to one thing: upgrading csvkit. I would focus more on the file changes and making sure that they are correct.
As for dbschema/ vs schema.sql, they are for two purposes:
1) The csvkit upgrade requires using an explicit schema instead of having csvkit auto-detect one, which is buggy from version to version. The dbschema directory was created to accommodate this. It contains the schemas of all the tables, created explicitly so that csvkit can import data into them. I'll see if I can add something to the README about this to make sure it's clear what these files are for.
2) In order to confidently upgrade csvkit, I had to make sure that the resulting changes to the database schema were visible in the PR through file diffs. The code for creating the schema.sql file was created in another PR. It is a dump of the resulting database schema in Postgres after make import. This allows us to verify anything done to change the schema, such as upgrading csvkit or modifying the files in dbschema.
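Roughly, the workflow the two points above describe might look like this; the file paths, table name, and database name are illustrative, not taken from the actual Makefile:

```shell
# Sketch of the dbschema/schema.sql workflow; names are illustrative.
# 1) Create tables from the explicit, hand-maintained schema:
psql disclosure-backend -f dbschema/committees.sql
# 2) Import with csvkit's auto-detection and table creation disabled:
csvsql --db postgresql:///disclosure-backend \
       --tables committees --insert --no-create --no-inference \
       downloads/committees.csv
# 3) Dump the resulting schema so any change shows up as a file diff:
pg_dump --schema-only disclosure-backend > schema.sql
```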
@mikeubell Please have a look at the changes to the README file that explain dbschema/ and schema.sql.
It looks like csvkit is at a beta version (0.9.0). It needs to be upgraded to a released version to ensure that we aren't stuck with an unsupported, outdated package. We need a pull request just to upgrade csvkit, because the beta version has some behaviors that were changed for 1.0.0 and beyond. One big one is that tables will no longer be created for empty CSV files; in 0.9.0, the table is created with default varchar types. There's code that seems tied to this default behavior, so we will have to carefully make sure that the code surrounding the data is also upgraded to work with the released behaviors of csvkit.
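To cope with the new behavior, the import step could detect header-only CSV files up front and create their tables from the explicit schema instead of relying on csvkit. A minimal check, not taken from the actual codebase:

```python
import csv

def csv_has_data_rows(path):
    """Return True if the CSV file at `path` has at least one row
    beyond the header (csvkit >= 1.0 creates no table otherwise)."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return next(reader, None) is not None
```

The import loop could then branch: files where this returns False get their (empty) tables created directly from dbschema instead of via csvsql.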
Looks like version 0.9.0 has a bug where the Form_Type column of the A-Contributions table is getting populated with 00:00:00 instead of the original value of A. We'll be fixing this in this pull request.
Since this change touches the core of the code that transforms the data, we need to verify that the resulting data matches the data produced by the master branch before we approve the changes. In addition to code review, the data has to be reviewed somehow. From some initial review, it looks like the old csvkit handled data cleanup better. In the case of the candidate Derrick Soo, there was a newline in the name of the candidate when reading the downloaded candidates.csv file. The newline did not get imported into Postgres with the old csvkit, but it did with the new one. This becomes problematic downstream, so it needs to be addressed.
This change will require that we adopt an explicit schema instead of an auto-detected schema. The explicit schema will be set to the current auto-detected schema except for the following corrections: