BigQuery integration - Githubissues

Final pull request for the initial BigQuery migration

Thank you for pushing the over the finish line. I may add @Daniel Berger @.***> to the PR if you don't mind. My goal is to have this finished by the end of next weekend.

--Susama

On Sun, Sep 8, 2024 at 3:02 AM Ben Mainwaring @.***> wrote:

@mainwaringb https://github.com/mainwaringb requested your review on:

53

https://github.com/Voting-Rights-Code/Equitable-Polling-Locations/pull/53 BigQuery integration.

— Reply to this email directly, view it on GitHub https://github.com/Voting-Rights-Code/Equitable-Polling-Locations/pull/53#event-14173420158, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJGCLORUN2K3VS65HFAGC3DZVPZANAVCNFSM6AAAAABN2W5ZRSVHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJUGE3TGNBSGAYTKOA . You are receiving this because your review was requested.Message ID: <Voting-Rights-Code/Equitable-Polling-Locations/pull/53/issue_event/14173420158 @github.com>

@Amasus sounds good! Let me resolve the merge conflicts today - ya'll can feel free to start reviewing if you have time, but expect one final commit RE the merge issues

@Amasus merge conflicts done!

@mainwaringb, looking through the code and trying to see if I understand things. More questions added here, as opposed to the other email thread.

back_fill_data.py vs the files in BigQuery_integration/ :
- back_fill_data seems clear. It creates the schema and uploads the files. But then collate_results.R seems to do something similar?
- Also, there is a strange comment in create_views.py about views on returning states starting with W. Is this testing code? Is the comment left over?
Trying to understand how you are setting up the "other_args" setup and how / if that affects how config files need to be made moving forward
- My understanding (which may be wrong) is that the new config files need to have an other_args file, which is a dictionary in and of itself. The keys in this dictionary correspond to fields that are still in development (e.g. driving and fixed_capacity_site_number). Is this true?
- if other_args is not empty, one can only write to .csvs, and this data is not ready for prod yet. Is that true?
- I'm just realizing that the penalized_sites field is not in the schema anywhere
- I'd love to see a config file with the correct format, just to make certain I'm understanding what I need to produe. (This would be useful to @Soumyajit001 as well)
On the subject of config files, I'm noticing that there are far fewer config folders in this branch than there are in main
- For instance,Gwinnett_GA_12_for_2024_penalized_configs which has the penalized_sites field
- in a similar vein, the only config folder with a driving field is Engage_VA_2024_driving_configs, and the version in the current branch has the deprecated field driving_distance_file_path, while the more up to date version of this file (on main) has driving: True # bool (default = False).
- None of the folders with either of the fields that currently should live in other_args are in this branch (thus the need to see how you want that field implemented in config folders)
- I'm both concerned about the data in main not being over written by merges, and just to understand how one tells, moving forward, how one should keep track of which files have been modified, etc, and which still need work.
I'm a little worried that there will now need to be multiple places where config fields will need to be updated: definition of the schema and model_config.
- Is there a clever way for both of these to be pulling from the same single source of truth?
- If not, is there a way to set up testing so that one can check that both have the same fields (including the same spelling of said fields) to avoid duplication problems?

I realize that this is a lot. But I really appreciate your working with me on this. It is absolutely clear to me that you have done a phenomenal ammount of work, and I cannot thank you enough. This is going to be awesome.

I'll keep reviewing and asking questions. Just get back to me as and when you can.

@Amasus

back_fill_data.py vs the files in BigQuery_integration The R file is the first step (restructuring data), the python backfill_data.py is the second step (calling the functions to write to BigQuery). I wrote the R code when I was still grappling with what the structure of the database should be, and the code is a bit unnecessarily complex as a result. But it's throwaway code (we run as a one-off to backfill, then delete and never touch again) so no point in refactoring.

Trying to understand how you are setting up the "other_args" setup and how / if that affects how config files need to be made moving forward Nothing changes in the config files because of "other_args". The only change that this PR introduces to config files is the optional (but highly recommended!) config_name and config_set parameters.other_args is automatically generated when uploading data to BigQuery.

Specifically: if there are any config parameters that don't match with the "cannonical" set of columns that BigQuery is aware of, these config parameters get stored in JSON-like format in a column called other_args. So for example, if a config file has non-cannonical driving and fixed_capacity_site_number parameters, we convert these automatically into something like "{driving: True, fixed_capacity_site_number:47}" and store them in the other_args column. By design, we're only going to allow this in the test database. If you want to write to prod, you need to first add the column to BigQuery and to the config Python object.

On the subject of config files, I'm noticing that there are far fewer config folders in this branch than there are in main The config files in the branch follow what was in main on July 21, and BigQuery is currently backfilled with what existed in main around that data. Once we merge the branch, I can backfill new configs.

What worries me is if any outputs for a given config have changed since July 21. If we think outputs for the same config have changed, we may want to drop all existing BigQuery data and re-backfill in one fell swoop. The Engage_VA data configs will need their filepaths changed to follow the format I'm using (the backfill functions didn't like having more than one county in a folder).

I'd say let's treat the details of the backfill as a separate problem. We should make sure new data is getting written to BigQuery rather than CSVs, and then do a big final backfill once we know that there is no new data getting written to non-BigQuery sources.

I'm a little worried that there will now need to be multiple places where config fields will need to be updated: definition of the schema and model_config. Do we have a non-negligible number of instances where we rewrote the initial main branch config files and outputs for a given config name? In other words, have we frequently written a set of config and output files to main, then overwritten that? Do we need a scalable solution to this problem? If so, I think the likely solution is to drop all existing data from BigQuery, and do a single big backfill based on the most recent state of config and output files.

It sounds like there are two potential issues we may need to solve for: 1) Deprecated fields - I think the solution here should be to retain an exact record of the config as it was run, even if this uses now-deprecated fields. Assuming this is a fairly rare circumstance, I'd create a BigQuery column in prod for these deprecated fields, and just make a note that it's not used anymore. So if the output for a file was generated using driving_distance_file_path, we retain that column in the output file. 2) Renamed fields - I think we can manually handle renamed-but-identical fields, so they follow the current naming convention. However, moving forward I would have a strong bias against renaming existing fields.

PS - the comment about "W"'s was in fact left over from whatever tutorial I originally grabbed the code from. Dropped that, and also dropped a couple pycache files I accidentally committed.

@Amasus

back_fill_data.py vs the files in BigQuery_integration

Okay, that makes sense. Thank you fro clearing that up.

Trying to understand how you are setting up the "other_args" setup and how / if that affects how config files need to be made moving forward Nothing changes in the config files because of "other_args". The only change that this PR introduces to config files is the optional (but highly recommended!) config_name and config_set parameters.other_args is automatically generated when uploading data to BigQuery.

Okay, so just so I understand, right now, if driving and fixed_capacity_site_number appear in a future config file, then this automatically gets thrown into other_args and we don't write to SQL.

If, in November, I decide I would like to test out a different field, e.g. distance_metric then that would automatically be sent into other_args as well, correct?
- If I need to introduce a new field to the config, how should I deal with it in the file model_config.py?
What about penalized_sites, which is currently in model_config.py, but not in backfill_data.py? In model_config.py it looks like a regular field, and isn't commented about as part of other_args in the way that the other two you've mentioned are. What happens to that field?

Specifically: if there are any config parameters that don't match with the "cannonical" set of columns that BigQuery is aware of, these config parameters get stored in JSON-like format in a column called other_args. So for example, if a config file has non-cannonical driving and fixed_capacity_site_number parameters, we convert these automatically into something like "{driving: True, fixed_capacity_site_number:47}" and store them in the other_args column. By design, we're only going to allow this in the test database. If you want to write to prod, you need to first add the column to BigQuery and to the config Python object.

On the whole, this sounds reasonable. However, here is my issue with this approach:

A lot of the hassle you are seeing with the config files being changed, etc (that you rightfully call out below) is due to the fact that to date, this has been a manual process, and as such enforcing standards has been hard. In order to avoid these problems in the future, I am trying to figure out a way to automate the generation of config files. My immediate response is
- to have Soumyajit create a config generator that reads the fields directly from model_config.py (see single source of truth).
- This would mean that all subsequent config files would have entries of the form penalized_sites : [], driving : False, and fixed_capacity_site_number: None.
- However, this is an issue because, if I understand the workflow above correctly, none of these would ever make it to SQL
- Is there a way to set thing to write to SQL if certain defaults aren't met?

On the subject of config files, I'm noticing that there are far fewer config folders in this branch than there are in main

Will respond to this next. Want to spend some time comparing the July 21rst version of main to what is currently there for context.

Before I address the comment below, I want to flag a merge risk. Take a look at the files in Engage_VA_2024_driving_config/. Your branch has 8 files, main has 4. Your version contains two versions of each config, one with the driving flag and another with a driving path field. The second is deprecated.

I'm a little worried that there will now need to be multiple places where config fields will need to be updated: definition of the schema and model_config. Do we have a non-negligible number of instances where we rewrote the initial main branch config files and outputs for a given config name?

As far as I know, the following are the only exceptions, but it would be nice to do a systematic check.

Engage_VA_2024_diving_configs
1. Possibly other driving configs. I know that Daphne struggled with how this should be implemented, but I thought that this was fixed by mid May. (This is why I was so confused/ alarmed to see the deprecated driving flag in your branch)
A set of config fields where I'd realized after delivery that there were multiple changing parameters in the config file (something I'm trying to automate to guard against)
1. Compare_Savannah_City_of_GA_original_configs (this folder is now deprecated, but I haven't removed it, or its outputs from main)
2. Contained_in_*_original_configs
3. Intersecting_*_original_configs
4. Gwinnett_12_for_2024_configs
  1. This has been broken into two folders: Gwinnett_12_for_2024_configs and Gwinnett_12_for_2024_penalized_configs
A set of files where I'd found a typo (again, automatic generation of files in a set will help prevent this later)
1. *_Madison_City_of_WI_potential_configs
A set of files where there there is still a typo, but hasn't been addressed yet
1. *_Madison_City_of_WI_potential_configs_driving

In other words, have we frequently written a set of config and output files to main, then overwritten that? Do we need a scalable solution to this problem?

Oh please, I hope not. As you see from the example above, the rewrites all come from instances human error.

If so, I think the likely solution is to drop all existing data from BigQuery, and do a single big backfill based on the most recent state of config and output files.

I don't think this is going to be necessary.

It sounds like there are two potential issues we may need to solve for:

Deprecated fields - I think the solution here should be to retain an exact record of the config as it was run, even if this uses now-deprecated fields. Assuming this is a fairly rare circumstance, I'd create a BigQuery column in prod for these deprecated fields, and just make a note that it's not used anymore. So if the output for a file was generated using driving_distance_file_path, we retain that column in the output file.

I actually don't think it is necessary for the past work. I don't actually see corresponding runs in the appropriate *_result folders for the bad config files, so I think we can just ignore those config files.

More generally, I think it would be nice to have a systematic check of which config files have corresponding outputs, and which outputs have corresponding config files. Again, I've tried to stay on top of keeping things clean, but it hasn't always been possible, so there are definitely inconsitencies in the data. For instance, there isn't a folder DeKalb_bg_penalized_no_school_configs/, but there are results for said folder sitting around in DeKalb_GA_results. (I think Daphne has the configs, I'd have to check).

Renamed fields - I think we can manually handle renamed-but-identical fields, so they follow the current naming convention. However, moving forward I would have a strong bias against renaming existing fields.

In an ideal world, I would agree with you completely. However ...

Here are a few changes I can definitely seeing coming up in the near future:

driving -> distance_metric as we try to incorporate multiple different distance measures into the model
bad_types -> good types: this was a bad design decision on my part, and I would eventually like to not use bad_types, and only use good types. This may not require renaming, just deprecating of an old field and incorporating a new one

Again, both of these are because of design decisions made early on/ need to grow the model in ways that we didn't forsee when we started out.

I REALLY want to have a stable and rigid data structure. I absolutely believe that this is the right way to do things. Ideally, I'd have a stable and rigid data structure, and an engineer on hand to tweak it every few months as the model changes. but I don't have that right now. As a result, I'm trying to figure out what the right middle ground should be.

I'd love to hear thoughts.

Looking through model_results.py, and have a few thoughts.

Do I recall correctly that you are producing a view that has the most recently run data associated to a config file, right (including the config file itself?)
If so, here's what I'm wondering can be done.
1. Add runtime to all of the output files (and possibly the config?) as well as config_name and config set. Similarly, for the *_locations_only.csv
2. Not overwrite data, but always append if the data is to be written
3. Need to be a little careful in this case about not "accidentally" adding data to a table by setting the wrong flag. Maybe the right solution is to run some sort of back_fill_data when merging to main, but I know there are issues here as well. I'd like to discuss

Why do I want this?

True story. I'd run some DeKalb analysis for 2022. Then Fair Fight Action asked me to run another set of analysis on DeKalb for 2024, which required me to change the DeKalb_GA_locations_only.csv file. Four months later, Daphne needs access to the old DeKalb_GA_locations_only.csv. Because this is Git, she is able to go back and find the right version, because ... yay version controll. I'm hoping that having a runtime column to all datasets will allow for a similar feature. Please tell me if you think that there is another way to get around this.
Since there is currently not a good way to store the graphs (which are completely determined by the data being stored), this would allow one to go back and recreate graphs that were created by a different source of the data.

Does this make sense? Is the need for this usecase completely broken everything? I know I've been saying that I DO NOT want to accidentally change the output data, and that is still true. However, I may very well want to change things like the *_locations_only.csv files or even the configs, and be able to go back and find old versions.

Tackling three things in this comment: 1) a general principle, 2) how other_args will mesh with the automatic config generator, and 3) using config name and set as a unique key.

General Principles

I think there are a lot of productive architecture conversations that we should have and that this PR is catalyzing. However, as a practical matter, this PR is already a beast because it diverges a lot from main. I want to merge it as quickly as possible, move new data to BigQuery, and then build on top of it and backfill existing data. The longer we wait to merge, the more debt we add and the harder PR becomes.

So I'd focus on questions:

Is anything in this PR going to block existing delivery in the next say 4-6 weeks?
Is anything in this PR going to lead to bad admin practices (the answer here is "yes" - we have duplicate config files and outputs that we should clear up, or it will be a massive headache later)?
Are there any hard-to-reverse design decisions that we need to make now (I think the answer here is "no" but we should have a clear consensus here)

And then dump everything else into a GDoc or some similar format where we can continue to riff on it.

Matching Other_Args with the Config Generator

a config generator that reads the fields directly from model_config.py (see single source of truth)...would mean that all subsequent config files would have entries of the form penalized_sites : [], driving : False, and fixed_capacity_site_number: None.However, this is an issue because, if I understand the workflow above correctly, none of these would ever make it to SQL

I think we're okay here! When we add a new class attribute to model_config, this doesn't automatically add it to the list of cannonical arguments. We're_ manually defining cannonical_args in a list in the code. As a best practice, I lean towards thinking that cannonical_args should match the class attributes in the main branch - but adding new class attributes won't automatically update cannonical args

Slightly simplified version of the code. I've renamed the variable and edited comments to make this a little clearer and pushed a new commit here.


@dataclass
class PollingModelConfig:

  # Define class attributes
  driving: bool = False
  capacity: float = 1.0
  minpctold: float = 0

  @staticmethod
  def load_config(config_yaml_path: str) -> 'PollingModelConfig':

    # Define cannonical args - note that the class attribute for 'driving' isn't included in this list
    cannonical_args =  ['capacity', 'min_pct_old']
    with open(config_yaml_path, 'r', encoding='utf-8') as yaml_file:
        config = yaml.safe_load(yaml_file)

        # Look for args that don't match the cannonical_args list, and dump them into an other_args dict
        for key, value in config.items():
            if key not in cannonical_args:
                other_args[key] = value

Using Config Name and Set As a Unique Key

Do I recall correctly that you are producing a view that has the most recently run data associated to a config file, right (including the config file itself?)

The design decision we made a couple months ago was to use (config name * config set) to uniquely identify a set of outputs. By design, if there are already outputs for config_set = 'DeKalb_GA_locations_only' and config_name = 'Dekalb_GA_locations_only_2022', we can't add more. By design, writing to BigQuery will fail if replace = False (the default) or overwrite existing data if replace = True.

I think this design decision was right for now, but not a long term solution. If we're re-running analysis on the same set of counties, we should imho strongly encourage the creation a new config set, with a new name. (config set config name) should for most practical purposes act like a unique key. That said, for a variety of reasons, I do think we should eventually create a hash of (config set config name * runtime) and use this as the actual unique key. There are a variety of benefits to this, one of which is exactly what you mention - it could allow us to 'hide' old versions of runs rather than actually deleting them. I think there are a few complexities to think through here (retention periods for both test and prod, etc) but basically the use case you outline makes sense and is something we can sole for.

This more sophisticated data model is a next step, accomplishable I think in the next few months.

Note, in Engage_VA_2024_driving_configs, keep the files with original, delete the other ones.

@antisocialscientist , can you review the R code for this when you get a chance?

Prior version read everything from config files. The code chose to repeatedly pull from file as needed instead of saving data in local memory.

Current version reads either from config or database, and write the config files to local memory to avoid multiple calls to the database.

Voting-Rights-Code / Equitable-Polling-Locations

BigQuery integration #53

53

General Principles

Matching Other_Args with the Config Generator

Using Config Name and Set As a Unique Key