cal-itp / calitp-py

Tools for accessing and analyzing cal-itp data
https://docs.calitp.org/calitp-py
GNU General Public License v3.0
1 stars 0 forks source link

calitp-py: to_snakecase should substitute colons #41

Open lauriemerrell opened 2 years ago

lauriemerrell commented 2 years ago

BigQuery cannot load column names that contain colons (:). We ran into this with the AirtableToWarehouse operator:

google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/bigquery/v2/projects/cal-itp-data-infra/datasets/airtable/tables?prettyPrint=false: Invalid field name "product:_graas". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 300 characters long.

The column names in that operator are pre-processed with to_snakecase, but the list of punctuation to replace in to_snakecase is manually specified and does not include colons: https://github.com/cal-itp/calitp-py/blob/main/calitp/sql.py#L153

We should add colons to this list.

Things to check:

lauriemerrell commented 2 years ago

@atvaccaro re: https://github.com/cal-itp/data-infra/pull/1218#pullrequestreview-908938129

Right now to_snakecase does different replacement behavior for quotes (', ", replaced by nothing) than for other non-alphanumeric characters (replaced by -). So I think that's at least part of the reason that it's hard-coded and the reason that I think we need to investigate how this is actually being used before we make any of these broader changes.

I guess we could replace the quotes first and then replace anything non-alphanumeric that remains with _.

lauriemerrell commented 2 years ago

If/once we do this, we can revert https://github.com/cal-itp/data-infra/pull/1218 and maybe remove the special handling for # signs in that table too.

atvaccaro commented 2 years ago

Right now to_snakecase does different replacement behavior for quotes (', ", replaced by nothing) than for other non-alphanumeric characters (replaced by -). So I think that's at least part of the reason that it's hard-coded and the reason that I think we need to investigate how this is actually being used before we make any of these broader changes.

Do we know the original reasoning for different replacement characters?

I guess we could replace the quotes first and then replace anything non-alphanumeric that remains with _.

That's a good idea.

lauriemerrell commented 2 years ago

Do we know the original reasoning for different replacement characters?

I don't, but I suspect if we look more at the different places where it's used it might become clearer. I believe it's used in the CSVToWarehouse operator so suspect that may be part of it.