Open lauriemerrell opened 2 years ago
@atvaccaro re: https://github.com/cal-itp/data-infra/pull/1218#pullrequestreview-908938129
Right now to_snakecase
does different replacement behavior for quotes ('
, "
, replaced by nothing) than for other non-alphanumeric characters (replaced by -
). So I think that's at least part of the reason that it's hard-coded and the reason that I think we need to investigate how this is actually being used before we make any of these broader changes.
I guess we could replace the quotes first and then replace anything non-alphanumeric that remains with _
.
If/once we do this, we can revert https://github.com/cal-itp/data-infra/pull/1218 and maybe remove the special handling for #
signs in that table too.
Right now
to_snakecase
does different replacement behavior for quotes ('
,"
, replaced by nothing) than for other non-alphanumeric characters (replaced by-
). So I think that's at least part of the reason that it's hard-coded and the reason that I think we need to investigate how this is actually being used before we make any of these broader changes.
Do we know the original reasoning for different replacement characters?
I guess we could replace the quotes first and then replace anything non-alphanumeric that remains with
_
.
That's a good idea.
Do we know the original reasoning for different replacement characters?
I don't, but I suspect if we look more at the different places where it's used it might become clearer. I believe it's used in the CSVToWarehouse
operator so suspect that may be part of it.
BigQuery cannot load column names that contain colons (
:
). We ran into this with theAirtableToWarehouse
operator:google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/bigquery/v2/projects/cal-itp-data-infra/datasets/airtable/tables?prettyPrint=false: Invalid field name "product:_graas". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 300 characters long.
The column names in that operator are pre-processed with
to_snakecase
, but the list of punctuation to replace into_snakecase
is manually specified and does not include colons: https://github.com/cal-itp/calitp-py/blob/main/calitp/sql.py#L153We should add colons to this list.
Things to check:
to_snakecase
? It would be desirable for Airtable but I'm not sure about other locations