MeltanoLabs / target-postgres

MIT License
11 stars 20 forks source link

feat: As a last resort, try inserting TEXT into a UUID column #387

Closed grigi closed 4 months ago

grigi commented 4 months ago

Have some data that is of type UUID where the source treats it as text.

Instead of failing with an exception:

NotImplementedError: Altering columns is not supported. Could not convert column 'public.users.uuid' from 'UUID' to 'TEXT'.

Just treat them as compatible.

I couldn't think of a different way to do this, as even though the Postgres integration supports UUID, not all do. And technically UUID isn't TEXT (just that it's represented as text in sql, so the DB handles it transparently), so it shouldn't be part of merge_sql_types.

edgarrmondragon commented 4 months ago

Hi @grigi, thanks for the PR!

Could you work around this by overriding the schema to something like:

plugins:
  extractors:
  - name: tap-example
    schema:
      my_stream:
        my_uuid_column:
          type: ["string", "null"]
          format: uuid

The UUID format is already handled:

https://github.com/MeltanoLabs/target-postgres/blob/fd881a638fd39649f570ee9d24e6db3531b741e2/target_postgres/connector.py#L290

nb This is if you're using Meltano, otherwise you could pass a custom catalog file.

grigi commented 4 months ago

Thanks for the rapid response. I've tried that (using meltano, yes) but I can't get it to work. It updates the schema generated in the catalog cache .meltano/run/tap-mysql/tap.properties.json, but when I invoke the extractor it overrides the schema in the output.

Manually updating the discovered catalog and providing it, it also gets ignored. I'm not sure if this is a bug in meltano or pipelinewise-tap-mysql.

There's a high chance that I'm missing something :shrug:

edgarrmondragon commented 4 months ago

I think the change here can be problematic for the general use case, since a non-uuid text value can really break pipelines if the target column is UUID.

That said, we could move the add a new user setting to check right before https://github.com/MeltanoLabs/target-postgres/blob/fd881a638fd39649f570ee9d24e6db3531b741e2/target_postgres/connector.py#L431-L438

if we should even attempt to adapt column types to the schema. That way, we'd skip this problem entirely if the user considers mismatches are acceptable.

Something like check_column_types or similar, with a description like Fail the sync if the incoming stream schema doesn't match the type of an existing column.

Wdyt?

cc @visch


That said, the fact that the tap doesn't support schema overrides is bad so I dug a bit and submitted a PR https://github.com/transferwise/pipelinewise-tap-mysql/pull/186.

I did test it a bit but let's hope I didn't do something terrible and the Wise folks accept it 😅.

In the meantime it can be tested by updating the pip_url. This is the project I validated these changes with:

plugins:
  extractors:
  - name: tap-mysql
    variant: transferwise
    pip_url: "'pipelinewise-tap-mysql @ git+https://github.com/edgarrmondragon/pipelinewise-tap-mysql.git@allow-schema-overrides'"
    config:
      user: root
      host: 'localhost'
      database: mysql
      filter_dbs: mysql
    select:
    - "mysql-tests.*"
    schema:
      mysql-tests:
        my_uuid:
          type: [string, "null"]
          format: uuid
          inclusion: available
grigi commented 4 months ago

Ooh thanks for that. That PR makes it all work for me :-) It's certainly the more correct solution.

visch commented 4 months ago

I think the change here can be problematic for the general use case, since a non-uuid text value can really break pipelines if the target column is UUID.

That said, we could move the add a new user setting to check right before

https://github.com/MeltanoLabs/target-postgres/blob/fd881a638fd39649f570ee9d24e6db3531b741e2/target_postgres/connector.py#L431-L438

if we should even attempt to adapt column types to the schema. That way, we'd skip this problem entirely if the user considers mismatches are acceptable.

Something like check_column_types or similar, with a description like Fail the sync if the incoming stream schema doesn't match the type of an existing column.

Wdyt?

cc @visch

That said, the fact that the tap doesn't support schema overrides is bad so I dug a bit and submitted a PR transferwise/pipelinewise-tap-mysql#186.

I did test it a bit but let's hope I didn't do something terrible and the Wise folks accept it 😅.

In the meantime it can be tested by updating the pip_url. This is the project I validated these changes with:

plugins:
  extractors:
  - name: tap-mysql
    variant: transferwise
    pip_url: "'pipelinewise-tap-mysql @ git+https://github.com/edgarrmondragon/pipelinewise-tap-mysql.git@allow-schema-overrides'"
    config:
      user: root
      host: 'localhost'
      database: mysql
      filter_dbs: mysql
    select:
    - "mysql-tests.*"
    schema:
      mysql-tests:
        my_uuid:
          type: [string, "null"]
          format: uuid
          inclusion: available

That seems like a reasonable setting idea! Gives folks a way to allow postgres to just figure it out for you which will work in a lot of cases.

I think there's a whole flow with the other targets around different data types https://www.stitchdata.com/docs/replication/loading/understanding-table-structural-changes#columns-mixed-data-types that we haven't gone down but we could.

I kind of like the current method as it's explicit and instead of making columns you don't expect it just fails and then you have to create a solution (we could do a much better job explaining what's going on when it fails for this reason) but it all depends on the use case people have.

grigi commented 4 months ago

I see your PR got closed without comment. That's sad. But thanks for it. I'm going to keep on using your branch until we decommission our last MySQL servers. (Hopefully by end of the year).

So, considering that, I feel that this PR can be closed.

Thanks for your assistance!