GreenmaskIO / greenmask

PostgreSQL database anonymization and synthetic data generation tool
https://greenmask.io
Apache License 2.0
1.07k stars 18 forks source link

Dump not applying transformation with no error emitted #176

Closed wnatrifork closed 2 months ago

wnatrifork commented 2 months ago

It seems possible to encounter a case when despite a transformation is defined it is not applied and no error is emitted. Shouldn't user be warned if a transformer is not being applied?

greenmask_config.yml file:

dump:
  pg_dump_options:
    dbname: host=localhost user=wna dbname=bireporting
    load-via-partition-root: true
storage:
  type: directory
  directory:
    path: /Users/wna/Developer/
  transformation:
    - schema: public
      name: backlog_entry_value
      transformers:
        - name: RandomFloat
          params:
            column: "value"
            min: 0.0
            max: 10.0
            precision: 2
./greenmask dump --config=greenmask_config.yml
./greenmask list-dumps --config=greenmask_config.yml
+---------------+---------------------------+-------------+----------+-----------------+----------+-------------+-------------------+
|      ID       |           DATE            |  DATABASE   |   SIZE   | COMPRESSED SIZE | DURATION | TRANSFORMED |      STATUS       |
+---------------+---------------------------+-------------+----------+-----------------+----------+-------------+-------------------+
| 1724488905081 | 2024-08-24T10:41:45+02:00 | bireporting | 26.5 MiB | 8.5 MiB         | 00:00:00 | false       | done              |
+---------------+---------------------------+-------------+----------+-----------------+----------+-------------+-------------------+

A part of metadata.json file. Notice transformers is null.

{
    "startedAt": "2024-08-24T10:41:45.081384+02:00",
    "completedAt": "2024-08-24T10:41:45.816925+02:00",
    "originalSize": 27828141,
    "compressedSize": 8912949,
    "transformers": null,
    "database_schema":
    [
        {
            "schema": "public",
            "name": "backlog_entry_value",
            "oid": 16409,
            "columns":
            [
                {
                    "idx": 2,
                    "name": "value",
                    "type_name": "numeric(10,2)",
                    "canonical_type_name": "numeric",
                    "type_oid": 1700,
                    "num": 3,
                    "not_null": false,
                    "length": 655366,
                    "type_length": -1,
                    "is_generated": false,
                    "overridden_type_name": "",
                    "overridden_type_oid": 0,
                    "overridden_type_size": 0
                },
    [...]

I was not able to find the root cause of the issue in reasonable time.

wwoytenko commented 2 months ago

Hi! Thank you for reporting. You've set the transformation in the wrong object (in storage) instead of dump. It should be this way

dump:
  pg_dump_options:
    dbname: host=localhost user=wna dbname=bireporting
    load-via-partition-root: true
  transformation:
    - schema: public
      name: backlog_entry_value
      transformers:
        - name: RandomFloat
          params:
            column: "value"
            min: 0.0
            max: 10.0
            precision: 2
storage:
  type: directory
  directory:
    path: /Users/wna/Developer/

Try it

wwoytenko commented 2 months ago

However, I agree that it would be helpful to report this issue. I'll consider how to fix it.

wnatrifork commented 2 months ago

Thank you! Such a silly mistake.

Does this mean numeric(precision,scale) type is not supported?

2024-08-24T11:20:11+02:00 ERR ValidationWarning={"hash":"","meta":{"AllowedTypes":["float4","float8"],"ColumnName":"value","ParameterName":"column","SchemaName":"public","TableName":"backlog_entry_value","TransformerName":"RandomFloat","TypeName":"numeric(10,2)"},"msg":"unsupported column type","severity":"error"}
2024-08-24T11:20:11+02:00 FTL cannot make a backup error="context error: fatal validation error"
wwoytenko commented 2 months ago

Yes, it does but only the v0.2b1 version or the current development version in the main. For v0.2b1 version and current main you can use RandomNumeric transformer that has wider configuration options such as decimal.

For the latest version 1.14 you can try to override the type of the table in the config using the next

transformation:
    - schema: public
      name: backlog_entry_value
      columns_type_override:
        value: "numeric" 
      transformers:
        - name: RandomFloat
          params:
            column: "value"
            min: 0.0
            max: 10.0
            precision: 2

The details are here

wwoytenko commented 2 months ago

@wnatrifork FYI ^

wwoytenko commented 2 months ago

I've added a fix in MR #177. So in a new release if the unknown option was provided it will be reported.

wnatrifork commented 2 months ago

Thank you. I using a compiled version of the tool from main branch:

wna@Wojciechs-MBP: greenmask % ./greenmask --version
greenmask version a2ecc2cdae5af10e9443ccdab61987f31bd3ade6 2024-08-20T11:49:00Z

I have used random numeric transformer as advised:

  transformation:
    - schema: public
      name: backlog_entry_value
      transformers:
        - name: RandomNumeric
          params:
            column: "value"
            min: 0.0
            max: 10.0
            decimal: 2

and received following error when trying to restore:

2024-08-24T12:15:57+02:00 WRN unable to restore table error="error post streaming handling: error from postgres connection: numeric field overflow A field with precision 10, scale 2 must round to an absolute value less than 10^8. (code 22003)"

I converted the dump from directory format to human readable SQL and found:

--
-- Data for Name: "backlog_entry_value"; Type: TABLE DATA; Schema: "public"; Owner: "wna"
--

COPY "public"."backlog_entry_value" ("unique_id", "created", "value", "author_name", "author_email", "backlog_entry_id") FROM stdin
dc673f92-404d-4873-ba36-779da5947748    2020-05-25 08:22:27.179153  -55142080548242225025.74    <redacted_username> <redacted_email>    33cd6668-6dd7-4059-95a1-e5ae03027c04

value is not kept in <min, max> bounds. This seems to be a bug.

wwoytenko commented 2 months ago

@wnatrifork The MR #177 is merged. I hope this will fix this implicit issue

wwoytenko commented 2 months ago

Thank you for reporting. I will check shortly

wwoytenko commented 2 months ago

@wnatrifork That's definitely a bug. It will be fixed today/tomorrow.

wwoytenko commented 2 months ago

@wnatrifork MR #178 should close your issue

wwoytenko commented 2 months ago

@wnatrifork try now and report please

wnatrifork commented 2 months ago

I confirm RandomNumeric transformer keeps value within defined bounds. Thank you