duckdb / postgres_scanner

https://duckdb.org/docs/extensions/postgres
MIT License
227 stars 36 forks source link

Fix copy to postgres for lists and structs in text mode. #218

Closed mima-hlavacek closed 5 months ago

mima-hlavacek commented 5 months ago

Sorry for the uncalled for PR, but I had a patch we were using, so I figured I could offer it. Bare in mind that I'm not a C++ person, so it might be completely misguided :D.

This PR changes the way how lists and structs are copied to postgres when the text mode is active. Previously, no escaping was performed on the members of lists and structs. This meant that whenever we were trying to copy a value which contained a character used to encode either structs or lists, we'd end up with invalid literals. This made it impossible to copy:

  1. Lists containing the empty string
  2. Lists containing whitespace-only string
  3. Lists containing strings containing curly braces, commas or quotes
  4. Structs containing strings containing braces, commas or quotes
  5. Combinations of all of the above

Consider the following (cursed!) motivational example

CREATE TYPE my_inner_composite_type as (
    id INT,
    names varchar[]
);

CREATE TYPE my_composite_type AS (
    id INT,
    children my_inner_composite_type[]
);

select (
    10,
    array[
        array[
            (
                20,
                array['a', 'b']::varchar[]
            ),
            (
                30,
                null
            )
        ],
        array[
            (
                40,
                array[
                    array['}'],
                    array['"'],
                    array[null]
                ]::varchar[]
            ),
            (
                50,
                array[
                    array[''],
                    array['a'],
                    array['(']
                ]::varchar[]
            )
        ]
    ]::my_inner_composite_type[]
)::my_composite_type
as "Postgres what are you doing?"
;
           Postgres what are you doing?
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 (10,"{{""(20,\\""{a,b}\\"")"",""(30,)""},{""(40,\\""{{\\""\\""}\\""\\""},{\\""\\""\\\\\\\\\\""\\""\\""\\""},{NULL}}\\"")"",""(50,\\""{{\\""\\""\\""\\""},{a},{(}}\\"")""}}")
(1 row)

Turns out postgres literals are relatively simple if thought about this way:

  1. You encode lists as {1, 2, 3}
  2. You encode structs as (1, 2, 3)
  3. You can quote all problematic members of collections inside double quotes.
    • e.g. {a, "a,a,a,", b} contains 3 elements: a, a,a,a, and b
  4. you have to escape the " character inside quoted strings using a backshlash \", and escape backslashes using \\
    • e.g. {a, "b\"c"} contains two elements: a and b"c.
  5. Rules 1.-4. can be recursively applied to nested collections
    • Say we want to produce literal for [{'id': 10, 'children': ['a', '', '}', 'b']}, null]
    • If we first deal with the children list, we get [{'id': 10, 'children': {a, "", "}", b}}, null]
    • Now, to serialize the struct we need to quote the child array so that we're not disturbed by the excessive commas. We can simply take the serialized value, surround it with quotes and escape the quotes inside it to get [(10, "{a, \"\", \"}\", b}"), null]
    • Finally, we need to deal with the outer list. Again, we can take the serialized values as they are, quote them if needed and apply escaping to \ and " inside. We end up with {"(10, \"{a, \\\"\\\", \\\"}\\\", b}\")", null}

This PR pretty much follows the worked example from 5. Instead of tracking the recursion depth during serialization, the methods responsible for serializing structs and lists perform their own escaping. To combat the exponential growth of escaping backslashes a detection method is included which checks if the value needs to be quoted (= is empty or contains problematic characters).

Mytherin commented 5 months ago

Thanks!