Qovery / Replibyte

Seed your development database with real data ⚡️
https://www.replibyte.com
GNU General Public License v3.0

unterminated string literal #232

Open gumaerc opened 2 years ago

gumaerc commented 2 years ago

Hi there. I'm attempting to use Replibyte to replicate our production database to our test environment. The database is attached to a custom-built CMS web application, and most of the data lives in text fields that store Markdown. Running a Replibyte dump against this database produces the following error output:

selected worker: ccb50a8c4be4
⠒ [00:02:22] [########...########>-----------------------] 94.92MiB/100.00MiB (8s)
failing query: '
INSERT INTO public.websites_websitecontent (id, created_on, updated_on, text_id, title, type, markdown, metadata, parent_id, website_id, owner_id, file, updated_by_id, is_page_content, deleted, filename, dirpath) VALUES (440899, '2021-09-02 20:16:09.808962+00', '2022-06-15 17:17:15.404312+00', '8c753cd8-ca5b-bd65-9763-8c3695024fca', 'X86-64 Architecture Guide', 'page', 'For the code-generation project, we expect your compiler to produce simple assembly code. We shall expose you to a subset of the x86-64 platform.

## Example

Consider the following Decaf program:

```plaintext
class Program {
    int foo(int x) {
        return x + 3;
    }
    void main() {
        int y;
        y = foo(callout("get_int_035"));
        if (y == 15) {
            callout("printf", "Indeed! \''tis 15!\n");'
thread 'main' panicked at 'TokenizerError { message: "Unterminated string literal", line: 2, col: 82 }', dump-parser/src/postgres/mod.rs:824:13
stack backtrace:
   0:     0x7f43255cdf8d - std::backtrace_rs::backtrace::libunwind::trace::h081201764674ef17
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x7f43255cdf8d - std::backtrace_rs::backtrace::trace_unsynchronized::hebab37398c391bd7
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x7f43255cdf8d - std::sys_common::backtrace::_print_fmt::h301516df68ed24f9
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:66:5
   3:     0x7f43255cdf8d - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h8f5170f4f03a12c0
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:45:22
   4:     0x7f4325619a6c - core::fmt::write::h5dc5601e8d9f6367
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/fmt/mod.rs:1190:17
   5:     0x7f43255c5bf8 - std::io::Write::write_fmt::h5b19302eb99d9acf
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/io/mod.rs:1657:15
   6:     0x7f43255d0497 - std::sys_common::backtrace::_print::hd81cf53a75c8ae6a
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:48:5
   7:     0x7f43255d0497 - std::sys_common::backtrace::print::hb5aa882e87c2a0dc
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:35:9
   8:     0x7f43255d0497 - std::panicking::default_hook::{{closure}}::had913369af61b326
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:295:22
   9:     0x7f43255d0160 - std::panicking::default_hook::h37b06af9ee965447
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:314:9
  10:     0x7f43255d0be9 - std::panicking::rust_panic_with_hook::hf2019958d21362cc
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:698:17
  11:     0x7f43255d08d7 - std::panicking::begin_panic_handler::{{closure}}::he9c06fdd592f8785
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:588:13
  12:     0x7f43255ce454 - std::sys_common::backtrace::__rust_end_short_backtrace::ha521b96560789310
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/sys_common/backtrace.rs:138:18
  13:     0x7f43255d05e9 - rust_begin_unwind
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:584:5
  14:     0x7f4324737b93 - core::panicking::panic_fmt::h28f1697d4e9394b4
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/panicking.rs:143:14
  15:     0x7f43250442ae - dump_parser::postgres::get_tokens_from_query_str::h65e24e4f8cbc8f6b
  16:     0x7f4324812577 - replibyte::source::postgres::read_and_transform::{{closure}}::h21becf577a0258c6
  17:     0x7f432480e296 - dump_parser::utils::list_sql_queries_from_dump_reader::ha5c4f9f018ef48d9
  18:     0x7f43247dd7d7 - <replibyte::tasks::full_dump::FullDumpTask<S> as replibyte::tasks::Task>::run::hbe436c3f9a64d91e
  19:     0x7f4324786b26 - replibyte::commands::dump::run::h2c7a4f7f319ea54d
  20:     0x7f432484959c - replibyte::main::hf5aea629140e0d3d
  21:     0x7f432477b433 - std::sys_common::backtrace::__rust_begin_short_backtrace::hd939a912bfcc1f10
  22:     0x7f4324801ad9 - std::rt::lang_start::{{closure}}::h7bad7a965e7c4d15
  23:     0x7f43255cd6e4 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::hd127f27863548251
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/ops/function.rs:259:13
  24:     0x7f43255cd6e4 - std::panicking::try::do_call::h926290883a1d024e
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:492:40
  25:     0x7f43255cd6e4 - std::panicking::try::hc74a3d1f4a4b6e5f
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:456:19
  26:     0x7f43255cd6e4 - std::panic::catch_unwind::h5eb7ded2df1a4d5f
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panic.rs:137:14
  27:     0x7f43255cd6e4 - std::rt::lang_start_internal::{{closure}}::h0736f9682f7c55ea
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/rt.rs:128:48
  28:     0x7f43255cd6e4 - std::panicking::try::do_call::h2772c479b1c89ef7
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:492:40
  29:     0x7f43255cd6e4 - std::panicking::try::h967ebbc371287391
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:456:19
  30:     0x7f43255cd6e4 - std::panic::catch_unwind::h41bcc02b28316856
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panic.rs:137:14
  31:     0x7f43255cd6e4 - std::rt::lang_start_internal::haf46799f55774d07
                               at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/rt.rs:128:20
  32:     0x7f4324874b22 - main

This data is a Markdown code block containing a Decaf (Java-like) example. The offending line contains callout("printf", "Indeed! \'tis 15!\n");. When I remove the backslash before the single quote and save the data, the Replibyte dump completes successfully. This is valid data for a text field, so I'm not sure why it fails. Any ideas?
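One plausible reading of this failure, not confirmed against Replibyte's source, is a mismatch in escape rules. With standard_conforming_strings on (the PostgreSQL default since 9.1), a backslash inside an ordinary '...' literal is a plain character and a quote is escaped by doubling it, which is why the stored \'tis appears as \''tis in the failing query. A tokenizer that treats the backslash as an escape character instead would close the literal one quote early and leave a later quote unmatched. The sketch below is illustrative Python, not Replibyte's Rust tokenizer; `scan_literals` and its `backslash_escapes` flag are hypothetical names.

```python
def scan_literals(sql: str, backslash_escapes: bool) -> list[str]:
    """Return the decoded '...' literals found in `sql`.

    Raises ValueError if a literal is still open at end of input.
    `backslash_escapes` switches between the two escape rules being compared.
    """
    literals = []
    i, n = 0, len(sql)
    while i < n:
        if sql[i] != "'":
            i += 1
            continue
        i += 1  # skip the opening quote
        chars = []
        while True:
            if i >= n:
                raise ValueError("Unterminated string literal")
            ch = sql[i]
            if backslash_escapes and ch == "\\" and i + 1 < n:
                # C-style rule: the backslash escapes the next character.
                chars.append(sql[i + 1])
                i += 2
            elif ch == "'" and i + 1 < n and sql[i + 1] == "'":
                # SQL-standard rule: a doubled quote is one literal quote.
                chars.append("'")
                i += 2
            elif ch == "'":
                i += 1  # closing quote
                break
            else:
                chars.append(ch)
                i += 1
        literals.append("".join(chars))
    return literals


# Fragment shaped like the failing query in this report.
sql = r"VALUES ('Indeed! \''tis 15!')"

# Standard rule: one literal whose value is  Indeed! \'tis 15!
print(scan_literals(sql, backslash_escapes=False))

# Backslash rule: the literal closes early and a later quote never closes.
try:
    scan_literals(sql, backslash_escapes=True)
except ValueError as err:
    print(err)
```

Under the standard rule the scan finds a single literal; under the backslash rule it raises the same kind of "Unterminated string literal" error shown in the panic message.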

evoxmusic commented 2 years ago

Hi @gumaerc, can you provide the complete query that makes Replibyte fail? Then I can reproduce it and fix the parser. Thank you 🙏🏽

gumaerc commented 2 years ago

@evoxmusic I'll have to do some setup to reproduce this, as we've pivoted to simply using pg_dump / pg_restore for the time being. We run jobs like this in Concourse, a CI/CD runner similar to GitHub Actions that runs everything in Docker containers and discards them after the build, so I don't have any saved logs.

I may open a separate issue for this, but another reason we ultimately abandoned Replibyte is a severe memory leak. I was testing all of this with a locally running instance of Concourse. I manually adjusted the data to remove the escape characters that were tripping up Replibyte, and when I ran the dump again it got to the second 100MB chunk, then RAM usage skyrocketed and consumed all 32GB in under a minute. I'm not sure how best to provide debug output for that scenario. I can maybe look into getting a copy of our database dump with PII stripped out for debugging. We'd still like to use Replibyte if we can, since we want to hook in and subset the data for local development, but these issues got in the way, unfortunately.

gugacavalieri commented 2 years ago

This happened to me as well. I had to search the DB to find the bad data, but it would be awesome if the parser could be fixed to handle this correctly.

I can provide some test inputs (this might be related to the problem, which involves single quotes and backslashes in the database):

SQL Dump (This is a valid output from pg_dump):

CREATE TABLE public."Attachments" (
    id uuid NOT NULL,
    text character varying(255) NOT NULL
);

INSERT INTO public."Attachments" (id, text) VALUES ('8e6c5d2b-0b93-4152-a1d2-2f339ae16aab', 'this should not break');
INSERT INTO public."Attachments" (id, text) VALUES ('8e6c5d2b-0b93-4152-a1d2-2f339ae16aac', 'this should break if we escape a quote like this child\''s');

Replibyte config file:

source:
  transformers:
    - database: public
      table: Attachments
      columns:
        - name: url
          transformer_name: random

datastore:
  local_disk:
    dir: ./storage

Then if I run:

cat test.sql | replibyte -c test-replibyte.yaml dump create -i -s postgresql

I get the output:

Dump created successfully!

But the sanitized file doesn't have the last row:

replibyte -c test-replibyte.yaml dump restore local -i postgres -v latest -o > sanitized-dump.sql

So, I think the parser is not working properly when we have single quotes escaped with backslashes: \'. When I removed the backslash from the database everything started working correctly.
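Until the parser handles this case, the affected rows can be located in a dump before running Replibyte. The sketch below is a minimal, hypothetical helper (`find_suspect_lines` is not part of Replibyte or pg_dump) that flags lines containing a backslash directly before a single quote, the pattern reported in this thread. It is a plain text search, so it assumes one statement per line and may also flag harmless matches.

```python
import re

# A backslash directly followed by a single quote.
SUSPECT = re.compile(r"\\'")


def find_suspect_lines(dump_text: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that contain a backslash-quote sequence."""
    hits = []
    for lineno, line in enumerate(dump_text.splitlines(), start=1):
        if SUSPECT.search(line):
            hits.append((lineno, line))
    return hits


# The two INSERT statements from the test input above.
dump = r"""INSERT INTO public."Attachments" (id, text) VALUES ('8e6c5d2b-0b93-4152-a1d2-2f339ae16aab', 'this should not break');
INSERT INTO public."Attachments" (id, text) VALUES ('8e6c5d2b-0b93-4152-a1d2-2f339ae16aac', 'this should break if we escape a quote like this child\''s');"""

for lineno, line in find_suspect_lines(dump):
    print(lineno, line)
```

Only the second INSERT is reported, which is the row missing from the sanitized file.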

gugacavalieri commented 2 years ago

Input where I can reproduce the error from this issue (Also came from pg_dump but I have changed some values):

CREATE TABLE public."Attachments" (
    id uuid NOT NULL,
    text character varying(255) NOT NULL
);

INSERT INTO public."Attachments" (id, text) VALUES (1, 'this should not break');

INSERT INTO public."Attachments" (id, text) VALUES (2, 'Chris and his beautiful bride Laura.\');

INSERT INTO public."Attachments" (id, text) VALUES (3, 'Whatever! :)');

INSERT INTO public."Attachments" (id, text) VALUES (4, 'Wow what a game ;)');

cat test.sql | env RUST_BACKTRACE=1 replibyte -c test-replibyte.yaml dump create -i -s postgresql

failing query: 'INSERT INTO public."Attachments" (id, text) VALUES (3552, 'Chris and his beautiful bride Laura.\');

INSERT INTO public."Attachments" (id, text) VALUES (3588, 'Whatever! :)');

INSERT INTO public."Attachments" (id, text) VALUES (3598, 'Wow what a game ;'
thread 'main' panicked at 'TokenizerError { message: "Unterminated string literal", line: 5, col: 22 }', dump-parser/src/postgres/mod.rs:790:13
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: dump_parser::postgres::get_tokens_from_query_str
   3: replibyte::source::postgres::read_and_transform::{{closure}}
   4: dump_parser::utils::list_sql_queries_from_dump_reader
   5: replibyte::source::postgres::read_and_transform
   6: <replibyte::source::postgres_stdin::PostgresStdin as replibyte::source::Source>::read
   7: <replibyte::tasks::full_dump::FullDumpTask<S> as replibyte::tasks::Task>::run
   8: replibyte::commands::dump::run
   9: replibyte::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

If I remove the \ or the ; or the :) from the input it works correctly ...
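The boundaries of the "failing query" printed above are consistent with a statement splitter that treats \' as an escaped quote, although that is an assumption and not something confirmed from Replibyte's source. The sketch below is illustrative Python with hypothetical names (`split_statements`, `backslash_escapes`): it splits a dump on semicolons that fall outside '...' literals and compares the two escape rules on the three INSERT statements from this input.

```python
def split_statements(dump: str, backslash_escapes: bool) -> list[str]:
    """Split `dump` on semicolons that are outside '...' string literals."""
    statements, current = [], []
    in_string = False
    i, n = 0, len(dump)
    while i < n:
        ch = dump[i]
        if in_string and backslash_escapes and ch == "\\" and i + 1 < n:
            # C-style rule: keep the escaped pair together, so \' does not close the literal.
            current.append(dump[i:i + 2])
            i += 2
            continue
        if ch == "'":
            # Toggling on every quote also handles '' (close, then reopen).
            in_string = not in_string
        current.append(ch)
        i += 1
        if ch == ";" and not in_string:
            statements.append("".join(current).strip())
            current = []
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements


dump = r"""INSERT INTO public."Attachments" (id, text) VALUES (2, 'Chris and his beautiful bride Laura.\');

INSERT INTO public."Attachments" (id, text) VALUES (3, 'Whatever! :)');

INSERT INTO public."Attachments" (id, text) VALUES (4, 'Wow what a game ;)');"""

# Standard rule: three separate statements.
print(len(split_statements(dump, backslash_escapes=False)))

# Backslash rule: the first chunk runs on and stops at the ';' inside ';)'.
print(split_statements(dump, backslash_escapes=True)[0])
```

With the backslash rule, the first chunk swallows the following statements and ends at 'Wow what a game ;', the same place the logged failing query stops. The sketch does not explain why removing the :) also avoids the error.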

michalkutrzeba-odrabiamy commented 1 year ago

Hey guys, it looks like Replibyte has an issue with escaped apostrophes. I've made a fix for that; maybe it also fixes your problems.

https://github.com/Qovery/Replibyte/pull/259