dimitri / pgloader

Migrate to PostgreSQL in a single command!
http://pgloader.io
Other
5.36k stars 542 forks source link

Common Lisp trivial-utf-8 error: Invalid byte at start of character: 0xFC #1562

Open JosefMachytkaNetApp opened 7 months ago

JosefMachytkaNetApp commented 7 months ago

Hi guys, I have encounter a problem which seems to come from trivial-utl-8 lisp library. At least I cannot find this error message anywhere in pgloader code or in PostgreSQL code. Only in this lisp library - https://github.com/fukamachi/trivial-utf-8/blob/master/trivial-utf-8.lisp#L110 - but this code seems to be abandoned for 13 years.

I am using pgloader to copy data in parallel from foreign tables created by tds_fdw extension, into target PostgreSQL tables. In PostgreSQL all works as expected. I can select data from FDW table, I can do INSERT INTO target_table SELECT * FROM fdw_table, all works. But when I start pgloader to copy data from all tables, it fails on one particular table with an error message - "A thread failed with error: Invalid byte at start of character: 0xFC" - see details below.

Underlying data in source database are encoded in UTF-16. Tds_fdw and PostgreSQL process data without problems. But looks like lisp library fails. Maybe someone would have some idea what to do with it?

Thank you very much.


2024-02-23T08:34:24.097092Z ERROR A thread failed with error: Invalid byte at start of character: 0xFC
2024-02-23T08:34:24.101092Z ERROR Date/time: 2024-02-23-09:34!
An unhandled error condition has been signalled:
   Invalid byte at start of character: 0xFC

Backtrace for: #<SB-THREAD:THREAD "lparallel" RUNNING {10059F0073}>
0: (TRIVIAL-BACKTRACE:PRINT-BACKTRACE-TO-STREAM #<SB-IMPL::STRING-OUTPUT-STREAM {1008493303}>)
1: (TRIVIAL-BACKTRACE:PRINT-BACKTRACE #<CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR {10084930F3}> :OUTPUT NIL :IF-EXISTS :APPEND :VERBOSE NIL)
2: ((LAMBDA (CONDITION) :IN PGLOADER.LOAD:COPY-FROM) #<CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR {10084930F3}>)
3: (LPARALLEL.KERNEL::CONDITION-HANDLER #<CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR {10084930F3}>)
4: (SB-KERNEL::%SIGNAL #<CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR {10084930F3}>)
5: (ERROR CL-POSTGRES-TRIVIAL-UTF-8:UTF-8-DECODING-ERROR :BYTE 252 :MESSAGE "Invalid byte at start of character: 0x~X")
6: (CL-POSTGRES-TRIVIAL-UTF-8:READ-UTF-8-STRING #<unavailable argument> :NULL-TERMINATED #<unavailable argument> :STOP-AT-EOF #<unavailable argument> :CHAR-LENGTH #<unavailable argument> :BYTE-LENGTH #<unavailab
le argument>)
7: ((LAMBDA (#:G0 #:G1) :IN "/usr/share/common-lisp/source/postmodern/cl-postgres/interpret.lisp") #<SB-SYS:FD-STREAM for "socket, peer: /var/run/postgresql/.s.PGSQL.5432" {1007EED623}> 540)
8: ((LAMBDA (#:G39 PGLOADER.SOURCES:FIELDS) :IN PGLOADER.SOURCES:MAP-ROWS) #<SB-SYS:FD-STREAM for "socket, peer: /var/run/postgresql/.s.PGSQL.5432" {1007EED623}> #(#<CL-POSTGRES::FIELD-DESCRIPTION {1007F0AD63}> 
#<CL-POSTGRES::FIELD-DESCRIPTION {1007F0AF13}> #<CL-POSTGRES::FIELD-DESCRIPTION {1007F0B0C3}> #<CL-POSTGRES::FIELD-DESCRIPTION {1007F0B273}> #<CL-POSTGRES::FIELD-DESCRIPTION {1007F0B423}> #<CL-POSTGRES::FIELD-DE
SCRIPTION {1007F0B5D3}> #<CL-POSTGRES::FIELD-DESCRIPTION {1007F0B783}>))
9: (CL-POSTGRES::SEND-QUERY #<SB-SYS:FD-STREAM for "socket, peer: /var/run/postgresql/.s.PGSQL.5432" {1007EED623}> #<unavailable argument> #<unavailable argument>)
10: ((LABELS #:G4 :IN CL-POSTGRES:EXEC-QUERY))
11: ((:METHOD PGLOADER.SOURCES:MAP-ROWS (PGLOADER.SOURCE.PGSQL:COPY-PGSQL)) #<PGLOADER.SOURCE.PGSQL:COPY-PGSQL {1005D33833}> :PROCESS-ROW-FN #<FUNCTION (LAMBDA (PGLOADER.LOAD::ROW) :IN PGLOADER.LOAD:QUEUE-RAW-DA
TA) {1007EEBD0B}>) [fast-method]
svantevonerichsen6906 commented 7 months ago

Can't look deeper now, but trying to decode UTF-16 with UTF-8 would produce this problem, yes. Did you specify the --encoding option?

JosefMachytkaNetApp commented 7 months ago

@svantevonerichsen6906 Thank you for your comment, I tried it, but looks like in PostgreSQL to PostgreSQL mode pgloader does not allow this parameter. When I add it in command line, I get following messages and pgloader stops:

pgloader version 3.6.7~devel
compiled with SBCL 2.2.9.debian
sb-impl::*default-external-format* :UTF-8
tmpdir: #P"/tmp/pgloader/"
2024-02-23T12:39:52.004000Z NOTICE Starting pgloader, log system is ready.
2024-02-23T12:39:52.016001Z INFO Starting monitor
2024-02-23T12:39:52.024001Z LOG pgloader version "3.6.7~devel"
2024-02-23T12:39:52.024001Z INFO Stopping monitor

And if I try to add it into WITH part, pgloader complains on config file parsing:

2024-02-23T12:38:48.316014Z INFO Stopping monitor

What I am doing here?

At

          concurrency = 1,
          prefetch rows = 100,

                             ^ (Line 11, Column 27, Position 249)

In context PGSQL-OPTIONS:

While parsing PGSQL-OPTIONS. Expected:

     the character Tab
  or the character Newline
  or the character Return
  or the character Space
  or the string "--"
  or the string "/*"
  or the string "batch"
  or the string "concurrency"
  or the string "create"
  or the string "data"
  or the string "disable"
  or the string "downcase"
  or the string "drop"
  or the string "foreign"
  or the string "include"
  or the string "max"
  or the string "no"
  or the string "on"
  or the string "prefetch"
  or the string "preserve"
  or the string "quote"
  or the string "reset"
  or the string "schema"
  or the string "snake_case"
  or the string "truncate"
  or the string "uniquify"
  or the string "workers"

I showed only last 2 lines from WITH part which it was able to parse, next one is "encoding" and it fails on it because it is not among expected strings.

So unfortunately it looks like "encoding" cannot help here.

BTW, I used FreeTDS library with "client charset = UTF-8" so it should convert data to UTF-8 - it had been discussed already on StackOverflow. So it is surprising for me that Lisp seems to fail. Do you know if ODBC would work better here?