dimitri / pgloader

Migrate to PostgreSQL in a single command!
http://pgloader.io
Other
5.37k stars 543 forks source link

ERROR Illegal :UTF-8 character needs more reporting, please #1250

Open kjohnsonecl opened 3 years ago

kjohnsonecl commented 3 years ago

I think it would be helpful if pgloader could provide primary key information on the record that triggers the UTF-8 error.

I am trying to convert a Bacula database from sqlite to postgresql using pgloader. pgloader seems like an amazing tool which I need as I am not a database specialist, more of a system and network administrator generalist. Because of this, I am probably overlooking ways of doing things directly with the database tools, and instead, using the tools I know.

I have been loading the tables one at a time and resolving issues in each one as I go, for the most part. (There are several tables with issues around timestamp casting that I have put off to the end.)

The filename table has two columns: a filename ID, and a filename. When I try to load that table, I encounter the "ERROR Illegal :UTF-8 character starting at position 34." error message, and table loading stops. At least, I think that causes table loading to stop. That is the only "ERROR" reported in the output file.

My thinking was that if I could find the record that triggers the UTF-8 error, I could figure out a solution to this problem. Or at least make progress. I tried several ways of examining the .sql file used to build the bacula.db file used for conversion (See first bash script below), but without useful results. Then I tried turning on pgloader -debug. That didn't help directly, but about that time I realized that perhaps I could figure something out by looking at the records that did get loaded (about 45000) vs. the ones that did not (about 400000).

The output from pg_dump lists the records from the filename table in filename ID order. This matches the order in the bacula.sql file, so it seemed not stupid to think that the next record in filename ID (46735) order was the problematic one. I deleted that record from the .sql file and rebuilt the db file using the first of the shell scripts. Then I re-ran pgloader with the second shell script. Same error.

I am willing to give up some assumptions, but I do not have a good idea of where to start. Hence, if the UTF-8 error message could come with more data...

Script used to modify and prepare database for conversion

#!/bin/bash
# Process bacula sqlite database before attempting sqlite -> postgreql conversion
# using pgloader.  Assumes bacula.db-ref in current working directory.
#
# 08 Jan 2021 - initial version
# 13 Jan 2021 - must use sqlite3, not sqlite!
#             - Test bacula.db-ref exists
#             - add missing 'date' commands start and finish
#             - remove 'Copy' field from JobMedia table
# 14 Jan 2021 - find out if the Korean (?) filename is
#               the troublesome one. (It was not.)
# 15 Jan 2021 - Try deleting the filename record right
#             - after the last one to load.  Does not help.

if [[ ! -f ./bacula.db-ref ]] ; then
    echo '** process_db ** File "bacula.db-ref" not found.'
    exit
fi

date
echo "** process_db ** remove old files"
time rm -vf bacula.sql-ref bacula.sql bacula.db

date
echo "** process_db ** dump db file to .sql"
time nice ionice echo ".dump" | sqlite3 bacula.db-ref >bacula.sql-ref

date
echo "** process_db ** modify .sql file"
#
# Table PathVisibility: types int4, int8
# Table RestoreObject: fieldname  ObejctCompression
# Table Snapshot: type TINYBLOB
# Table JobMedia: remove field "Copy"
# Table Filename: troublesome UTF-8 Error

time nice ionice sed "s/ int4 / integer /g" bacula.sql-ref | \
sed "s/ int8 / bigint /g" | \
sed "s/  ObejctCompression /  ObjectCompression /g" | \
sed "s/ TINYBLOB / BLOB /g" | \
sed "/INSERT INTO JobMedia/ s/,0);/);/g" | \
sed "/ Copy     INTEGER UNSIGNED/ d" |\
sed "/INSERT INTO Filename VALUES(46735,/ d" \
 >bacula.sql

date
echo "** process_db ** load bacula.db from modified .sql file"
time nice ionice sqlite3 bacula.db <bacula.sql
date

Script used to run pgloader

#!/bin/bash
# repeatable tries at bacula database conversion
#
# 07 Jan 2021 -- add call to modified make_postgresql_tables
#             -- Add echo labels to different sections
#             -- Switch from -d (debug) to -v (verbose)
# 15 Jan 2021 -- Try -d to track down UTF-8 error.
#             -- Switch from -d (debug) to -v (verbose)
date
echo "** KLJ ** Drop old database"
time nice ionice dropdb --if-exists newpg
date
echo "** KLJ ** Create newpg"
time nice ionice createdb newpg
date
echo "** KLJ ** make_postgresql_tables"
time nice ionice ./make_postgresql_tables
date
echo "** KLJ ** run pgloader"
time nice ionice pgloader -v pg.load
date

pg.load file used to attempt conversion:

--
-- EDIT THIS FILE TO MATCH YOUR BUG REPORT
-- comments have been stripped

load database
     from sqlite:///home/kjohnson/bacula.db
     into postgresql://root:root@localhost:5433/newpg
     with include no drop, create no tables, create no indexes,
       reset sequences, preserve index names, no foreign keys, data only

     including only table names like 'filename'
     excluding table names like 'version', 'status', 'nextid', 'jobmedia'
     set work_mem to '1GB', maintenance_work_mem to '4GB'
;
...
2021-01-15T18:39:04.419000Z NOTICE COPY filename
2021-01-15T18:39:04.486000Z ERROR Illegal :UTF-8 character starting at position 34.
2021-01-15T18:39:04.755000Z NOTICE Completing PostgreSQL database.
2021-01-15T18:39:04.755000Z NOTICE Reset sequences
2021-01-15T18:39:04.960000Z LOG report summary reset
             table name     errors       read   imported      bytes      total time       read      write
-----------------------  ---------  ---------  ---------  ---------  --------------  ---------  ---------
                  fetch          0          0          0                     0.000s
        fetch meta data          0          1          1                     0.030s
-----------------------  ---------  ---------  ---------  ---------  --------------  ---------  ---------
               filename          1      46245      46245  1009.7 kB          0.322s     0.067s     0.150s
-----------------------  ---------  ---------  ---------  ---------  --------------  ---------  ---------
COPY Threads Completion          0          4          4                     0.322s
        Reset Sequences          0          1          1                     0.139s
       Install Comments          0          0          0                     0.000s
-----------------------  ---------  ---------  ---------  ---------  --------------  ---------  ---------
      Total import time          1      46245      46245  1009.7 kB          0.461s

From pg_dump:

...
46730   Global Positioning System.HTM
46731   Learn more about the hyperlink database.htm
46732   menu bar.htm
46733   mod-Flight Recorder Set XX-XXX-XX (V)1.htm
\.

from bacula.sql-ref:

...
INSERT INTO Filename VALUES(46730,'Global Positioning System.HTM');
INSERT INTO Filename VALUES(46731,'Learn more about the hyperlink database.htm');
INSERT INTO Filename VALUES(46732,'menu bar.htm');
INSERT INTO Filename VALUES(46733,'mod-Flight Recorder Set XX-XX-XX (V)1.htm');

INSERT INTO Filename VALUES(46735,'Pic MEDB Right Interior Lower.jpg');
INSERT INTO Filename VALUES(46736,'pic 2 - Propeller Synchrophaser Control.jpg');
...
  Expected roughly 400000 additional records.
kjohnsonecl commented 3 years ago

Well, now I feel ignorant and stupid, because just dumping in plain text (original issue) comes out like I am SHOUTING the bits that were comments in shell scripts. argh.

kjohnsonecl commented 3 years ago

Sometimes you need to step outside your comfort zone.

grep won't display the offending line in the .sql file, but...

mumble# sqlite3 bacula.db
sqlite> SELECT * FROM Filename WHERE FilenameId='46734';
46734|Pic - Pilot?s Interconnecting Box �(105A15).htm

So, one way or another I can solve the problem with that record and then move on to find any more lurking in that table.

I still think it would be a plus for the message to have more info.