darold / ora2pg

Ora2Pg is a free tool used to migrate an Oracle database to a PostgreSQL compatible schema. It connects to your Oracle database, scans it automatically and extracts its structure or data, then generates SQL scripts that you can load into PostgreSQL.
http://www.ora2pg.com/
GNU General Public License v3.0

UTF-8 encoding error #1559

Closed: skizilbash closed this issue 1 year ago

skizilbash commented 1 year ago

Hello,

Should I be concerned about the warning/error below?

[========================>] 1/1 tables (100.0%) end of scanning.
Unicode surrogate U+DBC0 is illegal in UTF-8 at /usr/lib64/perl5/IO/Handle.pm line 420.
Unicode surrogate U+DF72 is illegal in UTF-8 at /usr/lib64/perl5/IO/Handle.pm line 420.
Unicode surrogate U+DBC0 is illegal in UTF-8 at /usr/lib64/perl5/IO/Handle.pm line 420.
Unicode surrogate U+DF72 is illegal in UTF-8 at /usr/lib64/perl5/IO/Handle.pm line 420.
Unicode surrogate U+DBC0 is illegal in UTF-8 at /usr/lib64/perl5/IO/Handle.pm line 420.
Unicode surrogate U+DF72 is illegal in UTF-8 at /usr/lib64/perl5/IO/Handle.pm line 420.
[====>                    ] 4540000/22182760 rows (20.5%)

darold commented 1 year ago

Yes, you should use direct import to PostgreSQL; it could avoid this problem (see PG_DSN in ora2pg.conf).
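For reference, direct import is driven by the PostgreSQL connection directives in ora2pg.conf. A minimal sketch, with placeholder connection values, might look like this:

    PG_DSN    dbi:Pg:dbname=target_db;host=localhost;port=5432
    PG_USER   postgres
    PG_PWD    secret

With these set (and no output file requested), ora2pg should send the COPY stream straight to PostgreSQL instead of writing it to disk.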

skizilbash commented 1 year ago

This is not an import. I am extracting data from Oracle using ora2pg -t COPY -o data.sql

darold commented 1 year ago

Yes, this is what I had deduced: you are exporting data to files. My recommendation is to import the data directly into PostgreSQL to avoid writing to a file.
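Assuming PG_DSN, PG_USER and PG_PWD are set as sketched above, the same export could be run without the -o option so the rows are loaded directly into PostgreSQL, for example:

    ora2pg -c ora2pg.conf -t COPY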

skizilbash commented 1 year ago

OK, I guess I can try that. However, it did extract all the rows. I just wanted to know whether this type of error is just a warning and wouldn't corrupt the data in the table.

darold commented 1 year ago

You can compare the two approaches: dump the table to a file and import it into a PG table, versus a direct import into a PG table, to see whether this is just a warning or actual corruption.
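One way to run that comparison, sketched here with hypothetical table names (phy_t1_file loaded from the dump file, phy_t1_direct loaded via PG_DSN):

    -- row counts first
    SELECT count(*) FROM phy_t1_file;
    SELECT count(*) FROM phy_t1_direct;
    -- then a content checksum; fine on a sample or a modest table,
    -- for ~22M rows compare counts and spot-check suspect rows instead
    SELECT md5(string_agg(t::text, '' ORDER BY t::text)) FROM phy_t1_file AS t;
    SELECT md5(string_agg(t::text, '' ORDER BY t::text)) FROM phy_t1_direct AS t;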

skizilbash commented 1 year ago

As per your recommendation I imported it directly using PG_DSN, but that didn't help either. In fact, it imported far less data.

[========================>] 1/1 tables (100.0%) end of scanning.
DBD::Pg::db pg_putcopyend failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xaf 0x80
CONTEXT:  COPY phy_t1, line 1131 at /var/lib/pgsql/PERL_DBI_DBD/lib64/perl5/Ora2Pg.pm line 14716.
FATAL: ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xaf 0x80
CONTEXT:  COPY phy_t1, line 1131
DBI::db=HASH(0x562e6a6c2030)->disconnect invalidates 1 active statement handle (either destroy statement handles or call finish on them before disconnecting) at /var/lib/pgsql/PERL_DBI_DBD/lib64/perl5/Ora2Pg.pm line 13745.
Aborting export..

Changing client_encoding to 'latin1' didn't help either :(
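(For context, the client-side encoding ora2pg uses is controlled by the CLIENT_ENCODING directive in ora2pg.conf; the attempt above presumably amounted to something like the following.)

    CLIENT_ENCODING    LATIN1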

skizilbash commented 1 year ago

$ ora2pg -c ora2pg.conf -t SHOW_ENCODING
Current encoding settings that will be used by Ora2Pg:
    Oracle NLS_LANG AMERICAN_AMERICA.AL32UTF8
    Oracle NLS_NCHAR AL32UTF8
    Oracle NLS_TIMESTAMP_FORMAT YYYY-MM-DD HH24:MI:SS.FF
    Oracle NLS_DATE_FORMAT YYYY-MM-DD HH24:MI:SS
    PostgreSQL CLIENT_ENCODING UTF8
    Perl output encoding ''
Showing current Oracle encoding and possible PostgreSQL client encoding:
    Oracle NLS_LANG AMERICAN_AMERICA.AL32UTF8
    Oracle NLS_NCHAR AL32UTF8
    Oracle NLS_TIMESTAMP_FORMAT YYYY-MM-DD HH24:MI:SS.FF6
    Oracle NLS_DATE_FORMAT YYYY-MM-DD HH24:MI:SS
    PostgreSQL CLIENT_ENCODING UTF8

darold commented 1 year ago

I was thinking this could be a locale problem, but that is not the case. Then you have no choice: export the tables that have non-UTF-8 characters to files and use iconv to convert them to UTF-8. What does the command file table_output_file.sql return?

skizilbash commented 1 year ago

Below is the output. Can you share the iconv syntax, as I am using "psql dbname < ***_data.sql" to import the data?

$ file PHYCON_data.sql.bk
PHYCON_data.sql.bk: UTF-8 Unicode text, with very long lines

darold commented 1 year ago

Sorry, I missed the file command option: file -bi table_output_file.sql. It would be best to run the command on a part of the file that has such characters, not on the whole file.

A possible iconv command could be:

iconv -f utf-16le -t utf-8 -o out.sql PHYCON_data.sql.bk
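One way to check just the suspect part of the dump rather than the whole multi-gigabyte file, assuming the line number reported by the earlier COPY error (1131) lands near the offending data (COPY line numbers do not always map exactly onto file lines):

    sed -n '1125,1140p' PHYCON_data.sql.bk | file -bi -
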
skizilbash commented 1 year ago

$ file -bi PHYCON_data.sql
text/plain; charset=utf-8

I tried the above command, and also with the -c option, but I get this error.

$ iconv -f utf-16le -t utf-8 -o out.sql PHYCON_data.sql
iconv: illegal input sequence at position 5395369598
postgres@rp000086565:/var/lib/pgsql/Phycon_o2p/oracle_service_name/data $ iconv -f utf-16le -t utf-8 -o out.sql -c PHYCON_data.sql
iconv: incomplete character or shift sequence at end of buffer

skizilbash commented 1 year ago

@darold any recommendation on the above error?

skizilbash commented 1 year ago

I was able to fix it by running the below:

iconv -f utf-16le -t UTF-8//TRANSLIT out.sql -o PHYCON_data.sql

But now when I import I get an "out of memory" error. I've even bumped work_mem from 4M to 200M but still get the below. Any idea?

$ psql phycon_prd01 < PHYCON_data.sql
out of memory