dimitri / pgloader

Migrate to PostgreSQL in a single command!
http://pgloader.io

Not clear how to log bad rows in input CSV files without skipping rest of input file #855

Open basiliosz opened 6 years ago

basiliosz commented 6 years ago

I'm creating a pgloader command file to import a set of CSV files supplied by another team. Loading well-formed files works fine, but our input files might be badly formatted. Ideally I'd like to skip and log all bad rows in my input files; the documentation says:

To enable retrying the data and loading the good parts, use the option on error resume next, which is the default to file based data loads (such as CSV, IXF or DBF).

but I'm not quite able to get it to work.

This is my command file:

LOAD CSV
  FROM
    ALL FILENAMES MATCHING ~/{{{FILE_REGEXP}}}/
    IN DIRECTORY '{{STAGING_PATH}}' 
    WITH ENCODING UTF8
  INTO postgresql://{{PGUSER}}:{{PGPASSWORD}}@{{PGHOST}}/{{PGDATABASE}}
    TARGET TABLE {{TABLE_NAME}}

  WITH
    csv header,
    on error resume next,
    fields terminated by '|',
    truncate
;
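
For reference, I run pgloader roughly like this (load_csv.load and context.ini are placeholder names here; the INI file supplies the values for the {{...}} mustache variables):

pgloader --verbose --context context.ini load_csv.load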

For testing, I have a couple of files into which I deliberately inserted errors. Here's part of the first (the second data row is missing a field, and the third mixes comma separators with unquoted values):

"id"|"code"|"first_name"|"first_name_1"|"last_name"
"jba9ltm3rk58e9bg"|"O1smr83CmvMkqnrV"|"Adam"|"Ant"
"wew6tiq5zmkak19y","I4hlQYrcCSlN3KnV"|Barbara|"Barb"|Bell
"mtsunn0i0ac2t30o"|"vePl6Nti7q1s9YeI"|"Carl"|"Chip"|"Cook"

And part of the second (the second data row is comma-separated throughout):

"id"|"code"|"first_name"|"first_name_1"|"last_name"
"zcwoyy9e2yiuvan3"|"KNK59muKY5AcfQsd"|"Edward"|""|"Ember"
"fvdagry1jbbgtv63","bJnYxUUHo1R3rlYK","Faith","Faith","Flint"
"k606tkzte4t8mjv2"|"qskHKqVvwbPRE0xY"|"Gareth"|"George"|"Gatsby"

At the first error, regardless of whether I use on error resume next or on error stop, pgloader stops processing the current file and moves on to the next file. For example, I never see Carl Cook or Gareth Gatsby in my database.

Pgloader creates a directory named after my target table in the output directory, but there is nothing in it, although the documentation says:

At the end of a load containing rejected rows, you will find two files in the root-dir location, under a directory named the same as the target database of your setup. The filenames are the target table, and their extensions are .dat for the rejected data and .log for the file containing the full PostgreSQL client side logs about the rejected data.
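
Based on that description, this is where I look (assuming the default root-dir of /tmp/pgloader; mydb and mytable stand in for my actual target database and table names):

ls -l /tmp/pgloader/mydb/
# expected, per the docs above:
#   mytable.dat   -- the rejected rows
#   mytable.log   -- the PostgreSQL client-side logs about those rows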

I also tried adding batch rows = 1, hoping pgloader would simply move on to the next one-row batch, but the output did not change.
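
Concretely, the WITH clause for that attempt was:

  WITH
    csv header,
    on error resume next,
    fields terminated by '|',
    batch rows = 1,
    truncate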

I'm probably doing something wrong. What can I check?

michaelmabingnay commented 4 years ago

Hi,

I have the same problem with logging bad rows in input CSV files. Were you able to resolve it? Thanks.