dimitri / pgloader

Migrate to PostgreSQL in a single command!
http://pgloader.io

Building on Mac OS #80

Closed: Themanwithoutaplan closed this issue 9 years ago

Themanwithoutaplan commented 10 years ago

I've managed to build on Mac OS using SBCL from MacPorts (COMPRESS_CORE=NO) but I still can't use pgloader:

pgloader sample.load 

debugger invoked on a BORDEAUX-THREADS::BORDEAUX-MP-CONDITION:
  There is no thread support in this instance.

Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.

restarts (invokable by number or by possibly-abbreviated name):
  0: [ABORT] Exit application

(BORDEAUX-THREADS::%MAKE-THREAD #<unavailable argument> #<unavailable argument>)
0] 0

dimitri commented 10 years ago

I've been using brew to install SBCL; I don't know about MacPorts. The error you're seeing means you're using a not-so-recent version of SBCL where threads are not enabled by default, so you need to compile SBCL with the following options: --with-sb-thread --with-sb-core-compression.
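
If you build SBCL from source, its standard make.sh/install.sh scripts accept those options directly; a minimal sketch (assuming a working host Lisp to bootstrap the build):

# in an SBCL source checkout; needs an existing Lisp installed to bootstrap
sh make.sh --with-sb-thread --with-sb-core-compression
# installs under /usr/local by default
sudo sh install.sh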

Themanwithoutaplan commented 10 years ago

MacPorts is similar to, but older than, Homebrew. As the name suggests, it's modelled on the BSD ports system: https://www.macports.org

As I've never had any problems with it, I've never seen the need to switch to the new hotness.

From the info:

bash-4.3# port info sbcl@fancy
sbcl @1.2.0 (lang)
Variants:             fancy, html, pdf, threads

Description:          Steel Bank Common Lisp (SBCL) is a Open Source implementation of ANSI Common Lisp. It provides an interactive environment including an integrated
                      native compiler, interpreter, and debugger. SBCL is quite suitable for the creation of long running system services, as the compiler trades the
                      time for a long initial compilation for blazingly fast loading of its binary runtime fasl representation.
Homepage:             http://www.sbcl.org

Platforms:            darwin
License:              BSD

Unfortunately, there seems to be a problem building the +threads variant.
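
For reference, and assuming standard MacPorts variant syntax, that variant build is requested with:

sudo port install sbcl +threads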

dimitri commented 10 years ago

Here's what I have with brew:

~ brew info sbcl
sbcl: stable 1.2.0 (bottled), HEAD
http://www.sbcl.org/
/usr/local/Cellar/sbcl/1.1.1 (269 files, 50M)
/usr/local/Cellar/sbcl/1.1.4 (273 files, 50M)
/usr/local/Cellar/sbcl/1.1.10 (281 files, 52M)
  Poured from bottle
/usr/local/Cellar/sbcl/1.1.14 (45 files, 50M)
  Poured from bottle
/usr/local/Cellar/sbcl/1.1.17 (45 files, 52M)
  Poured from bottle
/usr/local/Cellar/sbcl/1.1.18 (45 files, 51M)
  Poured from bottle
/usr/local/Cellar/sbcl/1.2.0 (47 files, 52M) *
  Poured from bottle
From: https://github.com/Homebrew/homebrew/commits/master/Library/Formula/sbcl.rb
==> Options
--32-bit
    Build 32-bit only
--with-internal-xref
    Include XREF information for SBCL internals (increases core size by 5-6MB)
--with-ldb
    Include low-level debugger in the build
--without-core-compression
    Build SBCL without support for compressed cores and without a dependency on zlib
--without-threads
    Build SBCL without support for native threads
--HEAD
    install HEAD version

So the defaults are basically reversed. Building SBCL on the Mac works well for me, too.

Did you try http://pgloader.io/files/pgloader-3.0.99.pkg ?

Themanwithoutaplan commented 10 years ago

I was using the package but, seeing as I wanted to try the improvements in handling camelCase syntax, I thought I'd have to build from source. I'm just giving the "threads" build another go, though I'm not sure the library it wants to use is correct. The "fancy" build says it supports threading.

I'll be more than happy to use a binary if you can supply one. Is 3.0.99 built off the current source?

dimitri commented 10 years ago

Yes, there's a pkg target in the main Makefile. If you want the most recent bug fixes and enhancements, it's best to learn how to build pgloader and have the right SBCL version. You could also try pgloader --self-upgrade /path/to/git/checkout/of/pgloader, but I've had problems reported with that solution.
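
A sketch of the from-source build, assuming the usual make workflow in a git checkout (the pkg target is the one mentioned above):

git clone https://github.com/dimitri/pgloader.git
cd pgloader
make        # build the pgloader binary (fetches the Lisp dependencies via Quicklisp)
make pkg    # build the Mac OS .pkg from that binary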

Themanwithoutaplan commented 10 years ago

I wondered why there wasn't an install target. Well, seeing as it won't build with threads on my machine that's not much help. I'll file a bug report with the port maintainer.

It looks like self-upgrade isn't supported:

 pgloader --self-upgrade src/pgloader 
pgloader [ option ... ] command-file ...
  --help -h                       boolean  Show usage and exit. 
  --version -V                    boolean  Displays pgloader version and exit. 
  --quiet -q                      boolean  Be quiet 
  --verbose -v                    boolean  Be verbose 
  --debug -d                      boolean  Display debug level information. 
  --client-min-messages           string   Filter logs seen at the console  (default: "warning")
  --log-min-messages              string   Filter logs seen in the logfile  (default: "notice")
  --root-dir -D                   string   Output root directory.  (default: #P"//tmp/pgloader//")
  --upgrade-config -U             boolean  Output the command(s) corresponding to .conf file for
                                           v2.x 
  --list-encodings -E             boolean  List pgloader known encodings and exit. 
  --logfile -L                    string   Filename where to send the logs. 
  --load -l                       string   Read user code from file 

dimitri commented 10 years ago

Oh yeah, you're right: --self-upgrade is from later than 3.0.99, sorry about that. Can you try whether http://pgloader.io/files/pgloader-3.0.ecd4f31.pkg works for you?

Themanwithoutaplan commented 10 years ago

Thanks, the syntax seems to be working okay now. So the parser is taking care of mapping, say, createDate in the FROM clause to "createDate" in the INTO? I could live with requiring both to be quoted.

Themanwithoutaplan commented 10 years ago

I'm now getting data-related errors:

2014-06-16T19:15:13.657000+02:00 ERROR Database error 22P02: invalid input syntax for integer: "\N"

Looking at a sample line that looks reasonable:

"16399104","1401620402","All","Jun 1 2014","226","140601_0_2","3","http://www.angege.com/","http://www.angege.com/","50017",\N,"1401615028","901","1333",\N,"2358","2533","2100","71","1574",\N,"29","6","1","1","21","4","16","0","0","0","0","0","32741","6445","6847","1863","17586","8847","7589","0","0","0","0","0","5165","5","22","3","2","0","0","7","348","0","28","0","1","0","0","14746","1418","26","0","16","942","1008","0","0","0","5","",""

I think that's MySQL-speak for NULL, which shouldn't actually appear in a CSV. How do I escape/convert it?

dimitri commented 10 years ago

You need to use the null if option when declaring the CSV source field, as in the newly committed example at https://github.com/dimitri/pgloader/blob/master/test/csv-nulls.load.

LOAD CSV
     FROM INLINE (id, number null if '\N', data)
     INTO postgresql:///pgloader?nullif

     BEFORE LOAD DO
      $$ drop table if exists nullif; $$,
      $$ CREATE TABLE nullif
         (
            id     serial primary key,
            number integer,
            data   text
         );
      $$

     WITH fields terminated by ',',
          fields enclosed by '"',
          fields escaped by backslash-quote;

"1",\N,"testing nulls"
"2","2","another test"
Themanwithoutaplan commented 10 years ago

That goes in the FROM clause rather than a transformation function?

I'm nearly there, I think; I've just got a problem with an integer column that can be NULL:

2014-06-16T20:25:55.132000+02:00 ERROR Database error 22P02: invalid input syntax for integer: "\N"
CONTEXT: COPY pages, line 3, column onContentLoaded: "\N"

My current script, FWIW:

LOAD CSV
    FROM 'httparchive_Jun_1_2014_pages.csv' (
    pageid, createDate, archive, label, crawlid,
    wptid, wptrun, url, urlShort, urlhash,
    cdn null if '\N', startedDateTime, TTFB, renderStart, onContentLoaded null if '\N',
    onLoad, fullyLoaded, visualComplete, PageSpeed, SpeedIndex null if '\N',
    rank null if '\N', reqTotal, reqHtml, reqJS, reqCSS,
    reqImg, reqGif, reqJpg, reqPng, reqFont,
    reqFlash, reqJson, reqOther, bytesTotal, bytesHtml,
    bytesJS, bytesCSS, bytesImg, bytesGif, bytesJpg,
    bytesPng, bytesFont, bytesFlash, bytesJson, bytesOther,
    bytesHtmlDoc, numDomains, maxDomainReqs, numRedirects, numHttps,
    numGlibs, numErrors, numCompressed, numDomElements, maxageNull,
    maxage0, maxage1, maxage30, maxage365, maxageMore,
    gzipTotal, gzipSavings, _connections, _adult_site, avg_dom_depth,
    document_height, document_width, localstorage_size, sessionstorage_size, num_iframes,
    num_scripts, doctype, meta_viewport)

    INTO postgresql://postgres@localhost/http?pages (
    pageid, "createDate", archive, label, wptid, wptrun, url, cdn, "startedDateTime",
    "TTFB", "renderStart", "onContentLoaded", "onLoad", "fullyLoaded", "visualComplete", "PageSpeed",
    "speedIndex", rank, "reqTotal", "reqHTML", "reqJS", "reqCSS", "reqImg", "reqGIF", "reqJPG",
    "reqPNG", "reqFont", "reqFlash", "reqJSON", "reqOther", "bytesTotal", "bytesHTML", "bytesJS",
    "bytesCSS", "bytesImg", "bytesGIF", "bytesJPG", "bytesPNG", "bytesFont", "bytesFlash", "bytesJSON",
    "bytesHTMLDoc",
    "bytesOther", "numDomains", "maxDomainReqs", "numRedirects", "numHTTPS", "numGlibs", "numErrors",
    "numCompressed", "numDomElements", "maxageNull", "maxage0", "maxage1", "maxage30", "maxage365",
    "maxageMore", "gzipTotal", "gzipSavings"
    )

    WITH
        fields optionally enclosed by '"',
        fields escaped by double-quote,
        fields terminated by ','

    SET
        work_mem to '128MB',
        standard_conforming_strings to 'on';

The data is from http://www.httparchive.org/downloads.php so gzip support would be nice. Thanks very much for your help!

Themanwithoutaplan commented 10 years ago

Duh! Silly me - the important line is fields escaped by backslash-quote. Seems to be working now, thanks.

Themanwithoutaplan commented 10 years ago

There is still a problem for integer fields which can accept NULL.

debugger invoked on a SIMPLE-TYPE-ERROR in thread
#<THREAD "lparallel" RUNNING {1007F86A13}>:
  Argument Y is not a NUMBER: NIL

dimitri commented 10 years ago

I will need more information on that one, if possible with a small amount of data to reproduce the problem, or at least the --debug or --debug --verbose levels of detail in the output.

Themanwithoutaplan commented 10 years ago

Sure, just as soon as I figure out how to post a .gz to this issue!

As it seems only image uploads are supported, I renamed the attachment (import_error) to .jpg. gzip should still be able to unpack it, I hope.

dimitri commented 10 years ago

I could reproduce the "\N" error message with the following test data:

"1",\N,"testing nulls"
"2","2","another test"
"3",\\N,"failing test"

The third line here gives me:

2014-06-17T11:37:01.204000+02:00 ERROR Database error 22P02: invalid input syntax for integer: "\\N"
CONTEXT: COPY nullif, line 3, column number: "\\N"

So I think your input file is erroneous and, as you hinted at the beginning of this issue, you will need a specific transformation function to tell pgloader that, in your case, \N and \\N really mean the same thing. Something like the following, to put in the INTO clause:

    "speedIndex" integer using (unless (or (null speedIndex) (string-equal speedIndex "\\N") (string-equal speedIndex "\\\\N")) speedIndex)

Now, about the Argument Y is not a NUMBER: NIL error: I don't see where it's coming from in your traces.

Themanwithoutaplan commented 10 years ago

The log was from a run where I just excluded "onContentLoaded" so it runs faster. I'm not sure if the double-quoting isn't erroneous.

I'm working with this data file: http://www.archive.org/download/httparchive_downloads_Jun_1_2014/httparchive_Jun_1_2014_pages.csv.gz

I've just done another run including "onContentLoaded"; this is a sample from the log:

2014-06-17T11:33:48.873000+02:00 DATA < #("16399123" "1401620403" "All" "Jun 1 2014" "140601_0_1C" "3"
    "http://www.findaproperty.com/" NIL "1401615051" "745" "1556" "8344" "8470"
    "8100" "92" "2039" NIL "46" "6" "14" "3" "18" "6" "3" "8" "3" "0" "0" "2"
    "443862" "39543" "277460" "32365" "31629" "1389" "2923" "25911" "62865" "0"
    "0" "15297" "\\N" "0" "22" "15" "3" "0" "2" "2" "24" "3456" "0" "16" "7"
    "1" "7" "15" "424764" "0")

And the corresponding entry from the file:

"16399123","1401620403","All","Jun 1 2014","226","140601_0_1C","3","http://www.findaproperty.com/","http://www.findaproperty.com/","8722",\N,"1401615051","745","1556",\N,"8344","8470","8100","92","2039",

You can see that some of the \Ns are being converted to NIL, but not all, which probably means I've got the positioning wrong. The data in the log: does it correspond to what is read from the file, or to what is going to be passed to COPY?

Themanwithoutaplan commented 10 years ago

I've double-checked the files. As the problem seems related to "onContentLoaded" and "SpeedIndex" only, I suspect there is a mismatch in field-name matching because of the quoting. When I quote the two fields in the FROM clause, the cast happens!

So now the import is working, but it only seems to be importing a few rows:

sb-impl::*default-external-format* :UTF-8
tmpdir: #P"/var/folders/q0/r5j5dzp55p734_sdh1cblxn80000gn/T/"
2014-06-17T12:24:37.036000+02:00 LOG Starting pgloader, log system is ready.
2014-06-17T12:24:37.055000+02:00 INFO Starting monitor
2014-06-17T12:24:37.063000+02:00 LOG Main logs in '/private/tmp/pgloader/pgloader.log'
2014-06-17T12:24:37.064000+02:00 LOG Data errors in '/private/tmp/pgloader/'
2014-06-17T12:24:37.064000+02:00 LOG Parsing commands from file #P"/Users/charlieclark/Sites/http/httparchive.load"
2014-06-17T12:24:37.265000+02:00 DEBUG CONNECT
2014-06-17T12:24:37.265000+02:00 DEBUG SET work_mem TO '128MB'
2014-06-17T12:24:37.265000+02:00 DEBUG BEGIN
2014-06-17T12:24:37.265000+02:00 DEBUG CONNECT
2014-06-17T12:24:37.265000+02:00 DEBUG SET work_mem TO '128MB'
2014-06-17T12:24:37.265000+02:00 DEBUG BEGIN
2014-06-17T12:24:37.265000+02:00 NOTICE delete from pages
        where label = 'Jun 1 2014';
2014-06-17T12:24:48.092000+02:00 NOTICE COPY http.pages
2014-06-17T12:24:48.093000+02:00 DEBUG pgsql:copy-from-queue: pages ("pageid" "createDate" "archive" "label" "wptid"
                              "wptrun" "url" "cdn" "startedDateTime" "TTFB"
                              "renderStart" "onContentLoaded" "onLoad"
                              "fullyLoaded" "visualComplete" "PageSpeed"
                              "SpeedIndex" "rank" "reqTotal" "reqHTML" "reqJS"
                              "reqCSS" "reqImg" "reqGIF" "reqJPG" "reqPNG"
                              "reqFont" "reqFlash" "reqJSON" "reqOther"
                              "bytesTotal" "bytesHTML" "bytesJS" "bytesCSS"
                              "bytesImg" "bytesGIF" "bytesJPG" "bytesPNG"
                              "bytesFont" "bytesFlash" "bytesJSON"
                              "bytesHTMLDoc" "bytesOther" "numDomains"
                              "maxDomainReqs" "numRedirects" "numHTTPS"
                              "numGlibs" "numErrors" "numCompressed"
                              "numDomElements" "maxageNull" "maxage0" "maxage1"
                              "maxage30" "maxage365" "maxageMore" "gzipTotal"
                              "gzipSavings")
2014-06-17T12:24:48.094000+02:00 INFO COPY FROM #P"/Users/charlieclark/Sites/http/httparchive_Jun_1_2014_pages.csv"
2014-06-17T12:24:48.094000+02:00 DEBUG CONNECT ("http" "postgres" NIL "localhost" :PORT 5432)
2014-06-17T12:24:48.094000+02:00 DEBUG SET work_mem TO '128MB'
2014-06-17T12:24:50.754000+02:00 ERROR :WAITING-FOR-NEXT fell through ECASE expression.
Wanted one of (:COLLECTING :COLLECTING-QUOTED :WAITING).
2014-06-17T12:24:50.755000+02:00 DEBUG BEGIN
2014-06-17T12:24:50.755000+02:00 DEBUG SET LOCAL work_mem TO '128MB'

                    table name       read   imported     errors            time

------------------------------  ---------  ---------  ---------  --------------
                   before load          1          1          0         10.793s
------------------------------  ---------  ---------  ---------  --------------

                         pages      17255      17255          1          4.633s
------------------------------  ---------  ---------  ---------  --------------
             Total import time      17255      17255          1         15.426s

2014-06-17T12:24:52.764000+02:00 DEBUG copy-batch pages 17255 rows
2014-06-17T12:24:52.764000+02:00 INFO Stopping monitor

There should be about 290,000 entries, or am I reading this wrong?

dimitri commented 10 years ago

Field names, case, quoting and transforms

I'm quite surprised about the case issue, in particular because pgloader uses the following function to go from strings in the parsed command file to the internal symbols it destructures against. Note the usage of the string-upcase function.

(defun intern-symbol (symbol-name)
  (intern (string-upcase symbol-name)
      (find-package "PGLOADER.TRANSFORMS")))
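
For instance, at the REPL, any camelCase spelling collapses to the same upcased symbol name:

CL-USER> (string-upcase "createDate")
"CREATEDATE"
CL-USER> (string-upcase "SpeedIndex")
"SPEEDINDEX"

So both createDate and "createDate" end up interned as the same PGLOADER.TRANSFORMS symbol, which is why I'm surprised the case mattered at all.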

Summary counters and logs

The logs you have indicate that only a single batch was processed, given that there's a single DEBUG copy-batch message.

Themanwithoutaplan commented 10 years ago

Can't say much about Lisp, but there's obviously a mismatch between one mapping and another. It associates the INTO and FROM fields fine but trips up over the need to cast \N to NIL.

How do I get it to do more batches? I tried playing around with the batch rows = setting, to no avail.

Themanwithoutaplan commented 10 years ago

Also, might it make sense to always quote column names when sending them to Postgres?

Themanwithoutaplan commented 10 years ago

I think I've found the reason why only one batch is running. There are some lines which have double backslashes in them. If I remove the backslashes, the import runs past these lines:

"16417153","1401640227","All","Jun 1 2014","226","140601_0_JYY","2","http://www.teletalk.com.bd/","http://www.teletalk.com.bd/","9502",\N,"1401635523","655","3314","803","10252","10918","13000","88","5451","15372","41","5","0","1","34","2","31","1","0","1","0","0","2165774","49171","0","13165","2101237","282029","1790568","28640","0","2201","0","0","484","1","41","0","1","0","0","0","8","0","41","0","0","0","0","64526","54067","39","0","2","698","1008","0","0","0","0","HTML -//W3C//DTD HTML 4.01 Frameset//EN\\",""

Shouldn't that raise an exception rather than just grinding to a halt? Better still, it should be parsed. :-)

dimitri commented 10 years ago

I think I just saw the problem with the case sensitivity in applying null if rules, thanks for insisting. Will look into the double slashes tomorrow.

Themanwithoutaplan commented 10 years ago

FWIW, I filed a bug upstream about the "\" (https://github.com/HTTPArchive/httparchive/issues/25), which shouldn't really be happening.

Themanwithoutaplan commented 10 years ago

A quick note on building on OS X: there seems to be a problem building SBCL with threads on Mavericks if there are any non-ASCII characters in the file listing of the root directory. See https://trac.macports.org/ticket/40980 for details and a workaround. From what I can tell, the Homebrew guys base their formula on MacPorts, so I wouldn't be surprised if it's an issue there as well. Of course, as long as the binary supports threads you won't notice.

dimitri commented 10 years ago

By the way, it doesn't appear that the backslashes are a problem in the CSV parsing:

CL-USER> (with-open-file (s "/Users/dim/dev/temp/80.csv")
           (cl-csv:read-csv s
                            :separator #\,
                            :quote #\"
                            :escape "\"\""))
(("16417153" "1401640227" "All" "Jun 1 2014" "226" "140601_0_JYY" "2"
  "http://www.teletalk.com.bd/" "http://www.teletalk.com.bd/" "9502" "\\N"
  "1401635523" "655" "3314" "803" "10252" "10918" "13000" "88" "5451" "15372"
  "41" "5" "0" "1" "34" "2" "31" "1" "0" "1" "0" "0" "2165774" "49171" "0"
  "13165" "2101237" "282029" "1790568" "28640" "0" "2201" "0" "0" "484" "1"
  "41" "0" "1" "0" "0" "0" "8" "0" "41" "0" "0" "0" "0" "64526" "54067" "39"
  "0" "2" "698" "1008" "0" "0" "0" "0"
  "HTML -//W3C//DTD HTML 4.01 Frameset//EN\\\\" ""))

Note that, given the printed representation in use here (Lisp strings escape backslashes), we expect the backslashes to appear doubled.

Themanwithoutaplan commented 10 years ago

I know it's a mess but how do I work around it?

Themanwithoutaplan commented 10 years ago

Now that I'm able to build pgloader, I can try stuff out on my machine. Nice to see the null if rules working. Would it be possible to support WITH quote identifiers for CSV imports, as is possible when importing from MySQL?

dimitri commented 10 years ago

I'm not sure what you mean. You can already just quote the column names, right?

Themanwithoutaplan commented 10 years ago

Sure, but seeing as there is already WITH quote identifiers when copying from MySQL, I thought it would be reasonable to be able to use it here too. Lowercasing camelCase field names is bound to trip people up.

dimitri commented 10 years ago

The difference is that the MySQL source provides the column names, whereas with a CSV source the command itself has to provide the column names, so I don't think quote identifiers makes sense in that context.

dimitri commented 10 years ago

I finally managed to reproduce your error case with the following SQL table definition and reworked load file (I had to fix a number of column spellings so that the case would match the PostgreSQL column case):

CREATE TABLE pages (
  "pageid" serial  ,
  "createDate" integer  ,
  "archive" varchar(16) ,
  "label" varchar(32) ,
  "crawlid" integer  ,
  "wptid" varchar(64) ,
  "wptrun" smallint  ,
  "url" text,
  "urlShort" varchar(255) ,
  "urlhash" smallint  ,
  "cdn" varchar(64) ,
  "startedDateTime" integer  ,
  "TTFB" integer  ,
  "renderStart" integer  ,
  "onContentLoaded" integer  ,
  "onLoad" integer  ,
  "fullyLoaded" integer  ,
  "visualComplete" integer  ,
  "PageSpeed" smallint  ,
  "SpeedIndex" integer  ,
  "rank" integer  ,
  "reqTotal" smallint  ,
  "reqHtml" smallint  ,
  "reqJS" smallint  ,
  "reqCSS" smallint  ,
  "reqImg" smallint  ,
  "reqGif" smallint  ,
  "reqJpg" smallint  ,
  "reqPng" smallint  ,
  "reqFont" smallint  ,
  "reqFlash" smallint  ,
  "reqJson" smallint  ,
  "reqOther" smallint  ,
  "bytesTotal" integer  ,
  "bytesHtml" integer  ,
  "bytesJS" integer  ,
  "bytesCSS" integer  ,
  "bytesImg" integer  ,
  "bytesGif" integer  ,
  "bytesJpg" integer  ,
  "bytesPng" integer  ,
  "bytesFont" integer  ,
  "bytesFlash" integer  ,
  "bytesJson" integer  ,
  "bytesOther" integer  ,
  "bytesHtmlDoc" integer  ,
  "numDomains" integer  ,
  "maxDomainReqs" integer  ,
  "numRedirects" integer  ,
  "numErrors" integer  ,
  "numGlibs" integer  ,
  "numHttps" integer  ,
  "numCompressed" integer  ,
  "numDomElements" integer  ,
  "maxageNull" integer  ,
  "maxage0" integer  ,
  "maxage1" integer  ,
  "maxage30" integer  ,
  "maxage365" integer  ,
  "maxageMore" integer  ,
  "gzipTotal" integer  ,
  "gzipSavings" integer  ,
  "_connections" smallint  ,
  "_adult_site" smallint ,
  "avg_dom_depth" smallint  ,
  "document_height" integer  ,
  "document_width" integer  ,
  "localstorage_size" integer  ,
  "sessionstorage_size" integer  ,
  "num_iframes" smallint  ,
  "num_scripts" smallint  ,
  "doctype" varchar(255) ,
  "meta_viewport" varchar(255) ,
  PRIMARY KEY (pageid),
  UNIQUE ("label","urlShort")
);

And the reworked load file:

LOAD CSV
    FROM 'httparchive_Jun_1_2014_pages.csv'
    (
    pageid, createDate, archive, label, crawlid,
    wptid, wptrun, url, urlShort, urlhash,
    cdn null if '\N', startedDateTime, TTFB, renderStart, onContentLoaded null if '\N',
    onLoad, fullyLoaded, visualComplete, PageSpeed, SpeedIndex null if '\N',
    rank null if '\N', reqTotal, reqHtml, reqJS, reqCSS,
    reqImg, reqGif, reqJpg, reqPng, reqFont,
    reqFlash, reqJson, reqOther, bytesTotal, bytesHtml,
    bytesJS, bytesCSS, bytesImg, bytesGif, bytesJpg,
    bytesPng, bytesFont, bytesFlash, bytesJson, bytesOther,
    bytesHtmlDoc, numDomains, maxDomainReqs, numRedirects, numHttps,
    numGlibs, numErrors, numCompressed, numDomElements, maxageNull,
    maxage0, maxage1, maxage30, maxage365, maxageMore,
    gzipTotal, gzipSavings, _connections, _adult_site, avg_dom_depth,
    document_height, document_width, localstorage_size, sessionstorage_size, num_iframes,
    num_scripts, doctype, meta_viewport
    )

    INTO postgresql:///pgloader?pages
    (
    pageid, "createDate", archive, label, wptid, wptrun, url, cdn, "startedDateTime",
    "TTFB", "renderStart", "onContentLoaded", "onLoad", "fullyLoaded", "visualComplete", "PageSpeed",
    "SpeedIndex", rank, "reqTotal", "reqHtml", "reqJS", "reqCSS", "reqImg", "reqGif", "reqJpg",
    "reqPng", "reqFont", "reqFlash", "reqJson", "reqOther", "bytesTotal", "bytesHtml", "bytesJS",
    "bytesCSS", "bytesImg", "bytesGif", "bytesJpg", "bytesPng", "bytesFont", "bytesFlash", "bytesJson",
    "bytesHtmlDoc",
    "bytesOther", "numDomains", "maxDomainReqs", "numRedirects", "numHttps", "numGlibs", "numErrors",
    "numCompressed", "numDomElements", "maxageNull", "maxage0", "maxage1", "maxage30", "maxage365",
    "maxageMore", "gzipTotal", "gzipSavings"
    )

    WITH
        truncate,
        fields optionally enclosed by '"',
        fields escaped by backslash-quote,
        fields terminated by ','

    SET
        work_mem to '128MB',
        standard_conforming_strings to 'on';

The bug is a cl-csv bug where the parser can't continue doing its job, so pgloader stops right at the problem:

2014-07-21T14:57:50.001000+02:00 LOG Starting pgloader, log system is ready.
2014-07-21T14:57:50.002000+02:00 LOG Parsing commands from file #P"/Users/dim/dev/temp/pgloader-issue-80.load"
2014-07-21T14:57:52.364000+02:00 ERROR :WAITING-FOR-NEXT fell through ECASE expression.
Wanted one of (:COLLECTING :COLLECTING-QUOTED :WAITING).

                    table name       read   imported     errors            time
------------------------------  ---------  ---------  ---------  --------------
                         pages      17255      17255          1          2.560s
------------------------------  ---------  ---------  ---------  --------------
             Total import time      17255      17255          1          2.560s

As soon as we manage to fix the bug in cl-csv the pgloader run should be fine.

schmmd commented 9 years ago

I hit this issue too; it makes the tool fairly useless for me.

dimitri commented 9 years ago

Do you happen to have a simpler test case? It would help in getting to a solution; IIRC the necessary bits have already been added to cl-csv in the meantime...

Themanwithoutaplan commented 9 years ago

Don't know about the SBCL part, but I have a workaround where I pipe the file through sed to strip the offending characters:

gzip -d -c src.gz | sed 's/\\\\//' | pgloader conf.load

dimitri commented 9 years ago

I would like that answer to be good enough to close the issue but I would prefer a very easy .load file with INLINE data to reproduce and fix the problem, of course...

Themanwithoutaplan commented 9 years ago

I'm still using the same data set with the same crap line about 11,400 lines in. I'll see if I can distill it into a test case.

Themanwithoutaplan commented 9 years ago

You should be able to reproduce the error using this: https://gist.github.com/Themanwithoutaplan/d3c8261b406afea9fe84

dimitri commented 9 years ago

Thanks for your reduced test case; thanks to cl-csv's new escape mode, the bug seems fixed to me.

Here's the test-case I've been using:

LOAD CSV
    FROM inline (
    pageid, createDate, archive, label, crawlid,
    wptid, wptrun, url, urlShort, urlhash,
    cdn [null if '\N'],
    startedDateTime,
    TTFB [null if '\N'],
    renderStart,
    onContentLoaded [null if '\N'],
    onLoad, fullyLoaded, visualComplete, PageSpeed [null if '\N'], SpeedIndex [null if '\N'],
    rank [null if '\N'], reqTotal, reqHtml, reqJS, reqCSS,
    reqImg, reqGif, reqJpg, reqPng, reqFont,
    reqFlash, reqJson, reqOther, bytesTotal, bytesHtml,
    bytesJS, bytesCSS, bytesImg, bytesGif, bytesJpg,
    bytesPng, bytesFont, bytesFlash, bytesJson, bytesOther,
    bytesHtmlDoc, numDomains, maxDomainReqs, numRedirects, numHttps,
    numGlibs, numErrors, numCompressed, numDomElements, maxageNull,
    maxage0, maxage1, maxage30, maxage365, maxageMore,
    gzipTotal, gzipSavings, _connections, _adult_site, avg_dom_depth,
    document_height, document_width, localstorage_size, sessionstorage_size, num_iframes,
    num_scripts, doctype, meta_viewport)

    INTO postgresql:///pgloader?tablename=t80
         (
          id bigint using (identity pageid),
          doctype
         )

    WITH
        fields optionally enclosed by '"',
        fields escaped by '\',
        fields terminated by ',',
        csv escape mode following

    BEFORE LOAD DO
      $$ drop table if exists t80; $$,
      $$ CREATE TABLE t80
         (
            id        serial primary key,
            doctype   text
         );
      $$

;

"27955878","1434380476","All","Jun 15 2015","329","150615_0_D0M","3","http://www.flydubai.com/","http://www.flydubai.com/","9211","Incapsula","1434376648","1543","6326",\N,"11959","12813","11600","80","8854","8521","119","8","42","13","53","10","4","38","0","0","0","2","661291","16571","320664","54020","270035","7815","200089","60725","0","0","0","0","15555","17","99","8","0","0","5","50","1007","0","49","69","1","0","0","409245","0","33","0","11","1317","1008","69","37","2","55","HTML","","0","0","1","0","0","0","0","0","1","0","0","0"
"27955879","1434380476","All","Jun 15 2015","329","150615_0_D2D","1","http://www.seb.se/","http://www.seb.se/","2702",\N,"1434376676","1317","6470",\N,"10319","15431","9300","64","7668","8578","31","2","10","1","12","2","7","1","4","0","0","1","917326","36574","247483","297144","214183","89","208214","3199","121942","0","0","0","36354","6","24","3","0","1","3","2","382","0","14","3","10","3","1","587035","434622","10","0","9","2091","1008","0","0","0","8","html","width=device-width, initial-scale=1.0","0","0","1","0","0","1","0","0","0","0","0","387"
"27955880","1434380476","All","Jun 15 2015","329","150615_0_CW1","3","http://www.teletalk.com.bd/","http://www.teletalk.com.bd/","9502",\N,"1434376510","712","3808","904","10953","11564","13600","72","5541","8374","46","4","0","1","39","1","36","1","0","2","0","0","708768","62576","0","13165","266762","2419","245978","18071","0","366265","0","0","484","1","46","0","1","0","0","0","8","0","46","0","0","0","0","75752","66373","44","0","2","698","1008","0","0","0","0","HTML -//W3C//DTD HTML 4.01 Frameset//EN\\","","0","2","0","0","0","0","0","366265","0","0","0","0"
"27955881","1434380476","All","Jun 15 2015","329","150615_0_D2J","2","http://www.helpster.de/","http://www.helpster.de/","59167",\N,"1434376702","388","2935","5607","6922","8640","4100","90","2984","8583","54","3","14","1","30","4","18","8","2","0","0","4","492958","74870","153806","10578","228001","156","184659","43186","25700","0","0","3","74512","14","18","2","1","0","1","14","644","0","18","3","2","31","0","257661","6511","22","0","10","4295","1112","90","0","1","26","html","width=device-width, initial-scale=1","0","0","0","0","0","0","0","0","0","0","0","0"
"27955882","1434380476","All","Jun 15 2015","329","150615_0_CW6","1","http://www.namepros.com/","http://www.namepros.com/","32025","Cloudflare","1434376387","2323","5314",\N,"8087","18242","12500","92","6179","8379","93","8","30","3","48","6","24","17","1","0","0","3","933198","80301","579744","25109","165484","248","129691","35284","82560","0","0","0","35788","26","47","1","1","0","92","40","2548","0","16","11","26","38","2","705183","0","38","0","11","3655","1008","0","0","7","34","html","width=device-width, initial-scale=1","0","0","0","0","0","0","0","0","0","0","0","0"

And here's what we get as a result of pgloader doing its magic:

table t80;
    id    |                 doctype                  
----------+------------------------------------------
 27955878 | HTML
 27955879 | html
 27955880 | HTML -//W3C//DTD HTML 4.01 Frameset//EN\
 27955881 | html
 27955882 | html
(5 rows)

dimitri commented 9 years ago

Also note that it should work with your current version of pgloader and the following settings:

    WITH
        fields optionally enclosed by '"',
        fields escaped by '\',
        fields terminated by ','

Themanwithoutaplan commented 9 years ago

escaped by '\' replaces fields escaped by backslash-quote? csv escape mode following definitely seems required.
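
That is, the full combination from the test case above:

    WITH
        fields optionally enclosed by '"',
        fields escaped by '\',
        fields terminated by ',',
        csv escape mode following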