I've been using brew to install SBCL; I don't know about MacPorts. The error you have is because you're using an older version of SBCL where threads are not enabled by default, so you need to compile SBCL with the following options: --with-sb-thread --with-sb-core-compression.
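For reference, a from-source build with those options enabled would look roughly like this (a sketch: it assumes an unpacked SBCL source tree and an existing Common Lisp host to bootstrap with):

# inside the SBCL source directory, with a host Lisp installed
sh make.sh --with-sb-thread --with-sb-core-compression
sudo sh install.sh   # installs under /usr/local by default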
MacPorts is similar to, but older than, Homebrew. As the name suggests, it is modeled on the BSD ports system: https://www.macports.org
As I've never had any problems with it, I've never seen the need to switch to the new hotness.
From the info:
bash-4.3# port info sbcl@fancy
sbcl @1.2.0 (lang)
Variants: fancy, html, pdf, threads
Description: Steel Bank Common Lisp (SBCL) is a Open Source implementation of ANSI Common Lisp. It provides an interactive environment including an integrated
native compiler, interpreter, and debugger. SBCL is quite suitable for the creation of long running system services, as the compiler trades the
time for a long initial compilation for blazingly fast loading of its binary runtime fasl representation.
Homepage: http://www.sbcl.org
Platforms: darwin
License: BSD
Unfortunately, there seems to be a problem building the +threads variant.
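For context, this is the kind of command that fails; MacPorts selects variants with a + suffix:

sudo port install sbcl +threads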
Here's what I have with brew:
~ brew info sbcl
sbcl: stable 1.2.0 (bottled), HEAD
http://www.sbcl.org/
/usr/local/Cellar/sbcl/1.1.1 (269 files, 50M)
/usr/local/Cellar/sbcl/1.1.4 (273 files, 50M)
/usr/local/Cellar/sbcl/1.1.10 (281 files, 52M)
Poured from bottle
/usr/local/Cellar/sbcl/1.1.14 (45 files, 50M)
Poured from bottle
/usr/local/Cellar/sbcl/1.1.17 (45 files, 52M)
Poured from bottle
/usr/local/Cellar/sbcl/1.1.18 (45 files, 51M)
Poured from bottle
/usr/local/Cellar/sbcl/1.2.0 (47 files, 52M) *
Poured from bottle
From: https://github.com/Homebrew/homebrew/commits/master/Library/Formula/sbcl.rb
==> Options
--32-bit
Build 32-bit only
--with-internal-xref
Include XREF information for SBCL internals (increases core size by 5-6MB)
--with-ldb
Include low-level debugger in the build
--without-core-compression
Build SBCL without support for compressed cores and without a dependency on zlib
--without-threads
Build SBCL without support for native threads
--HEAD
install HEAD version
So the defaults are basically reversed. Building SBCL on the Mac works well for me, too.
Did you try http://pgloader.io/files/pgloader-3.0.99.pkg ?
I was using the package but, seeing as I wanted to try the improvements on handling camelCase syntax, I thought I'd have to build from source. Just giving the "threads" build another go, though I'm not sure if the library it wants to use is correct. The "fancy" build says it supports threading.
I'll be more than happy to use a binary if you can supply one. Is 3.0.99 being built from source?
Yes, there's a pkg target in the main Makefile. If you want the very recent bug fixes and enhancements, it's best to learn how to build pgloader and have the right SBCL version. You could also try pgloader --self-upgrade /path/to/git/checkout/of/pgloader, but I've had reports of problems with that solution.
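For a git checkout, the build would go roughly like this (a sketch; pkg is the Makefile target mentioned above):

git clone https://github.com/dimitri/pgloader.git
cd pgloader
make pkg   # builds the OS X installer package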
I wondered why there wasn't an install target. Well, seeing as it won't build with threads on my machine that's not much help. I'll file a bug report with the port maintainer.
It looks like self-upgrade isn't supported:
pgloader --self-upgrade src/pgloader
pgloader [ option ... ] command-file ...
--help -h boolean Show usage and exit.
--version -V boolean Displays pgloader version and exit.
--quiet -q boolean Be quiet
--verbose -v boolean Be verbose
--debug -d boolean Display debug level information.
--client-min-messages string Filter logs seen at the console (default: "warning")
--log-min-messages string Filter logs seen in the logfile (default: "notice")
--root-dir -D string Output root directory. (default: #P"//tmp/pgloader//")
--upgrade-config -U boolean Output the command(s) corresponding to .conf file for
v2.x
--list-encodings -E boolean List pgloader known encodings and exit.
--logfile -L string Filename where to send the logs.
--load -l string Read user code from file
Oh yeah, you're right: self-upgrade is only available in versions later than 3.0.99, sorry about that. See if http://pgloader.io/files/pgloader-3.0.ecd4f31.pkg works for you?
Thanks, the syntax seems to be working okay now. So, the parser is taking care of mapping, say, createDate in the FROM clause to "createDate" in the INTO clause? I could live with requiring both to be quoted.
I'm now getting data-related errors:
2014-06-16T19:15:13.657000+02:00 ERROR Database error 22P02: invalid input syntax for integer: "\N"
Looking at a sample line that looks reasonable:
"16399104","1401620402","All","Jun 1 2014","226","140601_0_2","3","http://www.angege.com/","http://www.angege.com/","50017",\N,"1401615028","901","1333",\N,"2358","2533","2100","71","1574",\N,"29","6","1","1","21","4","16","0","0","0","0","0","32741","6445","6847","1863","17586","8847","7589","0","0","0","0","0","5165","5","22","3","2","0","0","7","348","0","28","0","1","0","0","14746","1418","26","0","16","942","1008","0","0","0","5","",""
I think that's MySQL-speak for NULL, which shouldn't actually be in a CSV. How do I escape/convert it?
You need to use the null if option when declaring the CSV source field, as in the newly committed example at https://github.com/dimitri/pgloader/blob/master/test/csv-nulls.load:
LOAD CSV
    FROM INLINE (id, number null if '\N', data)
    INTO postgresql:///pgloader?nullif

    BEFORE LOAD DO
    $$ drop table if exists nullif; $$,
    $$ CREATE TABLE nullif
       (
        id serial primary key,
        number integer,
        data text
       );
    $$

    WITH fields terminated by ',',
         fields enclosed by '"',
         fields escaped by backslash-quote;

"1",\N,"testing nulls"
"2","2","another test"
That goes in the FROM clause rather than a transformation function?
I'm nearly there, I think; I've just got a problem with an integer column that can be NULL:
2014-06-16T20:25:55.132000+02:00 ERROR Database error 22P02: invalid input syntax for integer: "\N"
CONTEXT: COPY pages, line 3, column onContentLoaded: "\N"
My current script, FWIW:
LOAD CSV
FROM 'httparchive_Jun_1_2014_pages.csv' (
pageid, createDate, archive, label, crawlid,
wptid, wptrun, url, urlShort, urlhash,
cdn null if '\N', startedDateTime, TTFB, renderStart, onContentLoaded null if '\N',
onLoad, fullyLoaded, visualComplete, PageSpeed, SpeedIndex null if '\N',
rank null if '\N', reqTotal, reqHtml, reqJS, reqCSS,
reqImg, reqGif, reqJpg, reqPng, reqFont,
reqFlash, reqJson, reqOther, bytesTotal, bytesHtml,
bytesJS, bytesCSS, bytesImg, bytesGif, bytesJpg,
bytesPng, bytesFont, bytesFlash, bytesJson, bytesOther,
bytesHtmlDoc, numDomains, maxDomainReqs, numRedirects, numHttps,
numGlibs, numErrors, numCompressed, numDomElements, maxageNull,
maxage0, maxage1, maxage30, maxage365, maxageMore,
gzipTotal, gzipSavings, _connections, _adult_site, avg_dom_depth,
document_height, document_width, localstorage_size, sessionstorage_size, num_iframes,
num_scripts, doctype, meta_viewport)
INTO postgresql://postgres@localhost/http?pages (
pageid, "createDate", archive, label, wptid, wptrun, url, cdn, "startedDateTime",
"TTFB", "renderStart", "onContentLoaded", "onLoad", "fullyLoaded", "visualComplete", "PageSpeed",
"speedIndex", rank, "reqTotal", "reqHTML", "reqJS", "reqCSS", "reqImg", "reqGIF", "reqJPG",
"reqPNG", "reqFont", "reqFlash", "reqJSON", "reqOther", "bytesTotal", "bytesHTML", "bytesJS",
"bytesCSS", "bytesImg", "bytesGIF", "bytesJPG", "bytesPNG", "bytesFont", "bytesFlash", "bytesJSON",
"bytesHTMLDoc",
"bytesOther", "numDomains", "maxDomainReqs", "numRedirects", "numHTTPS", "numGlibs", "numErrors",
"numCompressed", "numDomElements", "maxageNull", "maxage0", "maxage1", "maxage30", "maxage365",
"maxageMore", "gzipTotal", "gzipSavings"
)
WITH
fields optionally enclosed by '"',
fields escaped by double-quote,
fields terminated by ','
SET
work_mem to '128MB',
standard_conforming_strings to 'on';
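For reference, the script is run as a pgloader command file, along these lines (flags per the usage output above):

pgloader --debug --verbose httparchive.load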
The data is from http://www.httparchive.org/downloads.php so gzip support would be nice. Thanks very much for your help!
Duh! Silly me - the important line is fields escaped by backslash-quote. Seems to be working now, thanks.
There is still a problem with integer fields which can accept NULL:
debugger invoked on a SIMPLE-TYPE-ERROR in thread
#<THREAD "lparallel" RUNNING {1007F86A13}>:
Argument Y is not a NUMBER: NIL
I will need more information on that one, if possible with a small amount of data to reproduce the problem, or at least the --debug or --debug --verbose levels of detail in the output.
Sure. Just as soon as I figure out how to post a .gz to this issue!
As it seems only image uploads are supported, I renamed it to .jpg. gzip should still be able to unpack it, I hope.
I could reproduce the "\N" error message with the following test data:
"1",\N,"testing nulls"
"2","2","another test"
"3",\\N,"failing test"
The third line here gives me:
2014-06-17T11:37:01.204000+02:00 ERROR Database error 22P02: invalid input syntax for integer: "\\N"
CONTEXT: COPY nullif, line 3, column number: "\\N"
So I think your input file is erroneous, and as you hinted at the beginning of this issue, you will need a specific transformation function to tell pgloader that, in your case, \N and \\N really are the same thing. Something like the following, to go in the INTO clause:
"speedIndex" integer using (unless (or (null speedIndex) (string-equal speedIndex "\\N") (string-equal speedIndex "\\\\N")) speedIndex)
Now, about the Argument Y is not a NUMBER: NIL error, I don't see where it's coming from in your traces.
The log was from a run where I just excluded "onContentLoaded" so it runs faster. I'm not sure the double-quoting isn't itself erroneous.
I'm working with this data file: http://www.archive.org/download/httparchive_downloads_Jun_1_2014/httparchive_Jun_1_2014_pages.csv.gz
I've just done another run including "onContentLoaded"; this is a sample from the log:
2014-06-17T11:33:48.873000+02:00 DATA < #("16399123" "1401620403" "All" "Jun 1 2014" "140601_0_1C" "3"
"http://www.findaproperty.com/" NIL "1401615051" "745" "1556" "8344" "8470"
"8100" "92" "2039" NIL "46" "6" "14" "3" "18" "6" "3" "8" "3" "0" "0" "2"
"443862" "39543" "277460" "32365" "31629" "1389" "2923" "25911" "62865" "0"
"0" "15297" "\\N" "0" "22" "15" "3" "0" "2" "2" "24" "3456" "0" "16" "7"
"1" "7" "15" "424764" "0")
And the corresponding entry from the file:
"16399123","1401620403","All","Jun 1 2014","226","140601_0_1C","3","http://www.findaproperty.com/","http://www.findaproperty.com/","8722",\N,"1401615051","745","1556",\N,"8344","8470","8100","92","2039",
You can see that some of the \Ns are being converted to nil, but not all, which probably means I've got the positioning wrong. The data in the log - does it correspond to what is read from the file or to what is going to be passed to COPY?
I've double-checked the files. As the problem seems related to "onContentLoaded" and "SpeedIndex" only, I suspect that there is a mismatch in field name matching because of the quoting. When I quote the two fields in the FROM clause the cast happens!
So, now the import is working but only seems to be importing a few rows:
sb-impl::*default-external-format* :UTF-8
tmpdir: #P"/var/folders/q0/r5j5dzp55p734_sdh1cblxn80000gn/T/"
2014-06-17T12:24:37.036000+02:00 LOG Starting pgloader, log system is ready.
2014-06-17T12:24:37.055000+02:00 INFO Starting monitor
2014-06-17T12:24:37.063000+02:00 LOG Main logs in '/private/tmp/pgloader/pgloader.log'
2014-06-17T12:24:37.064000+02:00 LOG Data errors in '/private/tmp/pgloader/'
2014-06-17T12:24:37.064000+02:00 LOG Parsing commands from file #P"/Users/charlieclark/Sites/http/httparchive.load"
2014-06-17T12:24:37.265000+02:00 DEBUG CONNECT
2014-06-17T12:24:37.265000+02:00 DEBUG SET work_mem TO '128MB'
2014-06-17T12:24:37.265000+02:00 DEBUG BEGIN
2014-06-17T12:24:37.265000+02:00 DEBUG CONNECT
2014-06-17T12:24:37.265000+02:00 DEBUG SET work_mem TO '128MB'
2014-06-17T12:24:37.265000+02:00 DEBUG BEGIN
2014-06-17T12:24:37.265000+02:00 NOTICE delete from pages
where label = 'Jun 1 2014';
2014-06-17T12:24:48.092000+02:00 NOTICE COPY http.pages
2014-06-17T12:24:48.093000+02:00 DEBUG pgsql:copy-from-queue: pages ("pageid" "createDate" "archive" "label" "wptid"
"wptrun" "url" "cdn" "startedDateTime" "TTFB"
"renderStart" "onContentLoaded" "onLoad"
"fullyLoaded" "visualComplete" "PageSpeed"
"SpeedIndex" "rank" "reqTotal" "reqHTML" "reqJS"
"reqCSS" "reqImg" "reqGIF" "reqJPG" "reqPNG"
"reqFont" "reqFlash" "reqJSON" "reqOther"
"bytesTotal" "bytesHTML" "bytesJS" "bytesCSS"
"bytesImg" "bytesGIF" "bytesJPG" "bytesPNG"
"bytesFont" "bytesFlash" "bytesJSON"
"bytesHTMLDoc" "bytesOther" "numDomains"
"maxDomainReqs" "numRedirects" "numHTTPS"
"numGlibs" "numErrors" "numCompressed"
"numDomElements" "maxageNull" "maxage0" "maxage1"
"maxage30" "maxage365" "maxageMore" "gzipTotal"
"gzipSavings")
2014-06-17T12:24:48.094000+02:00 INFO COPY FROM #P"/Users/charlieclark/Sites/http/httparchive_Jun_1_2014_pages.csv"
2014-06-17T12:24:48.094000+02:00 DEBUG CONNECT ("http" "postgres" NIL "localhost" :PORT 5432)
2014-06-17T12:24:48.094000+02:00 DEBUG SET work_mem TO '128MB'
2014-06-17T12:24:50.754000+02:00 ERROR :WAITING-FOR-NEXT fell through ECASE expression.
Wanted one of (:COLLECTING :COLLECTING-QUOTED :WAITING).
2014-06-17T12:24:50.755000+02:00 DEBUG BEGIN
2014-06-17T12:24:50.755000+02:00 DEBUG SET LOCAL work_mem TO '128MB'
table name read imported errors time
------------------------------ --------- --------- --------- --------------
before load 1 1 0 10.793s
------------------------------ --------- --------- --------- --------------
pages 17255 17255 1 4.633s
------------------------------ --------- --------- --------- --------------
Total import time 17255 17255 1 15.426s
2014-06-17T12:24:52.764000+02:00 DEBUG copy-batch pages 17255 rows
2014-06-17T12:24:52.764000+02:00 INFO Stopping monitor
There should be about 290,000 entries, or am I reading this wrong?
I'm quite surprised about the case issue, in particular because pgloader uses the following function to go from strings in the parsed command file to the internal symbols it destructures against. Note the use of the string-upcase function.
;; The field name is upcased before interning, so camelCase and
;; CAMELCASE end up as the same symbol in PGLOADER.TRANSFORMS.
(defun intern-symbol (symbol-name)
  (intern (string-upcase symbol-name)
          (find-package "PGLOADER.TRANSFORMS")))
The logs you have indicate that it processed only a single batch, given that there's a single DEBUG copy-batch message.
I can't say much about Lisp, but there's obviously a mismatch between one mapping and another. It associates the INTO and FROM fields fine but trips up over the need to cast \N to NIL.
How do I get it to do more batches? I tried playing around with batch rows = settings to no avail.
Also, might it make sense to always quote column names to Postgres?
I think I've found the reason why only one batch is running. There are some lines which have double backslashes in them. If I remove the backslashes, the import runs past these lines:
"16417153","1401640227","All","Jun 1 2014","226","140601_0_JYY","2","http://www.teletalk.com.bd/","http://www.teletalk.com.bd/","9502",\N,"1401635523","655","3314","803","10252","10918","13000","88","5451","15372","41","5","0","1","34","2","31","1","0","1","0","0","2165774","49171","0","13165","2101237","282029","1790568","28640","0","2201","0","0","484","1","41","0","1","0","0","0","8","0","41","0","0","0","0","64526","54067","39","0","2","698","1008","0","0","0","0","HTML -//W3C//DTD HTML 4.01 Frameset//EN\\",""
Shouldn't that be raising an exception rather than just grinding to a halt? Better still, it should simply be parsed. :-)
I think I just saw the problem with the case sensitivity in applying null if rules, thanks for insisting. I will look into the double backslashes tomorrow.
FWIW I filed a bug upstream about the "\", which shouldn't really be happening: https://github.com/HTTPArchive/httparchive/issues/25
A quick note on building on OS X - there seems to be a problem building SBCL with threads on Mavericks if there are any non-ASCII characters in the file listing of the root directory. See https://trac.macports.org/ticket/40980 for details and a workaround. From what I can tell the Homebrew guys are basing their formula on the MacPorts one, so I wouldn't be surprised if it's an issue there as well. Of course, as long as the binary supports threads you won't notice.
By the way, it doesn't appear that the backslashes are a problem in the CSV parsing:
CL-USER> (with-open-file (s "/Users/dim/dev/temp/80.csv")
(cl-csv:read-csv s
:separator #\,
:quote #\"
:escape "\"\""))
(("16417153" "1401640227" "All" "Jun 1 2014" "226" "140601_0_JYY" "2"
"http://www.teletalk.com.bd/" "http://www.teletalk.com.bd/" "9502" "\\N"
"1401635523" "655" "3314" "803" "10252" "10918" "13000" "88" "5451" "15372"
"41" "5" "0" "1" "34" "2" "31" "1" "0" "1" "0" "0" "2165774" "49171" "0"
"13165" "2101237" "282029" "1790568" "28640" "0" "2201" "0" "0" "484" "1"
"41" "0" "1" "0" "0" "0" "8" "0" "41" "0" "0" "0" "0" "64526" "54067" "39"
"0" "2" "698" "1008" "0" "0" "0" "0"
"HTML -//W3C//DTD HTML 4.01 Frameset//EN\\\\" ""))
Note that given the UI in use, we expect backslashes to appear doubled here.
I know it's a mess but how do I work around it?
Now that I'm able to build pgloader I'm able to try stuff out on my machine. Nice to see the null if rules working. Would it be possible to support WITH quote identifiers for CSV imports, as is possible when importing from MySQL?
I'm not sure what you mean. You can already just quote the column names, right?
Sure, but seeing as there is already WITH quote identifiers when copying from MySQL, I thought it would be reasonable to be able to use it here. Lowercasing camelCase field names is bound to trip people up.
The difference is that the MySQL source provides the column names, whereas with a CSV source the command itself has to provide the column names, so I don't think quote identifiers makes sense in that context.
I finally managed to reproduce your error case with the following SQL table definition and reworked load file (I had to fix a number of column spellings so that the case would match the PostgreSQL column case):
CREATE TABLE pages (
"pageid" serial ,
"createDate" integer ,
"archive" varchar(16) ,
"label" varchar(32) ,
"crawlid" integer ,
"wptid" varchar(64) ,
"wptrun" smallint ,
"url" text,
"urlShort" varchar(255) ,
"urlhash" smallint ,
"cdn" varchar(64) ,
"startedDateTime" integer ,
"TTFB" integer ,
"renderStart" integer ,
"onContentLoaded" integer ,
"onLoad" integer ,
"fullyLoaded" integer ,
"visualComplete" integer ,
"PageSpeed" smallint ,
"SpeedIndex" integer ,
"rank" integer ,
"reqTotal" smallint ,
"reqHtml" smallint ,
"reqJS" smallint ,
"reqCSS" smallint ,
"reqImg" smallint ,
"reqGif" smallint ,
"reqJpg" smallint ,
"reqPng" smallint ,
"reqFont" smallint ,
"reqFlash" smallint ,
"reqJson" smallint ,
"reqOther" smallint ,
"bytesTotal" integer ,
"bytesHtml" integer ,
"bytesJS" integer ,
"bytesCSS" integer ,
"bytesImg" integer ,
"bytesGif" integer ,
"bytesJpg" integer ,
"bytesPng" integer ,
"bytesFont" integer ,
"bytesFlash" integer ,
"bytesJson" integer ,
"bytesOther" integer ,
"bytesHtmlDoc" integer ,
"numDomains" integer ,
"maxDomainReqs" integer ,
"numRedirects" integer ,
"numErrors" integer ,
"numGlibs" integer ,
"numHttps" integer ,
"numCompressed" integer ,
"numDomElements" integer ,
"maxageNull" integer ,
"maxage0" integer ,
"maxage1" integer ,
"maxage30" integer ,
"maxage365" integer ,
"maxageMore" integer ,
"gzipTotal" integer ,
"gzipSavings" integer ,
"_connections" smallint ,
"_adult_site" smallint ,
"avg_dom_depth" smallint ,
"document_height" integer ,
"document_width" integer ,
"localstorage_size" integer ,
"sessionstorage_size" integer ,
"num_iframes" smallint ,
"num_scripts" smallint ,
"doctype" varchar(255) ,
"meta_viewport" varchar(255) ,
PRIMARY KEY (pageid),
UNIQUE ("label","urlShort")
);
And the load file:
LOAD CSV
FROM 'httparchive_Jun_1_2014_pages.csv'
(
pageid, createDate, archive, label, crawlid,
wptid, wptrun, url, urlShort, urlhash,
cdn null if '\N', startedDateTime, TTFB, renderStart, onContentLoaded null if '\N',
onLoad, fullyLoaded, visualComplete, PageSpeed, SpeedIndex null if '\N',
rank null if '\N', reqTotal, reqHtml, reqJS, reqCSS,
reqImg, reqGif, reqJpg, reqPng, reqFont,
reqFlash, reqJson, reqOther, bytesTotal, bytesHtml,
bytesJS, bytesCSS, bytesImg, bytesGif, bytesJpg,
bytesPng, bytesFont, bytesFlash, bytesJson, bytesOther,
bytesHtmlDoc, numDomains, maxDomainReqs, numRedirects, numHttps,
numGlibs, numErrors, numCompressed, numDomElements, maxageNull,
maxage0, maxage1, maxage30, maxage365, maxageMore,
gzipTotal, gzipSavings, _connections, _adult_site, avg_dom_depth,
document_height, document_width, localstorage_size, sessionstorage_size, num_iframes,
num_scripts, doctype, meta_viewport
)
INTO postgresql:///pgloader?pages
(
pageid, "createDate", archive, label, wptid, wptrun, url, cdn, "startedDateTime",
"TTFB", "renderStart", "onContentLoaded", "onLoad", "fullyLoaded", "visualComplete", "PageSpeed",
"SpeedIndex", rank, "reqTotal", "reqHtml", "reqJS", "reqCSS", "reqImg", "reqGif", "reqJpg",
"reqPng", "reqFont", "reqFlash", "reqJson", "reqOther", "bytesTotal", "bytesHtml", "bytesJS",
"bytesCSS", "bytesImg", "bytesGif", "bytesJpg", "bytesPng", "bytesFont", "bytesFlash", "bytesJson",
"bytesHtmlDoc",
"bytesOther", "numDomains", "maxDomainReqs", "numRedirects", "numHttps", "numGlibs", "numErrors",
"numCompressed", "numDomElements", "maxageNull", "maxage0", "maxage1", "maxage30", "maxage365",
"maxageMore", "gzipTotal", "gzipSavings"
)
WITH
truncate,
fields optionally enclosed by '"',
fields escaped by backslash-quote,
fields terminated by ','
SET
work_mem to '128MB',
standard_conforming_strings to 'on';
The bug is a cl-csv bug where the parser can't continue doing its job, so pgloader stops right at the problem:
2014-07-21T14:57:50.001000+02:00 LOG Starting pgloader, log system is ready.
2014-07-21T14:57:50.002000+02:00 LOG Parsing commands from file #P"/Users/dim/dev/temp/pgloader-issue-80.load"
2014-07-21T14:57:52.364000+02:00 ERROR :WAITING-FOR-NEXT fell through ECASE expression.
Wanted one of (:COLLECTING :COLLECTING-QUOTED :WAITING).
table name read imported errors time
------------------------------ --------- --------- --------- --------------
pages 17255 17255 1 2.560s
------------------------------ --------- --------- --------- --------------
Total import time 17255 17255 1 2.560s
As soon as we manage to fix the bug in cl-csv the pgloader run should be fine.
I hit this issue too; it makes the tool fairly useless for me.
Do you happen to have a simpler test case? It would help in working towards a solution; IIRC the necessary bits have already been added to cl-csv in the meantime...
Don't know about the SBCL part, but I have a workaround where I pipe the file through sed to strip the offending characters:
gzip -d -c src.gz | sed 's/\\\\//' | pgloader conf.load
I would like that answer to be good enough to close the issue but I would prefer a very easy .load file with INLINE data to reproduce and fix the problem, of course...
I'm still using the same data set with the same crap line about 11,400 lines in. I'll see if I can distill it into a test case.
You should be able to reproduce the error using this: https://gist.github.com/Themanwithoutaplan/d3c8261b406afea9fe84
Thanks for your reduced test case; thanks to cl-csv's new escape mode, the bug seems fixed to me.
Here's the test-case I've been using:
LOAD CSV
FROM inline (
pageid, createDate, archive, label, crawlid,
wptid, wptrun, url, urlShort, urlhash,
cdn [null if '\N'],
startedDateTime,
TTFB [null if '\N'],
renderStart,
onContentLoaded [null if '\N'],
onLoad, fullyLoaded, visualComplete, PageSpeed [null if '\N'], SpeedIndex [null if '\N'],
rank [null if '\N'], reqTotal, reqHtml, reqJS, reqCSS,
reqImg, reqGif, reqJpg, reqPng, reqFont,
reqFlash, reqJson, reqOther, bytesTotal, bytesHtml,
bytesJS, bytesCSS, bytesImg, bytesGif, bytesJpg,
bytesPng, bytesFont, bytesFlash, bytesJson, bytesOther,
bytesHtmlDoc, numDomains, maxDomainReqs, numRedirects, numHttps,
numGlibs, numErrors, numCompressed, numDomElements, maxageNull,
maxage0, maxage1, maxage30, maxage365, maxageMore,
gzipTotal, gzipSavings, _connections, _adult_site, avg_dom_depth,
document_height, document_width, localstorage_size, sessionstorage_size, num_iframes,
num_scripts, doctype, meta_viewport)
INTO postgresql:///pgloader?tablename=t80
(
id bigint using (identity pageid),
doctype
)
WITH
fields optionally enclosed by '"',
fields escaped by '\',
fields terminated by ',',
csv escape mode following
BEFORE LOAD DO
$$ drop table if exists t80; $$,
$$ CREATE TABLE t80
(
id serial primary key,
doctype text
);
$$
;
"27955878","1434380476","All","Jun 15 2015","329","150615_0_D0M","3","http://www.flydubai.com/","http://www.flydubai.com/","9211","Incapsula","1434376648","1543","6326",\N,"11959","12813","11600","80","8854","8521","119","8","42","13","53","10","4","38","0","0","0","2","661291","16571","320664","54020","270035","7815","200089","60725","0","0","0","0","15555","17","99","8","0","0","5","50","1007","0","49","69","1","0","0","409245","0","33","0","11","1317","1008","69","37","2","55","HTML","","0","0","1","0","0","0","0","0","1","0","0","0"
"27955879","1434380476","All","Jun 15 2015","329","150615_0_D2D","1","http://www.seb.se/","http://www.seb.se/","2702",\N,"1434376676","1317","6470",\N,"10319","15431","9300","64","7668","8578","31","2","10","1","12","2","7","1","4","0","0","1","917326","36574","247483","297144","214183","89","208214","3199","121942","0","0","0","36354","6","24","3","0","1","3","2","382","0","14","3","10","3","1","587035","434622","10","0","9","2091","1008","0","0","0","8","html","width=device-width, initial-scale=1.0","0","0","1","0","0","1","0","0","0","0","0","387"
"27955880","1434380476","All","Jun 15 2015","329","150615_0_CW1","3","http://www.teletalk.com.bd/","http://www.teletalk.com.bd/","9502",\N,"1434376510","712","3808","904","10953","11564","13600","72","5541","8374","46","4","0","1","39","1","36","1","0","2","0","0","708768","62576","0","13165","266762","2419","245978","18071","0","366265","0","0","484","1","46","0","1","0","0","0","8","0","46","0","0","0","0","75752","66373","44","0","2","698","1008","0","0","0","0","HTML -//W3C//DTD HTML 4.01 Frameset//EN\\","","0","2","0","0","0","0","0","366265","0","0","0","0"
"27955881","1434380476","All","Jun 15 2015","329","150615_0_D2J","2","http://www.helpster.de/","http://www.helpster.de/","59167",\N,"1434376702","388","2935","5607","6922","8640","4100","90","2984","8583","54","3","14","1","30","4","18","8","2","0","0","4","492958","74870","153806","10578","228001","156","184659","43186","25700","0","0","3","74512","14","18","2","1","0","1","14","644","0","18","3","2","31","0","257661","6511","22","0","10","4295","1112","90","0","1","26","html","width=device-width, initial-scale=1","0","0","0","0","0","0","0","0","0","0","0","0"
"27955882","1434380476","All","Jun 15 2015","329","150615_0_CW6","1","http://www.namepros.com/","http://www.namepros.com/","32025","Cloudflare","1434376387","2323","5314",\N,"8087","18242","12500","92","6179","8379","93","8","30","3","48","6","24","17","1","0","0","3","933198","80301","579744","25109","165484","248","129691","35284","82560","0","0","0","35788","26","47","1","1","0","92","40","2548","0","16","11","26","38","2","705183","0","38","0","11","3655","1008","0","0","7","34","html","width=device-width, initial-scale=1","0","0","0","0","0","0","0","0","0","0","0","0"
And here's what we get as a result of pgloader doing its magic:
table t80;
id | doctype
----------+------------------------------------------
27955878 | HTML
27955879 | html
27955880 | HTML -//W3C//DTD HTML 4.01 Frameset//EN\
27955881 | html
27955882 | html
(5 rows)
Also note that it should work with your current version of pgloader and the following settings:
WITH
fields optionally enclosed by '"',
fields escaped by '\',
fields terminated by ','
where escaped by '\' replaces fields escaped by backslash-quote?
csv escape mode following definitely seems required.
I've managed to build on Mac OS using SBCL from MacPorts (COMPRESS_CORE=NO) but I still can't use pgloader: