dimitri / pgloader

Migrate to PostgreSQL in a single command!
http://pgloader.io

How to CSV load casting to timestamp from unix "seconds since epoch" #527

Closed malcook closed 7 years ago

malcook commented 7 years ago

Can I import a column of unix file timestamps, encoded as seconds since the epoch (such as produced by stat --format "%Y" *.txt), into a PostgreSQL timestamptz column using --cast?

I've tried --cast column atime to timestamp using unix-timestamp-to-timestampz but I get errors.

Perhaps I'm mistaken in expecting --cast to work with CSV at all?

I look forward to the possibility of using pgloader toward this end, with your advice. Thanks

dimitri commented 7 years ago

You're right that the CAST operator isn't available for CSV file types. That's because CSV doesn't offer a way to specify the data type of its content, so we don't know what we're really casting from. Please have a look at the following example, which shows how to write transformation function calls into the CSV syntax:

https://github.com/dimitri/pgloader/blob/master/test/csv-before-after.load
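
Roughly, a command file for the atime case above might look like this. This is only a minimal sketch: the file name, table name, and connection string are placeholders, not taken from the linked test file.

LOAD CSV
     FROM 'files.csv'
          HAVING FIELDS (path, atime)
     INTO postgresql:///mydb
          TARGET TABLE file_times
          TARGET COLUMNS
          (
             path,
             -- the transformation function runs per row at load time,
             -- turning epoch seconds into a timestamptz value
             atime timestamptz using (unix-timestamp-to-timestamptz atime)
          )
     WITH fields terminated by ',',
          fields optionally enclosed by '"';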

malcook commented 7 years ago

Ah, I see. Inexpressible from the command line, but possible with a command file. Thanks for the pointer,

And for pgloader,

~Malcolm

bong0 commented 6 years ago

@malcook could you post the command/load file you used, including the transformation? I'm still struggling with the syntax :/

malcook commented 6 years ago

I wound up stream-editing the output of pwalk with a little Perl.

See https://github.com/fizwit/filesystem-reporting-tools/wiki/Postgresql-loading-practices
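
Roughly along these lines (an illustration only, not the exact command from that wiki page; which field holds the epoch seconds depends on your pwalk output):

# illustration only: rewrite an epoch-seconds field (assumed here to be the
# 3rd comma-separated column) as an ISO-8601 UTC timestamp that PostgreSQL
# accepts directly, so no cast is needed at load time
perl -MPOSIX=strftime -F, -lane \
  '$F[2] = strftime("%Y-%m-%d %H:%M:%S+00", gmtime($F[2])); print join ",", @F' \
  pwalk.out > dump.csv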

bong0 commented 6 years ago

@malcook thank you! Did this give you better import performance?

It actually worked out for me to use the built-in pgloader transformation, but the speed is around 1000 rows/second, which is far too slow for me, and I have no clue why it's that slow even when I tune the concurrency/batch options.

FTR: My working loadfile looks like this:

LOAD CSV
     FROM '/var/lib/spielwiese/dump.csv'
          WITH ENCODING UTF8
          (
             field2   [null if blanks],
             unixdate [null if blanks],
             field3   [null if blanks]
          )
     INTO pgsql://gpadmin@localhost:5432/demo
          TARGET TABLE flightpaths.points
          (
             date timestamp using (unix-timestamp-to-timestamptz unixdate),
             field2,
             field3
          )
     WITH truncate,
          skip header = 1,
          fields optionally enclosed by '"',
          fields terminated by ',',
          disable triggers,
          on error stop,
          drop indexes;
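
For completeness, the batch-behaviour knobs can go into that same WITH clause. This is only a hedged sketch: the option names are from the pgloader reference, and the row counts below are illustrative guesses, not values known to fix the ~1000 rows/second problem.

     -- illustrative batch settings appended to the WITH clause above;
     -- tune the numbers to your data, they are not recommendations
     WITH truncate,
          skip header = 1,
          fields optionally enclosed by '"',
          fields terminated by ',',
          disable triggers,
          on error stop,
          drop indexes,
          batch rows = 50000,
          prefetch rows = 100000;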

malcook commented 6 years ago

I am not currently using pwalk and have no performance stats to share. Good luck....
