dimitri / pgloader

Migrate to PostgreSQL in a single command!
http://pgloader.io

Importer stuck at certain table #337

Closed pathikrit closed 7 years ago

pathikrit commented 8 years ago

My config:

LOAD database
  FROM  mysql://dev:dev@localhost/main
  INTO  postgresql://dev@localhost/main
  WITH  include no drop, create tables, no truncate, create indexes, reset sequences, foreign keys
  SET   maintenance_work_mem to '128MB', work_mem to '12MB'
  CAST type datetime to timestamptz drop default drop not null using zero-dates-to-null,
       type date drop not null drop default using zero-dates-to-null
  BEFORE LOAD DO $$ CREATE SCHEMA IF NOT EXISTS main; $$
;

Last few lines in the log file:

2016-02-04T14:18:01.924000-08:00 NOTICE CREATE INDEX idx_16684_fk_entity_types_render_field_id ON entity_types (render_field_id);
2016-02-04T14:18:01.924000-08:00 NOTICE CREATE INDEX idx_16684_fk_entity_types_sort_field_id ON entity_types (sort_field_id);
2016-02-04T14:18:01.924000-08:00 NOTICE CREATE UNIQUE INDEX idx_16684_name ON entity_types (name);
2016-02-04T14:18:01.924000-08:00 NOTICE CREATE UNIQUE INDEX idx_16684_primary ON entity_types (id);
2016-02-04T14:18:01.924000-08:00 NOTICE COPY events
2016-02-04T14:19:00.839000-08:00 NOTICE CREATE UNIQUE INDEX idx_16693_primary ON events (id);
2016-02-04T14:19:00.839000-08:00 NOTICE CREATE INDEX idx_16693_user ON events ("user");
2016-02-04T14:19:00.840000-08:00 NOTICE COPY field_audits
2016-02-04T14:19:22.322000-08:00 NOTICE CREATE INDEX idx_16702_entity_type_id ON field_audits (entity_type_id);
2016-02-04T14:19:22.322000-08:00 NOTICE CREATE INDEX idx_16702_field_id ON field_audits (field_id);
2016-02-04T14:19:22.322000-08:00 NOTICE CREATE INDEX idx_16702_fk_field_audits_context_fund_id ON field_audits (context_fund_id);
2016-02-04T14:19:22.324000-08:00 NOTICE CREATE INDEX idx_16702_fk_field_audits_fund_id ON field_audits (fund_id);
2016-02-04T14:19:22.324000-08:00 NOTICE CREATE INDEX idx_16702_fk_field_audits_impersonating_user_id ON field_audits (impersonating_user_id);
2016-02-04T14:19:22.324000-08:00 NOTICE CREATE INDEX idx_16702_last_field_update ON field_audits (entity_id, field_id, context_fund_id, update_time);
2016-02-04T14:19:22.324000-08:00 NOTICE CREATE UNIQUE INDEX idx_16702_primary ON field_audits (id);
2016-02-04T14:19:22.324000-08:00 NOTICE CREATE INDEX idx_16702_smartlist_id ON field_audits (smartlist_id);
2016-02-04T14:19:22.324000-08:00 NOTICE CREATE INDEX idx_16702_user ON field_audits ("user");
2016-02-04T14:19:22.324000-08:00 NOTICE CREATE INDEX idx_16702_view_id ON field_audits (view_id);
2016-02-04T14:19:22.729000-08:00 NOTICE COPY field_search_alias
2016-02-04T14:19:23.337000-08:00 NOTICE COPY files
mysql> select count(*) FROM files;
2071217

psql> select count(*) FROM files;
2050000

It's strange that it has been stuck at such a round number for 2 hours. Do I need to tweak some params? The total size of my MySQL db is 20GB...
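
A couple of queries on the PostgreSQL side can help tell a slow COPY apart from a genuine hang; a sketch, assuming PostgreSQL 9.2+ for the state column (the files table is the one from above):

select pid, state, query
  from pg_stat_activity
 where query ilike 'copy %';

-- the planner's estimate is much cheaper than count(*) on a big table,
-- but note it is only refreshed by (auto)vacuum/analyze, so it can lag:
select reltuples::bigint as approx_rows
  from pg_class
 where relname = 'files';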

dimitri commented 8 years ago

Can you please run in debug mode so that you have the full details in the log? Can you report what top and other tools are saying about the running process (is it still running, how much memory is it consuming, etc.)?

pathikrit commented 8 years ago

I tried again with more memory:

LOAD database
  FROM  mysql://dev:dev@localhost/main
  INTO  postgresql://dev@localhost/main
  WITH  include drop, create tables, no truncate, create indexes, reset sequences, foreign keys
  SET   maintenance_work_mem to '512MB', work_mem to '128MB'
  CAST type datetime to timestamptz drop default drop not null using zero-dates-to-null,
       type date drop not null drop default using zero-dates-to-null
  BEFORE LOAD DO $$ CREATE SCHEMA IF NOT EXISTS main; $$
;

It has been stuck at the same table for 8 hours, with exactly 1200000 rows in Postgres.

top is showing this for pgloader:

pgloader    1.15 GB 917.5 MB    69  421 6343    pbhowmick   96.0    17:03:40.76
dimitri commented 8 years ago

I need more information to understand the case... can you gather more?

pathikrit commented 8 years ago

Hi, since the debug output includes table names, I am going to delete the previous comment from public view. You should still have it in an email tied to your GitHub account.

dimitri commented 8 years ago

Can't see the email; did you send it explicitly, or would it be a GitHub notification? (I don't get those.)

mnietz commented 8 years ago

I guess I have a similar behaviour.

I have a MySQL database of about 35GB, and in a first step I want to migrate only 12 (backup) tables of about 4GB to Postgres using pgloader.

Migration stops at certain tables; I identified 3 out of the 12 tables as problematic. All the affected tables include columns with very long text values.

What I have recognized so far is that when the problem occurs, pgloader no longer uses all my available cores; instead it is stuck on 1 core and utilizes it at 100%. What's a bit strange is that it is 50/50 user and system(?) load. When everything works fine I see only user load most of the time. Stracing the process ID shows no activity.

Trying to run the process in debug mode ends up in an out-of-memory situation, so I wasn't able to generate a debug.log so far.

As a workaround I can do the migration table by table, which works without any issues.

I will see if I'm able to gather further information...

dimitri commented 8 years ago

It might be tied to the new improved support for very long text in the MySQL driver. Can you export a small(ish) part of the problematic table for me to reproduce the bug locally?

mnietz commented 8 years ago

Sadly not; this is data for tax authorities and therefore confidential. I will check if I'm able to anonymize the important parts to give you a sample.
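
One way to anonymize while keeping the very long text values that seem to matter here is to overwrite them with same-length filler on a throwaway copy of the MySQL database; a sketch (column names follow the foo sample that shows up later in this thread):

update foo
   set sende_info = repeat('x', char_length(sende_info)),
       data       = repeat('x', char_length(data));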

mnietz commented 8 years ago

@dimitri: Have you received my e-mail? I sent you a sample dataset to dim@tapoueh.org. Was it helpful or do you need more information?

dimitri commented 8 years ago

The email made it to my mailbox but I didn't have time to look at it. Now that I do, it's missing the foo table definition...

ERROR 1146 (42S02) at line 21 in file: 'foo.sql': Table 'issue337.foo' doesn't exist 
dimitri commented 8 years ago

Tried to reproduce the problem locally, and all went fine with your data set:

2016-02-15T18:35:48.001000+01:00 NOTICE Starting pgloader, log system is ready.
2016-02-15T18:35:48.615000+01:00 NOTICE DROP then CREATE TABLES
2016-02-15T18:35:48.615000+01:00 NOTICE DROP TABLE IF EXISTS foo CASCADE;
2016-02-15T18:35:48.615000+01:00 WARNING Postgres warning: table "foo" does not exist, skipping
2016-02-15T18:35:48.615000+01:00 NOTICE CREATE TABLE foo 
(
  id                     bigint not null,
  download_kennwort      varchar(30) not null,
  sende_info             text not null,
  program_version        text not null,
  sendeart               bigint not null,
  data                   text not null,
  creation_date          varchar(14) not null,
  creation_date_year     varchar(4) not null,
  creation_date_month    varchar(2) not null,
  creation_date_day      varchar(2) not null,
  creation_date_hour     varchar(2) not null,
  created_by             bigint not null,
  ticket_id              varchar(50) not null,
  origin                 text not null,
  origin_information     text not null,
  validation             bigint not null
);
2016-02-15T18:35:48.621000+01:00 NOTICE CREATE INDEX idx_1414771_cd_dk ON foo (creation_date_year, creation_date_month, creation_date_day, creation_date_hour, download_kennwort);
2016-02-15T18:35:48.621000+01:00 NOTICE CREATE UNIQUE INDEX idx_1414771_primary ON foo (id);
2016-02-15T18:35:48.621000+01:00 NOTICE CREATE INDEX idx_1414771_cd_v ON foo (creation_date, validation);
2016-02-15T18:35:48.621000+01:00 NOTICE CREATE INDEX idx_1414771_validation ON foo (validation);
2016-02-15T18:35:48.621000+01:00 NOTICE CREATE INDEX idx_1414771_ti_dk ON foo (ticket_id, download_kennwort);
2016-02-15T18:35:48.621000+01:00 NOTICE CREATE UNIQUE INDEX idx_1414771_ticket_id_unique ON foo (ticket_id);
2016-02-15T18:35:48.621000+01:00 NOTICE Reset sequences
2016-02-15T18:35:48.826000+01:00 NOTICE ALTER TABLE foo ADD PRIMARY KEY USING INDEX idx_1414771_primary;
2016-02-15T18:35:48.826000+01:00 NOTICE Comments

             table name       read   imported     errors      total time       read      write
-----------------------  ---------  ---------  ---------  --------------  ---------  ---------
        fetch meta data          7          7          0          0.311s                     
           create, drop          0          0          0          0.027s                     
-----------------------  ---------  ---------  ---------  --------------  ---------  ---------
                    foo          1          1          0          0.003s     0.008s    0.003s
-----------------------  ---------  ---------  ---------  --------------  ---------  ---------
COPY Threads Completion          3          3          0          0.042s                     
 Index Build Completion          6          6          0          0.071s                     
         Create Indexes          6          6          0          0.027s                     
        Reset Sequences          0          0          0          0.037s                     
           Primary Keys          1          1          0          0.007s                     
           Foreign Keys          0          0          0          0.000s                     
       Install comments          0          0          0          0.000s                     
-----------------------  ---------  ---------  ---------  --------------  ---------  ---------
      Total import time          1          1          0          0.827s     0.008s    0.003s
mnietz commented 8 years ago

Yes, the problem only occurs when I try to migrate the whole database in one step. Doing the migration table by table works fine. So it might be related to the amount of data within a table or the number of tables. When I find the time I will try to build a full (anonymized) demo dataset.

Thanks for your support.

dimitri commented 8 years ago

Can you play with the concurrency settings (worker count and concurrency) in the meantime, or maybe just the batch size in terms of batch rows?

mnietz commented 8 years ago

Already did that, because otherwise pgloader uses way too much memory, starts swapping, and needs to be killed. But I have to mention that the virtual instance I have available for the migration is pretty small: 4 cores, 4GB RAM.

These are my settings at the moment:

WITH batch rows = 1000,
     batch size = 2MB,
     include drop, create tables, create indexes, no foreign keys
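
For reference, the worker count and concurrency knobs dimitri mentioned live in the same WITH clause; a sketch combining them with the batch settings above (option names as documented in the pgloader reference, values purely illustrative):

WITH workers = 2, concurrency = 1,
     batch rows = 1000,
     batch size = 2MB,
     include drop, create tables, create indexes, no foreign keys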
dimitri commented 8 years ago

Please try with the latest patch. The patch deals better with some kinds of errors that you might have here, even if it doesn't sound like it: those errors weren't properly handled at all before.

mnietz commented 8 years ago

Sorry for the late reply; I have had the same issue with the latest patch.

emailgregn commented 8 years ago

+1, I'm seeing this stuck behavior too. The load succeeds when changed to multiple table-by-table .load scripts using a bash loop and an include only directive:

INCLUDING ONLY TABLE NAMES MATCHING 'xxxyyyzzz' 
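
A minimal per-table .load file along those lines might look like this (a sketch; the connection strings and the table name pattern are placeholders):

LOAD database
  FROM  mysql://user:pass@localhost/src
  INTO  postgresql://user@localhost/dst
  WITH  create tables, create indexes, reset sequences
  INCLUDING ONLY TABLE NAMES MATCHING 'xxxyyyzzz';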
manuca commented 8 years ago

+1, exactly the same issue: it works fine when including a single table in the load script, even when that table is the one where the whole process got stuck previously when importing all the tables.

dimitri commented 8 years ago

Then there might still be something odd in the way pgloader waits for all table loading to be done, which is found at https://github.com/dimitri/pgloader/blob/master/src/sources/common/db-methods.lisp#L245 and following.

As debug mode seems way too verbose here, can you raise the log level on lines 250 and 263 to :log instead and see if you can make sense of the output, or share it here?

Another idea would be to see if there's still a CREATE INDEX command going on while pgloader seems stuck, as it will have to wait until all indexes are ready to reach the next step of completing the catalogs (constraints and comments). You can easily see any such active query in PostgreSQL with the following query:

select pid, waiting, state, query from pg_stat_activity where pid <> pg_backend_pid();
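
Note for anyone replaying this on PostgreSQL 9.6 or later: the boolean waiting column was removed in favor of wait_event_type and wait_event, so the equivalent query is roughly:

select pid, wait_event_type, wait_event, state, query
  from pg_stat_activity
 where pid <> pg_backend_pid();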
manuca commented 8 years ago

I just compiled the current master branch and the issue persists. I had the idea of going back in time until I found a commit that worked and then running git bisect, but I couldn't complete the build when I checked out v3.1.1 or v3.2.2 due to issue #203. I couldn't spend more time figuring this out yet. I'm going to check if changing the log level as you suggest gives us useful info on the current branch.

With respect to my config I have:

LOAD database
        from mysql://[url]
        into postgresql://[url]
        with include no drop,
                disable triggers,
                create no indexes,
                no foreign keys,
                downcase identifiers,
                quote identifiers
        CAST type date to date drop not null drop default using zero-dates-to-null,
             type timestamp to timestamp drop not null drop default using zero-dates-to-null,
             type enum to varchar drop not null drop default using empty-string-to-null
        BEFORE LOAD do
        $$ drop schema public cascade; $$,
        $$ create schema public; $$;

I disabled indexes and other things (does that rule out the CREATE INDEX hypothesis?)

Here is the output, to check if there is anything strange on the Postgres side when pgloader is stuck.

db=# select pid, waiting, state, query from pg_stat_activity where pid <> pg_backend_pid();
  pid  | waiting | state |  query   
-------+---------+-------+----------
 19768 | f       | idle  | SELECT 1
 23422 | f       | idle  | COMMIT
(2 rows)
manuca commented 8 years ago

Some more info: when I stopped the stuck pgloader process with Ctrl-C, I got this dump:

Backtrace for: #<SB-THREAD:THREAD "main thread" RUNNING {1007ADE6B3}>
0: ("bogus stack frame")
1: ((FLET #:WITHOUT-INTERRUPTS-BODY-558 :IN SB-THREAD:CONDITION-WAIT))
2: (SB-THREAD:CONDITION-WAIT #<SB-THREAD:WAITQUEUE Anonymous condition variable {100A5DDD23}> #<SB-THREAD:MUTEX "Anonymous lock" (free)> :TIMEOUT NIL)
3: (LPARALLEL.CONS-QUEUE:POP-CONS-QUEUE/NO-LOCK #<unavailable lambda list>)
4: ((FLET #:WITHOUT-INTERRUPTS-BODY-639 :IN SB-THREAD::CALL-WITH-MUTEX))
5: (SB-THREAD::CALL-WITH-MUTEX #<CLOSURE (FLET SB-THREAD::WITH-MUTEX-THUNK :IN LPARALLEL.CONS-QUEUE:POP-CONS-QUEUE) {7FFFE097F11B}> #<SB-THREAD:MUTEX "Anonymous lock" (free)> NIL
 T NIL)
6: (LPARALLEL.CONS-QUEUE:POP-CONS-QUEUE #<unavailable lambda list>)
7: (LPARALLEL.KERNEL:RECEIVE-RESULT #S(LPARALLEL.KERNEL:CHANNEL :QUEUE #S(LPARALLEL.CONS-QUEUE:CONS-QUEUE :IMPL #S(LPARALLEL.RAW-QUEUE:RAW-QUEUE :HEAD NIL :TAIL NIL) :LOCK #<SB-T
HREAD:MUTEX "Anonymous lock" (free)> :CVAR #<SB-THREAD:WAITQUEUE Anonymous condition variable {100A5DDD23}>) :KERNEL #<LPARALLEL.KERNEL:KERNEL :NAME "lparallel" :WORKER-COUNT 4 :
USE-CALLER NIL :ALIVE T :SPIN-COUNT 2000 {100901AF53}>))
8: ((:METHOD PGLOADER.SOURCES:COPY-DATABASE (PGLOADER.SOURCES:DB-COPY)) #<PGLOADER.MYSQL:COPY-MYSQL {1008E80193}> :ON-ERROR-STOP NIL :WORKER-COUNT 4 :CONCURRENCY 1 :TRUNCATE NIL 
:DISABLE-TRIGGERS T :DATA-ONLY NIL :SCHEMA-ONLY NIL :CREATE-SCHEMAS NIL :CREATE-TABLES T :INCLUDE-DROP NIL :CREATE-INDEXES NIL :INDEX-NAMES :UNIQUIFY :RESET-SEQUENCES T :FOREIGN-
KEYS NIL :ONLY-TABLES NIL :INCLUDING NIL :EXCLUDING NIL :SET-TABLE-OIDS T :ALTER-TABLE NIL :ALTER-SCHEMA NIL :MATERIALIZE-VIEWS NIL) [fast-method]
9: ((LAMBDA ()))
10: (PGLOADER:RUN-COMMANDS #P"/home/manuca/etl.load" :START-LOGGER NIL :SUMMARY NIL :LOG-FILENAME NIL :LOG-MIN-MESSAGES NIL :CLIENT-MIN-MESSAGES NIL)
11: ((FLET #:CLEANUP-FUN-283 :IN PGLOADER::MAIN)) [cleanup]
12: (PGLOADER::MAIN ("/home/manuca/bin/pgloader_master" "-d" "etl.load"))
13: ((LAMBDA NIL :IN "/home/manuca/src/pgloader/dumper-2SKVI5f7.lisp"))
14: ((FLET #:WITHOUT-INTERRUPTS-BODY-85 :IN SAVE-LISP-AND-DIE))
15: ((LABELS SB-IMPL::RESTART-LISP :IN SAVE-LISP-AND-DIE))

An unhandled error condition has been signalled:
   Interactive interrupt at #x7FFFF73E8A07.

Date/time: 2016-03-29-16:03An unhandled error condition has been signalled:
                              Interactive interrupt at #x7FFFF73E8A07.
dimitri commented 8 years ago

Looks like it could be an off-by-one receive-result call, which is used to wait for all tables to be done loading, then for all indexes to be created. The option create no indexes completely disables the index code path, including the receive-result calls.

On a quick code read I can't see the off-by-one, so the next step is going to be reading through the logs as edited (with changes from the :debug level to the :log level for some key messages, all found in pgloader:src/sources/common/methods.lisp and pgloader:src/pgsql/pgsql.lisp).

Thanks for your help and time spent in chasing that one bug!

manuca commented 8 years ago

Thanks @dimitri. I recompiled pgloader raising the log level on those lines you mentioned; here's the output (I renamed the tables). It hangs when it reaches table_9.

Let me know if you need more data. Thanks!

~/bin/pgloader_debug etl.load
WARNING: Setting locale failed.
  Check the following variables for correct values:
  LC_CTYPE=UTF-8
  LANG=en_US.UTF-8
2016-03-31T15:40:21.029000-04:00 LOG Main logs in '/tmp/pgloader/pgloader.log'
2016-03-31T15:40:21.033000-04:00 LOG Data errors in '/tmp/pgloader/'
2016-03-31T15:40:21.033000-04:00 LOG Parsing commands from file #P"/home/manuca/etl.load"
2016-03-31T15:40:24.532000-04:00 LOG Finished processing READER for "table_1"      0.043000s
2016-03-31T15:40:24.532000-04:00 LOG Finished processing WORKER for "table_1"      0.015000s
2016-03-31T15:40:24.532000-04:00 LOG Finished processing WRITER for "table_1"      0.000000s
2016-03-31T15:40:24.532000-04:00 LOG writers-counts[table_1] = 0
2016-03-31T15:40:25.005000-04:00 LOG Finished processing READER for "table_2"          0.400000s
2016-03-31T15:40:46.746000-04:00 LOG Finished processing READER for "table_3"        22.332000s
2016-03-31T15:40:46.747000-04:00 LOG Finished processing WORKER for "table_3"        22.407000s
2016-03-31T15:40:46.947000-04:00 LOG Finished processing WORKER for "table_2"          0.123000s
2016-03-31T15:40:47.564000-04:00 LOG Finished processing WRITER for "table_2"          0.664000s
2016-03-31T15:40:47.564000-04:00 LOG writers-counts[table_2] = 0
2016-03-31T15:40:47.813000-04:00 LOG Finished processing WRITER for "table_3"        8.259000s
2016-03-31T15:40:47.813000-04:00 LOG writers-counts[table_3] = 0
2016-03-31T15:40:49.462000-04:00 LOG Finished processing READER for "table_4"     1.981000s
2016-03-31T15:41:27.624000-04:00 LOG Finished processing READER for "table_5"         39.842000s
2016-03-31T15:41:28.946000-04:00 LOG Finished processing WORKER for "table_5"         64.022000s
2016-03-31T15:41:29.152000-04:00 LOG Finished processing WORKER for "table_4"     1.410000s
2016-03-31T15:41:29.355000-04:00 LOG Finished processing WRITER for "table_5"         15.998000s
2016-03-31T15:41:29.355000-04:00 LOG writers-counts[table_5] = 0
2016-03-31T15:41:29.355000-04:00 LOG Finished processing READER for "table_5"   0.050000s
2016-03-31T15:41:29.355000-04:00 LOG Finished processing WORKER for "table_5"   39.841000s
2016-03-31T15:41:30.366000-04:00 LOG Finished processing WRITER for "table_5"   1.001000s
2016-03-31T15:41:30.366000-04:00 LOG writers-counts[table_5] = 0
2016-03-31T15:41:30.366000-04:00 LOG Finished processing WRITER for "table_4"     1.315000s
2016-03-31T15:41:30.366000-04:00 LOG writers-counts[table_4] = 0
2016-03-31T15:42:59.954000-04:00 LOG Finished processing READER for "table_6"      89.560000s
2016-03-31T15:43:03.330000-04:00 LOG Finished processing WORKER for "table_6"      94.002000s
2016-03-31T15:43:08.297000-04:00 LOG Finished processing WRITER for "table_6"      31.424000s
2016-03-31T15:43:08.297000-04:00 LOG writers-counts[table_6] = 0
2016-03-31T15:43:44.226000-04:00 LOG Finished processing READER for "table_7"  133.814000s
2016-03-31T15:43:45.147000-04:00 LOG Finished processing WORKER for "table_7"  45.207000s
2016-03-31T15:43:45.703000-04:00 LOG Finished processing WRITER for "table_7"  18.762000s
2016-03-31T15:43:45.703000-04:00 LOG writers-counts[table_7] = 0
2016-03-31T15:46:22.503000-04:00 LOG Finished processing READER for "table_8"              194.201000s
2016-03-31T15:46:24.682000-04:00 LOG Finished processing WORKER for "table_8"              160.540000s
2016-03-31T15:46:27.575000-04:00 LOG Finished processing WRITER for "table_8"              59.089000s
2016-03-31T15:46:27.575000-04:00 LOG writers-counts[table_8] = 0
2016-03-31T15:46:27.785000-04:00 LOG Finished processing READER for "table_9"  0.226000s
emailgregn commented 8 years ago

My stack trace ...

2016-04-01T10:17:02.121000+02:00 DEBUG finish-batch[0] 16405 rows in 0.835000s
2016-04-01T10:17:03.318000+02:00 DEBUG finish-batch[0] 24120 rows in 1.002000s
2016-04-01T10:17:04.209000+02:00 DEBUG finish-batch[0] 18274 rows in 1.016000s
2016-04-01T10:17:05.152000+02:00 DEBUG finish-batch[0] 22855 rows in 0.970000s
2016-04-01T10:17:06.092000+02:00 DEBUG finish-batch[0] 22631 rows in 0.937000s
2016-04-01T10:17:07.442000+02:00 DEBUG finish-batch[0] 25928 rows in 1.319000s
2016-04-01T10:17:08.180000+02:00 DEBUG finish-batch[0] 15360 rows in 0.619000s
2016-04-01T10:17:08.911000+02:00 DEBUG finish-batch[0] 14808 rows in 0.565000s
2016-04-01T10:17:09.569000+02:00 DEBUG finish-batch[0] 14525 rows in 0.956000s
2016-04-01T10:17:10.508000+02:00 DEBUG finish-batch[0] 23469 rows in 0.954000s

^CAn unhandled error condition has been signalled:
   Interactive interrupt at #x7FFFF73EB9B7.

Date/time: 2016-04-01-10:38An unhandled error condition has been signalled:
                              Interactive interrupt at #x7FFFF73EB9B7.

Backtrace for: #<SB-THREAD:THREAD "main thread" RUNNING {1007AE6793}>
0: ("bogus stack frame")
1: ((FLET #:WITHOUT-INTERRUPTS-BODY-558 :IN SB-THREAD:CONDITION-WAIT))
2: (SB-THREAD:CONDITION-WAIT #<SB-THREAD:WAITQUEUE Anonymous condition variable {100C8A0653}> #<SB-THREAD:MUTEX "Anonymous lock" (free)> :TIMEOUT NIL)
3: (LPARALLEL.CONS-QUEUE:POP-CONS-QUEUE/NO-LOCK #<unavailable lambda list>)
4: ((FLET #:WITHOUT-INTERRUPTS-BODY-639 :IN SB-THREAD::CALL-WITH-MUTEX))
5: (SB-THREAD::CALL-WITH-MUTEX #<CLOSURE (FLET SB-THREAD::WITH-MUTEX-THUNK :IN LPARALLEL.CONS-QUEUE:POP-CONS-QUEUE) {7FFFE09870DB}> #<SB-THREAD:MUTEX "Anonymous lock" (free)> NIL T NIL)
6: (LPARALLEL.CONS-QUEUE:POP-CONS-QUEUE #<unavailable lambda list>)
7: (LPARALLEL.KERNEL:RECEIVE-RESULT #S(LPARALLEL.KERNEL:CHANNEL :QUEUE #S(LPARALLEL.CONS-QUEUE:CONS-QUEUE :IMPL #S(LPARALLEL.RAW-QUEUE:RAW-QUEUE :HEAD NIL :TAIL NIL) :LOCK #<SB-THREAD:MUTEX "Anonymous lock" (free)> :CVAR #<SB-THREAD:WAITQUEUE Anonymous condition variable {100C8A0653}>) :KERNEL #<LPARALLEL.KERNEL:KERNEL :NAME "lparallel" :WORKER-COUNT 4 :USE-CALLER NIL :ALIVE T :SPIN-COUNT 2000 {1008F8A553}>))
8: ((:METHOD PGLOADER.SOURCES:COPY-DATABASE (PGLOADER.SOURCES:DB-COPY)) #<PGLOADER.MYSQL:COPY-MYSQL {1008DF0193}> :ON-ERROR-STOP NIL :WORKER-COUNT 4 :CONCURRENCY 1 :TRUNCATE NIL :DISABLE-TRIGGERS NIL :DATA-ONLY NIL :SCHEMA-ONLY NIL :CREATE-SCHEMAS NIL :CREATE-TABLES T :INCLUDE-DROP T :CREATE-INDEXES T :INDEX-NAMES :UNIQUIFY :RESET-SEQUENCES T :FOREIGN-KEYS T :ONLY-TABLES NIL :INCLUDING NIL :EXCLUDING NIL :SET-TABLE-OIDS T :ALTER-TABLE NIL :ALTER-SCHEMA NIL :MATERIALIZE-VIEWS NIL) [fast-method]
9: ((LAMBDA ()))
10: (PGLOADER:RUN-COMMANDS #P"/home/xxxxxx/Projects/pgloader_scripts/DEBUG.load" :START-LOGGER NIL :SUMMARY NIL :LOG-FILENAME NIL :LOG-MIN-MESSAGES NIL :CLIENT-MIN-MESSAGES NIL)
11: ((FLET #:CLEANUP-FUN-283 :IN PGLOADER::MAIN)) [cleanup]
12: (PGLOADER::MAIN ("pgloader" "--debug" "DEBUG.load"))
13: ((LAMBDA NIL :IN "/home/xxxxxx/src/pgloader/dumper-2SKVI5f7.lisp"))
14: ((FLET #:WITHOUT-INTERRUPTS-BODY-85 :IN SAVE-LISP-AND-DIE))
15: ((LABELS SB-IMPL::RESTART-LISP :IN SAVE-LISP-AND-DIE))

An unhandled error condition has been signalled:
   Interactive interrupt at #x7FFFF73EB9B7.

Date/time: 2016-04-01-10:38An unhandled error condition has been signalled:
                              Interactive interrupt at #x7FFFF73EB9B7.

Backtrace for: #<SB-THREAD:THREAD "main thread" RUNNING {1007AE6793}>
0: ((LAMBDA NIL :IN SB-DEBUG::FUNCALL-WITH-DEBUG-IO-SYNTAX))
1: (SB-IMPL::CALL-WITH-SANE-IO-SYNTAX #<CLOSURE (LAMBDA NIL :IN SB-DEBUG::FUNCALL-WITH-DEBUG-IO-SYNTAX) {101C5490BB}>)
2: (SB-IMPL::%WITH-STANDARD-IO-SYNTAX #<CLOSURE (LAMBDA NIL :IN SB-DEBUG::FUNCALL-WITH-DEBUG-IO-SYNTAX) {101C54908B}>)
3: (PRINT-BACKTRACE :STREAM #<SYNONYM-STREAM :SYMBOL SB-SYS:*STDOUT* {1000287203}> :START 0 :FROM :DEBUGGER-FRAME :COUNT 4611686018427387903 :PRINT-THREAD T :PRINT-FRAME-SOURCE NIL :METHOD-FRAME-STYLE NIL)
4: (TRIVIAL-BACKTRACE:PRINT-BACKTRACE-TO-STREAM #<SYNONYM-STREAM :SYMBOL SB-SYS:*STDOUT* {1000287203}>)
5: (TRIVIAL-BACKTRACE:PRINT-BACKTRACE #<SB-SYS:INTERACTIVE-INTERRUPT {101C4F8113}> :OUTPUT #<SYNONYM-STREAM :SYMBOL SB-SYS:*STDOUT* {1000287203}> :IF-EXISTS :APPEND :VERBOSE T)
6: ((FLET #:CLEANUP-FUN-283 :IN PGLOADER::MAIN)) [cleanup]
7: (PGLOADER::MAIN ("pgloader" "--debug" "DEBUG.load"))
8: ((LAMBDA NIL :IN "/home/xxxxxx/src/pgloader/dumper-2SKVI5f7.lisp"))
9: ((FLET #:WITHOUT-INTERRUPTS-BODY-85 :IN SAVE-LISP-AND-DIE))
10: ((LABELS SB-IMPL::RESTART-LISP :IN SAVE-LISP-AND-DIE))
emailgregn commented 8 years ago

This is another run, pgloader recompiled with LOG output.

$ pgloader DEBUG.load 
2016-04-01T10:47:15.025000+02:00 LOG Main logs in '/tmp/pgloader/pgloader.log'
2016-04-01T10:47:15.030000+02:00 LOG Data errors in '/tmp/pgloader/'
2016-04-01T10:47:15.030000+02:00 LOG Parsing commands from file #P"/home/xxxxxx/Projects/pgloader_scripts/DEBUG.load"
2016-04-01T10:50:27.545000+02:00 LOG Finished processing READER for "table_1"          190.655000s
2016-04-01T10:50:27.545000+02:00 LOG Finished processing WORKER for "table_1"          190.867000s
2016-04-01T10:50:28.259000+02:00 LOG Finished processing WRITER for "table_1"          93.702000s
2016-04-01T10:50:28.260000+02:00 LOG writers-counts[table_1] = 0
2016-04-01T10:50:32.749000+02:00 LOG Finished processing READER for "table_3"       4.243000s
2016-04-01T10:54:58.057000+02:00 LOG Finished processing READER for "table_2"      270.708000s
2016-04-01T10:54:58.502000+02:00 LOG Finished processing WORKER for "table_2"      461.645000s
2016-04-01T10:54:58.703000+02:00 LOG Finished processing WRITER for "table_2"      123.545000s
2016-04-01T10:54:58.703000+02:00 LOG writers-counts[table_2] = 0
2016-04-01T10:54:59.686000+02:00 LOG Finished processing WORKER for "table_3"       1.515000s
2016-04-01T10:54:59.887000+02:00 LOG Finished processing WRITER for "table_3"       0.635000s
2016-04-01T10:54:59.887000+02:00 LOG writers-counts[table_3] = 0
2016-04-01T10:57:28.785000+02:00 LOG Finished processing READER for "table_4"        150.171000s
2016-04-01T10:57:29.247000+02:00 LOG Finished processing WORKER for "table_4"        416.597000s
2016-04-01T10:57:29.448000+02:00 LOG Finished processing WRITER for "table_4"        76.191000s
2016-04-01T10:57:29.448000+02:00 LOG writers-counts[table_4] = 0

And the monitoring...

database01=# select CURRENT_TIMESTAMP , pid, waiting, state, query from pg_stat_activity where pid <> pg_backend_pid();
              now              |  pid  | waiting | state | query  
-------------------------------+-------+---------+-------+--------
 2016-04-01 08:49:02.085738+00 | 28395 | f       | idle  | 
 2016-04-01 08:49:02.085738+00 | 17080 | f       | idle  | COMMIT
(2 rows)

database01=# select CURRENT_TIMESTAMP , pid, waiting, state, query from pg_stat_activity where pid <> pg_backend_pid();
              now              |  pid  | waiting | state  |                                                                      query                                    

-------------------------------+-------+---------+--------+------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------
 2016-04-01 08:50:32.219118+00 | 28395 | f       | idle   | 
 2016-04-01 08:50:32.219118+00 | 17510 | f       | active | CREATE UNIQUE INDEX idx_36116_primary ON table_1 (table_1_id);
 2016-04-01 08:50:32.219118+00 | 17509 | f       | active |                                                                                    
                                                                                          +
                               |       |         |        | copy table_2 (table_2_id,...<SNIP>...) FROM STDIN
(3 rows)

database01=# select CURRENT_TIMESTAMP , pid, waiting, state, query from pg_stat_activity where pid <> pg_backend_pid();
              now              |  pid  | waiting | state  |                                                                      query         

-------------------------------+-------+---------+--------+------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------
 2016-04-01 08:54:12.311957+00 | 28395 | f       | idle   | 
 2016-04-01 08:54:12.311957+00 | 17597 | f       | active | autovacuum: ANALYZE qqqqqq.table_2
 2016-04-01 08:54:12.311957+00 | 17509 | f       | active |                                                                                    
                                                                                          +
                               |       |         |        | copy table_2 (table_2_id,...<SNIP>...) FROM STDIN
(3 rows)

database01=# select CURRENT_TIMESTAMP , pid, waiting, state, query from pg_stat_activity where pid <> pg_backend_pid();
              now              |  pid  | waiting |        state        |                                                                       
                  query                                                                                                                     
-------------------------------+-------+---------+---------------------+-----------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------
 2016-04-01 08:55:20.010652+00 | 28395 | f       | idle                | 
 2016-04-01 08:55:20.010652+00 | 17611 | f       | idle in transaction |                                                                       
                                                                                                                                           +
                               |       |         |                     | copy table_4 (table_4_id,...<SNIP>...) FROM STDIN
(2 rows)

database01=# select CURRENT_TIMESTAMP , pid, waiting, state, query from pg_stat_activity where pid <> pg_backend_pid();
              now              |  pid  | waiting | state | query  
-------------------------------+-------+---------+-------+--------
 2016-04-01 08:57:02.173661+00 | 28395 | f       | idle  | 
 2016-04-01 08:57:02.173661+00 | 17611 | f       | idle  | COMMIT
(2 rows)

database01=# select CURRENT_TIMESTAMP , pid, waiting, state, query from pg_stat_activity where pid <> pg_backend_pid();
             now              |  pid  | waiting | state | query  
------------------------------+-------+---------+-------+--------
 2016-04-01 09:00:39.99215+00 | 28395 | f       | idle  | 
 2016-04-01 09:00:39.99215+00 | 17648 | f       | idle  | COMMIT
(2 rows)

database01=# select CURRENT_TIMESTAMP , pid, waiting, state, query from pg_stat_activity where pid <> pg_backend_pid();
              now              |  pid  | waiting | state | query  
-------------------------------+-------+---------+-------+--------
 2016-04-01 09:13:07.749294+00 | 28395 | f       | idle  | 
 2016-04-01 09:13:07.749294+00 | 17648 | f       | idle  | COMMIT
(2 rows)
manuca commented 8 years ago

@dimitri, any clue about what could be causing this issue, and a possible solution? Thanks

dimitri commented 8 years ago

It still looks like a one-too-many call to receive-result, which blocks forever when there's no result left to receive. There are several ways to fix that issue; the best one would be to understand why we expect a result that never shows up... maybe another thread has some unreported error or is waiting for us in the interactive debugger or something, although that's quite unlikely.

Is there a way that you can send me a database dump from which I could reproduce the problem locally?

manuca commented 8 years ago

Unfortunately I can't send you a dump of this DB; I wish I could reproduce the bug on a DB without sensitive information.

dimitri commented 8 years ago

Can you try to reproduce with the schema only, no data? If the bug still happens, maybe you can share it? Other than that, if signing some acceptable papers allows us to work more closely here (sharing a cloud test instance or something), let's do that.

emailgregn commented 8 years ago

I just tried a migration with the Docker image dimitri/pgloader:ccl.latest and a pg.load file pointing to a source schema with 6 tables. It hung about halfway through, but did process the largest table successfully.

dimitri commented 8 years ago

Can you share the source schema in order for me to reproduce the bug locally and get it fixed, please? Either here if possible, or by private email (see my profile). Thanks!

dimitri commented 8 years ago

I got enough data to be able to reproduce a case that looks like this one; please retry!

mnietz commented 8 years ago

Looks good. I did a test with a table that previously ran into that problem frequently; the migration has worked three times in a row.

When I have the time I will try a test migration of the whole database again.

emailgregn commented 8 years ago

Sorry, still hanging for me. Here's the stack trace after ^C-ing it.

^CAn unhandled error condition has been signalled:
   Interactive interrupt at #x7FFFF73EB9B7.

Date/time: 2016-04-21-15:47An unhandled error condition has been signalled:
                              Interactive interrupt at #x7FFFF73EB9B7.

Backtrace for: #<SB-THREAD:THREAD "main thread" RUNNING {1007B06793}>
0: ("bogus stack frame")
1: ((FLET #:WITHOUT-INTERRUPTS-BODY-558 :IN SB-THREAD:CONDITION-WAIT))
2: (SB-THREAD:CONDITION-WAIT #<SB-THREAD:WAITQUEUE Anonymous condition variable {100C107723}> #<SB-THREAD:MUTEX "Anonymous lock" (free)> :TIMEOUT NIL)
3: (LPARALLEL.CONS-QUEUE:POP-CONS-QUEUE/NO-LOCK #<unavailable lambda list>)
4: ((FLET #:WITHOUT-INTERRUPTS-BODY-639 :IN SB-THREAD::CALL-WITH-MUTEX))
5: (SB-THREAD::CALL-WITH-MUTEX #<CLOSURE (FLET SB-THREAD::WITH-MUTEX-THUNK :IN LPARALLEL.CONS-QUEUE:POP-CONS-QUEUE) {7FFFE09870CB}> #<SB-THREAD:MUTEX "Anonymous lock" (free)> NIL T NIL)
6: (LPARALLEL.CONS-QUEUE:POP-CONS-QUEUE #<unavailable lambda list>)
7: (LPARALLEL.KERNEL:RECEIVE-RESULT #S(LPARALLEL.KERNEL:CHANNEL :QUEUE #S(LPARALLEL.CONS-QUEUE:CONS-QUEUE :IMPL #S(LPARALLEL.RAW-QUEUE:RAW-QUEUE :HEAD NIL :TAIL NIL) :LOCK #<SB-THREAD:MUTEX "Anonymous lock" (free)> :CVAR #<SB-THREAD:WAITQUEUE Anonymous condition variable {100C107723}>) :KERNEL #<LPARALLEL.KERNEL:KERNEL :NAME "lparallel" :WORKER-COUNT 4 :USE-CALLER NIL :ALIVE T :SPIN-COUNT 2000 {1008FE2953}>))
8: ((:METHOD PGLOADER.SOURCES:COPY-DATABASE (PGLOADER.SOURCES:DB-COPY)) #<PGLOADER.MYSQL:COPY-MYSQL {1008E465C3}> :ON-ERROR-STOP NIL :WORKER-COUNT 4 :CONCURRENCY 1 :MAX-PARALLEL-CREATE-INDEX NIL :TRUNCATE NIL :DISABLE-TRIGGERS NIL :DATA-ONLY NIL :SCHEMA-ONLY NIL :CREATE-SCHEMAS NIL :CREATE-TABLES T :INCLUDE-DROP T :CREATE-INDEXES T :INDEX-NAMES :UNIQUIFY :RESET-SEQUENCES T :FOREIGN-KEYS T :ONLY-TABLES NIL :INCLUDING NIL :EXCLUDING NIL :SET-TABLE-OIDS T :ALTER-TABLE NIL :ALTER-SCHEMA NIL :MATERIALIZE-VIEWS NIL) [fast-method]
9: ((LAMBDA ()))
10: (PGLOADER:RUN-COMMANDS #P"/home/xxxxxx/Projects/pgloader_scripts/DEBUG.load" :START-LOGGER NIL :SUMMARY NIL :LOG-FILENAME NIL :LOG-MIN-MESSAGES NIL :CLIENT-MIN-MESSAGES NIL)
11: ((FLET #:CLEANUP-FUN-283 :IN PGLOADER::MAIN)) [cleanup]
12: (PGLOADER::MAIN ("pgloader" "--debug" "DEBUG.load"))
13: ((LAMBDA NIL :IN "/home/xxxxxx/bak.src/pgloader/dumper-2SKVI5f7.lisp"))
14: ((FLET #:WITHOUT-INTERRUPTS-BODY-85 :IN SAVE-LISP-AND-DIE))
15: ((LABELS SB-IMPL::RESTART-LISP :IN SAVE-LISP-AND-DIE))

An unhandled error condition has been signalled:
   Interactive interrupt at #x7FFFF73EB9B7.

Date/time: 2016-04-21-15:47An unhandled error condition has been signalled:
                              Interactive interrupt at #x7FFFF73EB9B7.

Backtrace for: #<SB-THREAD:THREAD "main thread" RUNNING {1007B06793}>
0: ((LAMBDA NIL :IN SB-DEBUG::FUNCALL-WITH-DEBUG-IO-SYNTAX))
1: (SB-IMPL::CALL-WITH-SANE-IO-SYNTAX #<CLOSURE (LAMBDA NIL :IN SB-DEBUG::FUNCALL-WITH-DEBUG-IO-SYNTAX) {10241EA17B}>)
2: (SB-IMPL::%WITH-STANDARD-IO-SYNTAX #<CLOSURE (LAMBDA NIL :IN SB-DEBUG::FUNCALL-WITH-DEBUG-IO-SYNTAX) {10241EA14B}>)
3: (PRINT-BACKTRACE :STREAM #<SYNONYM-STREAM :SYMBOL SB-SYS:*STDOUT* {100028F203}> :START 0 :FROM :DEBUGGER-FRAME :COUNT 4611686018427387903 :PRINT-THREAD T :PRINT-FRAME-SOURCE NIL :METHOD-FRAME-STYLE NIL)
4: (TRIVIAL-BACKTRACE:PRINT-BACKTRACE-TO-STREAM #<SYNONYM-STREAM :SYMBOL SB-SYS:*STDOUT* {100028F203}>)
5: (TRIVIAL-BACKTRACE:PRINT-BACKTRACE #<SB-SYS:INTERACTIVE-INTERRUPT {1024198113}> :OUTPUT #<SYNONYM-STREAM :SYMBOL SB-SYS:*STDOUT* {100028F203}> :IF-EXISTS :APPEND :VERBOSE T)
6: ((FLET #:CLEANUP-FUN-283 :IN PGLOADER::MAIN)) [cleanup]
7: (PGLOADER::MAIN ("pgloader" "--debug" "DEBUG.load"))
8: ((LAMBDA NIL :IN "/home/xxxxxx/bak.src/pgloader/dumper-2SKVI5f7.lisp"))
9: ((FLET #:WITHOUT-INTERRUPTS-BODY-85 :IN SAVE-LISP-AND-DIE))
10: ((LABELS SB-IMPL::RESTART-LISP :IN SAVE-LISP-AND-DIE))

[xxxxxx@ZZZZZZ pgloader_scripts]$ pgloader --version
pgloader version "3.3.7af6c7a"
compiled with SBCL 1.2.8-1.el7
[xxxxxx@ZZZZZZ pgloader_scripts]$
dimitri commented 8 years ago

Can you please try to run it in a developer environment rather than from the command line, so that you can easily access the interactive debugger and maybe get a better stack trace (this one looks like the generic off-by-one error that seems to happen way after the real problems we need to solve).

To do that, install a recent enough SBCL (anything 1.2 or 1.3 era, see http://sbcl.org), then install Quicklisp on it (see https://www.quicklisp.org/beta/#installation), then at the prompt do:

$ sbcl
* (ql:quickload "pgloader")
* (pgloader::run-commands "path/to/file.load" :client-min-messages :notice)

And let's see what happens then. You should reach the interactive debugger; type back to get a backtrace, then ^D your way out to the shell.

mnietz commented 8 years ago

OK, tried it again with the whole database and I still have this issue. It's a bit strange that the single table worked so well.

dimitri commented 8 years ago

Do you have any new logs or traces or things that would help figure out the situation? Can you do the “lisp developer” test as above?

StephenChan commented 8 years ago

I'm getting this issue now. As others have reported, pgloader ports my tables fine individually, but not all together. More generally, it seems that porting a smaller subset of my tables works, but porting all or a larger subset of my tables results in an indefinite hang.

The hang happens when pgloader is built with CCL according to the Dockerfile. Previously I was trying with an SBCL build of pgloader, and was running into issue #327 instead ("Heap exhausted, game over").

I think I just did a typical installation of SBCL, but it brought me to a debugger upon the heap-exhaustion message. From there I typed back. Here's the output if it helps any:

Heap exhausted during garbage collection: 480 bytes available, 832 requested.
 Gen StaPg UbSta LaSta LUbSt Boxed Unboxed LB   LUB  !move  Alloc  Waste   Trig    WP  GCs Mem-age
   0:     0     0     0     0     0     0     0     0     0        0     0 42949672    0   0  0.0000
   1:     0     0     0     0     0     0     0     0     0        0     0 42949672    0   0  0.0000
   2: 113964 124237     0     0  3221 34488     5     0    11 1226768288 9044064 42949672    0   0  1.0415
   3: 81167 122649     0     0  4492 42335   304     0    47 1538635040 5753568 42949672 4676   0  0.4966
   4: 14436 45001     0     0  3636 19584   336     0    99 767806992 4076016  2000000 3802   0  0.1759
   5:     0     0     0     0     0     0     0     0     0        0     0  2000000    0   0  0.0000
   6:     0     0     0     0  2427  1135     0     0     0 116719616     0  2000000 2258   0  0.0000
   Total bytes allocated    = 4271938096
   Dynamic-space-size bytes = 4294967296
GC control variables:
   *GC-INHIBIT* = true
   *GC-PENDING* = true
   *STOP-FOR-GC-PENDING* = false
fatal error encountered in SBCL pid 16470(tid 140737026979584):
Heap exhausted, game over.

Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb> back
Backtrace:
   0: SB-KERNEL::%MAKE-ARRAY, (LAMBDA (SB-IMPL::A SB-IMPL::B) :IN SB-KERNEL::%MAKE-ARRAY), (LAMBDA (SB-IMPL::VALUE) :IN SB-IMPL::%SAVE-DISPLACED-ARRAY-BACKPOINTER), pc = 0x10001cda9a, fp = 0x7fffe47fe038
   1: PGLOADER.PGSQL::FORMAT-VECTOR-ROW, pc = 0x10029b193f, fp = 0x7fffe47fe0f0
   2: (SB-PCL::FAST-METHOD PGLOADER.SOURCES::FORMAT-DATA-TO-COPY (PGLOADER.SOURCES::COPY T T)), (FLET H0 :IN PGLOADER.SOURCES::FORMAT-DATA-TO-COPY), pc = 0x1003c8fd49, fp = 0x7fffe47fe248
   3: LPARALLEL.KERNEL::MAKE-CHANNELED-TASK, (LAMBDA () :IN LPARALLEL.KERNEL::MAKE-CHANNELED-TASK), (FLET BODY-FN635 :IN LPARALLEL.KERNEL::MAKE-CHANNELED-TASK), pc = 0x1002c343fe, fp = 0x7fffe47fe308
   4: LPARALLEL.KERNEL::MAKE-CHANNELED-TASK, (LAMBDA () :IN LPARALLEL.KERNEL::MAKE-CHANNELED-TASK), (FLET BODY-FN635 :IN LPARALLEL.KERNEL::MAKE-CHANNELED-TASK), pc = 0x1002c3428b, fp = 0x7fffe47fe3c8
   5: LPARALLEL.KERNEL::EXEC-TASK/WORKER, pc = 0x1002c36430, fp = 0x7fffe47fe440
   6: LPARALLEL.KERNEL::WORKER-LOOP, pc = 0x1002c3ddff, fp = 0x7fffe47fe4c0
   7: LPARALLEL.KERNEL::%CALL-WITH-TASK-HANDLER, pc = 0x1002c388ff, fp = 0x7fffe47fe510
   8: LPARALLEL.KERNEL::CALL-WITH-WORKER-CONTEXT, (LAMBDA () :IN LPARALLEL.KERNEL::CALL-WITH-WORKER-CONTEXT), pc = 0x1002c4321c, fp = 0x7fffe47fe5d8
   9: LPARALLEL.KERNEL::CALL-WITH-WORKER-CONTEXT, (LAMBDA () :IN LPARALLEL.KERNEL::CALL-WITH-WORKER-CONTEXT), pc = 0x1002c42e27, fp = 0x7fffe47fe678
  10: LPARALLEL.KERNEL::MAKE-WORKER-THREAD, (LAMBDA () :IN LPARALLEL.KERNEL::MAKE-WORKER-THREAD), pc = 0x1002c40299, fp = 0x7fffe47fe700
  11: BORDEAUX-THREADS::BINDING-DEFAULT-SPECIALS, (LAMBDA () :IN BORDEAUX-THREADS::BINDING-DEFAULT-SPECIALS), pc = 0x100352d172, fp = 0x7fffe47fe798
  12: SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE, (FLET SB-THREAD::WITH-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE), (LAMBDA (COND) :IN "SYS:SRC;CODE;TARGET-THREAD.LISP"), (LAMBDA (STREAM) :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE), (LAMBDA () :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE), (FLET SB-THREAD::WITH-SYSTEM-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE), (FLET SB-THREAD::WITH-SYSTEM-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE), pc = 0x100019184f, fp = 0x7fffe47fe8f0
  13: SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE, (FLET SB-THREAD::WITH-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE), (LAMBDA (COND) :IN "SYS:SRC;CODE;TARGET-THREAD.LISP"), (LAMBDA (STREAM) :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE), (LAMBDA () :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE), (FLET SB-THREAD::WITH-SYSTEM-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE), (FLET SB-THREAD::WITH-SYSTEM-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE), pc = 0x1000190d7c, fp = 0x7fffe47feab0
  14: SB-THREAD::CALL-WITH-MUTEX, pc = 0x100018194f, fp = 0x7fffe47feb78
  15: SB-THREAD::CALL-WITH-MUTEX, pc = 0x1000181574, fp = 0x7fffe47fec40
  16: SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE, (FLET SB-THREAD::WITH-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE), (LAMBDA (COND) :IN "SYS:SRC;CODE;TARGET-THREAD.LISP"), (LAMBDA (STREAM) :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE), (LAMBDA () :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE), (FLET SB-THREAD::WITH-SYSTEM-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE), (FLET SB-THREAD::WITH-SYSTEM-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE), pc = 0x1000190546, fp = 0x7fffe47fedf0
  17: Foreign function call_into_lisp, pc = 0x4280ef, fp = 0x7fffe47fee30
  18: Foreign function new_thread_trampoline, pc = 0x41e04a, fp = 0x7fffe47feee0

Also for what it's worth: When the hang happened on CCL, I checked the process through a second SSH session, and ps aux showed a virtual memory size (VSZ) of 537858452. I'm not 100% sure how to read that number, but it seemed a little bit high.

SBCL got its heap exhausted with a large single table from my database, which CCL was able to handle. My impression so far is that CCL merely increased the memory capacity or made it not leak quite as fast compared to SBCL, but they both ran into a similar memory usage problem.

I'm wondering, is there any evidence so far that "port more than x GB of data" isn't the only criterion for this to happen (with x varying between different CL implementations and different systems)?

StephenChan commented 8 years ago

Oh hey, it seems that one of my tables JUST got large enough to trigger the CCL hang by itself. Bad timing for my work, but perhaps good timing for debugging this.

Here's the MySQL table schema as output by show columns, omitting the column names for privacy:

+------------------+----------+------+-----+---------+----------------+
| Field            | Type     | Null | Key | Default | Extra          |
+------------------+----------+------+-----+---------+----------------+
| <name>           | int(11)  | NO   | PRI | NULL    | auto_increment |
| <name>           | datetime | NO   |     | NULL    |                |
| <name>           | int(11)  | NO   | MUL | NULL    |                |
| <name>           | int(11)  | NO   | MUL | NULL    |                |
| <name>           | int(11)  | NO   | MUL | NULL    |                |
| <name>           | int(11)  | NO   | MUL | NULL    |                |
| <name>           | int(11)  | NO   | MUL | NULL    |                |
| <name>           | int(11)  | YES  | MUL | NULL    |                |
+------------------+----------+------+-----+---------+----------------+
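
For reference, that layout corresponds roughly to the following MySQL DDL (a sketch; the id/date/key* names are borrowed from the anonymized reconstruction that appears further down the thread):

CREATE TABLE test_table (
  id    int(11)  NOT NULL AUTO_INCREMENT,
  date  datetime NOT NULL,
  key1  int(11)  NOT NULL,
  key2  int(11)  NOT NULL,
  key3  int(11)  NOT NULL,
  key4  int(11)  NOT NULL,
  key5  int(11)  NOT NULL,
  key6  int(11)  NULL,
  PRIMARY KEY (id),
  KEY (key1), KEY (key2), KEY (key3),
  KEY (key4), KEY (key5), KEY (key6)
);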

And when I told pgloader to port only this table, it got stuck after 8496570 rows.

This is with pgloader git-cloned 2-3 days ago.

dimitri commented 8 years ago

I still don't believe there's a direct correlation between the number of rows and some memory-pressure limit here, because I have had reports of success on much larger data sets and because we batch the rows. The memory exhaustion seems to be a factor of how large an individual batch of rows turns out to be and how many of those batches are concurrently filled and available in memory.

Also, you're saying you can reproduce the hang rather than the exhaustion, and that's a different thing. Is it possible for you to share a database dump of only that table so that I can reproduce the problem locally (with some luck)? It's only integers and a datetime, so I guess there's no need to filter the data for anonymity... You'll find my email on my GitHub profile if you can share the data!

StephenChan commented 8 years ago

All right, I just emailed you a MySQL dump that gets the CCL hang. I was able to rename my columns and overwrite my data with integer 1s and a fixed date, and the CCL hang still happened. In addition to giving me data privacy, this should allow you to add and remove rows freely for testing, without worrying about adhering to some obscure data pattern.

After more experimentation, it seems that the PostgreSQL maintenance_work_mem and work_mem values are quite relevant here:

SET maintenance_work_mem to '64MB', work_mem to '4MB' finishes after about 5-6 minutes. (These are the PostgreSQL 9.5 defaults.)
SET maintenance_work_mem to '128MB', work_mem to '12MB' hangs. (These values are in the MySQL example in the pgloader docs.)
SET maintenance_work_mem to '256MB', work_mem to '32MB' hangs.

Here's an example of it finishing with the PostgreSQL 9.5 defaults:

             table name       read   imported     errors      total time       read      write
-----------------------  ---------  ---------  ---------  --------------  ---------  ---------
        fetch meta data          8          8          0          0.043s        
           create, drop          0          0          0          1.089s        
-----------------------  ---------  ---------  ---------  --------------  ---------  ---------
             test_table    8494733    8494733          0       2m18.710s  3m25.415s 2m18.707s
-----------------------  ---------  ---------  ---------  --------------  ---------  ---------
COPY Threads Completion          3          3          0       3m28.630s        
 Index Build Completion          7          7          0       2m31.947s        
         Create Indexes          7          7          0      17m11.154s        
        Reset Sequences          0          1          0          1.918s        
           Primary Keys          1          1          0          0.149s        
           Foreign Keys          0          0          0          0.000s        
       Install comments          0          0          0          0.000s        
-----------------------  ---------  ---------  ---------  --------------  ---------  ---------
      Total import time    8494733    8494733          0        6m4.934s  3m25.415s 2m18.707s

The table has 8494733 rows. If I go into MySQL and run delete from test_table where id > 10700000; to cut it down to 8008090 rows, then SET maintenance_work_mem to '128MB', work_mem to '12MB' finishes after about 10 minutes. Here's an example:

             table name       read   imported     errors      total time       read      write
-----------------------  ---------  ---------  ---------  --------------  ---------  ---------
        fetch meta data          8          8          0          0.018s        
           create, drop          0          0          0          1.352s        
-----------------------  ---------  ---------  ---------  --------------  ---------  ---------
             test_table    8008090    8008090          0        2m9.801s  3m12.179s  2m9.799s
-----------------------  ---------  ---------  ---------  --------------  ---------  ---------
COPY Threads Completion          3          3          0       3m14.934s        
 Index Build Completion          7          7          0       6m53.889s        
         Create Indexes          7          7          0       44m1.343s        
        Reset Sequences          0          1          0          1.458s        
           Primary Keys          1          1          0          0.081s        
           Foreign Keys          0          0          0          0.000s        
       Install comments          0          0          0          0.000s        
-----------------------  ---------  ---------  ---------  --------------  ---------  ---------
      Total import time    8008090    8008090          0      10m12.954s  3m12.179s  2m9.799s

When it hangs, if I do ps aux | grep pgloader in a second terminal, I can see that the running time of the pgloader process doesn't change anymore; it's typically stuck at around 4 minutes. I typically waited anywhere from 10-20 minutes before deciding that it must have hung. I wouldn't rule out the idea that the hang scenarios described above could have eventually finished if I had waited longer. However, even ignoring the hang cases, if you compare these timing stats with the first set of timing stats, something seems a little odd.

I'm running pgloader with a local MySQL 5.5 database and a remote PostgreSQL 9.5 database (an Amazon RDS instance). Besides setting maintenance_work_mem and work_mem, no other options are required in the load command file to reproduce the hang.
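
For what it's worth, pgloader's SET clause is applied as per-session settings on the target connection, so the variants above amount to running something like this on the PostgreSQL side before the COPY (a sketch):

SET maintenance_work_mem TO '128MB';
SET work_mem TO '12MB';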

Also, in my previous post I mentioned the SBCL heap exhaustion, since earlier you suggested running with SBCL to get a traceback. The heap exhaustion is what happened when I tried running with SBCL (in the case where CCL would hang). I guess I should leave that aside for now though.

dimitri commented 8 years ago

Thanks a lot for sharing the data! Unfortunately I couldn't reproduce the error yet. Here's what I tried:

load database
   from mysql://root@localhost/annotationtest
   into pgsql:///annotationtest

    set maintenance_work_mem to '128MB', work_mem to '12MB';

And here's what I got:

CL-USER> (pgloader:run-commands "/Users/dim/dev/temp/pgloader-issues/337/annotationtest.load" :client-min-messages :notice)
2016-05-22T23:36:55.000446+02:00 NOTICE Starting pgloader, log system is ready.
2016-05-22T23:36:55.000936+02:00 LOG Parsing commands from file #P"/Users/dim/dev/temp/pgloader-issues/337/annotationtest.load"
2016-05-22T23:36:56.009969+02:00 NOTICE DROP then CREATE TABLES
2016-05-22T23:36:56.010220+02:00 NOTICE DROP TABLE IF EXISTS test_table CASCADE;
2016-05-22T23:36:56.010704+02:00 WARNING Postgres warning: table "test_table" does not exist, skipping
2016-05-22T23:36:56.010973+02:00 NOTICE CREATE TABLE test_table 
(
  id    bigserial not null,
  date  timestamptz not null,
  key1  bigint not null,
  key2  bigint not null,
  key3  bigint not null,
  key4  bigint not null,
  key5  bigint not null,
  key6  bigint
);
2016-05-22T23:40:19.627068+02:00 NOTICE CREATE INDEX idx_1505481_annotations_annotation_63868627 ON test_table (key4);
2016-05-22T23:40:19.627548+02:00 NOTICE CREATE INDEX idx_1505481_annotations_annotation_6682136 ON test_table (key2);
2016-05-22T23:40:19.628220+02:00 NOTICE CREATE INDEX idx_1505481_annotations_annotation_fbfc09f1 ON test_table (key3);
2016-05-22T23:40:19.629955+02:00 NOTICE CREATE UNIQUE INDEX idx_1505481_primary ON test_table (id);
2016-05-22T23:40:19.630198+02:00 NOTICE CREATE INDEX idx_1505481_annotations_annotation_9e5a71d1 ON test_table (key1);
2016-05-22T23:40:19.630415+02:00 NOTICE CREATE INDEX idx_1505481_annotations_annotation_89f89e85 ON test_table (key5);
2016-05-22T23:40:19.630633+02:00 NOTICE CREATE INDEX idx_1505481_annotations_annotation_58f6a228 ON test_table (key6);
2016-05-22T23:41:13.768254+02:00 NOTICE Reset sequences
2016-05-22T23:41:14.172405+02:00 NOTICE ALTER TABLE test_table ADD PRIMARY KEY USING INDEX idx_1505481_primary;
2016-05-22T23:41:14.172674+02:00 NOTICE Comments
             table name       read   imported     errors      total time       read      write
-----------------------  ---------  ---------  ---------  --------------  ---------  ---------
        fetch meta data          8          8          0          0.768s                     
           create, drop          0          0          0          0.077s                     
-----------------------  ---------  ---------  ---------  --------------  ---------  ---------
             test_table    8494733    8494733          0       1m16.956s  3m22.867s 1m16.955s
-----------------------  ---------  ---------  ---------  --------------  ---------  ---------
COPY Threads Completion          3          3          0       3m23.473s                     
 Index Build Completion          7          7          0         54.193s                     
         Create Indexes          7          7          0       6m10.545s                     
        Reset Sequences          0          1          0          0.368s                     
           Primary Keys          1          1          0          0.040s                     
           Foreign Keys          0          0          0          0.000s                     
       Install comments          0          0          0          0.000s                     
-----------------------  ---------  ---------  ---------  --------------  ---------  ---------
      Total import time    8494733    8494733          0       4m19.173s  3m22.867s 1m16.955s

That's from the Common Lisp developer environment, a SLIME prompt (see https://common-lisp.net/project/slime/ for details). No luck this time; I will try with other parameters in the coming days...

StephenChan commented 8 years ago

OK. No pressure on time for now, since I could successfully port my table with the 64MB/4MB settings. I'll be around to provide additional testing/info if needed, though.

emailgregn commented 8 years ago

2016-06-16T11:22:50.831000+02:00 DEBUG finish-batch[0] 40389 rows in 2.327000s
Heap exhausted during garbage collection: 48 bytes available, 64 requested.
 Gen StaPg UbSta LaSta LUbSt Boxed Unboxed LB   LUB  !move  Alloc  Waste   Trig    WP  GCs Mem-age
   0:     0     0     0     0     0     0     0     0     0        0     0 42949672    0   0  0.0000
   1:     0     0     0     0     0     0     0     0     0        0     0 42949672    0   0  0.0000
   2:     0     0     0     0     0     0     0     0     0        0     0 42949672    0   0  0.0000
   3: 123251 130720     0     0 15547 66168   140     0    13 2676272480 5952160 42949672    0   0  0.9950
   4:     0     0     0     0     0     0     0     0     0        0     0 42949672    0   0  0.0000
   5: 27235 35933     0     0  2747 11728    82     0   122 475119440 1884336  2000000 2690   0  0.4103
   6:     0     0     0     0  2759  1177     0     0     0 128974848     0  2000000 2581   0  0.0000
   Total bytes allocated    = 4283970000
   Dynamic-space-size bytes = 4294967296
GC control variables:
   *GC-INHIBIT* = true
   *GC-PENDING* = true
   *STOP-FOR-GC-PENDING* = false
fatal error encountered in SBCL pid 32581(tid 140736913143552):
Heap exhausted, game over.

Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb> backtrace
Backtrace:
   0: Foreign function (null), fp = 0x7fffddb6d130, ra = 0x41098a
   1: Foreign function (null), fp = 0x7fffddb6d220, ra = 0x410b7f
   2: Foreign function gc_heap_exhausted_error_or_lose, fp = 0x7fffddb6d250, ra = 0x41d0ba
   3: Foreign function gc_find_freeish_pages, fp = 0x7fffddb6d2a0, ra = 0x41d42f
   4: Foreign function (null), fp = 0x7fffddb6d2f0, ra = 0x41d96c
   5: Foreign function gc_alloc_with_region, fp = 0x7fffddb6d330, ra = 0x41dd31
   6: Foreign function (null), fp = 0x7fffddb6d370, ra = 0x41e1f0
   7: Foreign function (null), fp = 0x7fffddb6d3a0, ra = 0x40db45
   8: Foreign function scavenge, fp = 0x7fffddb6d3e0, ra = 0x40d8a4
   9: Foreign function (null), fp = 0x7fffddb6d400, ra = 0x41c0f2
  10: Foreign function collect_garbage, fp = 0x7fffddb6d490, ra = 0x420a29
  11: SB-KERNEL::COLLECT-GARBAGE
  12: (COMMON-LISP::FLET WITHOUT-GCING-BODY-60 KEYWORD::IN SB-KERNEL::SUB-GC)
  13: (COMMON-LISP::FLET SB-THREAD::EXEC KEYWORD::IN SB-KERNEL::SUB-GC)
  14: (COMMON-LISP::FLET WITHOUT-INTERRUPTS-BODY-53 KEYWORD::IN SB-KERNEL::SUB-GC)
  15: SB-KERNEL::SUB-GC
  16: Foreign function call_into_lisp, fp = 0x7fffddb6d7d0, ra = 0x42166f
  17: Foreign function maybe_gc, fp = 0x7fffddb6d800, ra = 0x41061b
  18: Foreign function interrupt_handle_pending, fp = 0x7fffddb6d830, ra = 0x413079
  19: Foreign function handle_trap, fp = 0x7fffddb6d870, ra = 0x413bb5
  20: Foreign function (null), fp = 0x7fffddb6d8b0, ra = 0x4113cf
  21: Foreign function (null), fp = 0x7fffddb6dd48, ra = 0x7ffff79ca100
  22: QMYND-IMPL::PARSE-RESULTSET-ROW
  23: QMYND-IMPL::MAP-RESULTSET-ROWS
  24: QMYND-IMPL::SEND-COMMAND-QUERY
  25: QMYND::MYSQL-QUERY
  26: (SB-PCL::FAST-METHOD PGLOADER.SOURCES::MAP-ROWS (PGLOADER.MYSQL::COPY-MYSQL))
  27: (SB-PCL::FAST-METHOD PGLOADER.SOURCES::QUEUE-RAW-DATA (PGLOADER.SOURCES::COPY COMMON-LISP::T))
  28: (COMMON-LISP::FLET BODY-FN621 KEYWORD::IN LPARALLEL.KERNEL::MAKE-CHANNELED-TASK)
  29: (COMMON-LISP::LAMBDA () KEYWORD::IN LPARALLEL.KERNEL::MAKE-CHANNELED-TASK)
  30: (SB-C::XEP (COMMON-LISP::LAMBDA () KEYWORD::IN LPARALLEL.KERNEL::MAKE-CHANNELED-TASK))
  31: LPARALLEL.KERNEL::EXEC-TASK/WORKER
  32: LPARALLEL.KERNEL::WORKER-LOOP
  33: LPARALLEL.KERNEL::%CALL-WITH-TASK-HANDLER
  34: (COMMON-LISP::LAMBDA () KEYWORD::IN LPARALLEL.KERNEL::CALL-WITH-WORKER-CONTEXT)
  35: LPARALLEL.KERNEL::CALL-WITH-WORKER-CONTEXT
  36: (COMMON-LISP::LAMBDA () KEYWORD::IN LPARALLEL.KERNEL::MAKE-WORKER-THREAD)
  37: (COMMON-LISP::LAMBDA () KEYWORD::IN BORDEAUX-THREADS::BINDING-DEFAULT-SPECIALS)
  38: (COMMON-LISP::FLET WITHOUT-INTERRUPTS-BODY-1171 KEYWORD::IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE)
  39: (COMMON-LISP::FLET SB-THREAD::WITH-MUTEX-THUNK KEYWORD::IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE)
  40: (COMMON-LISP::FLET WITHOUT-INTERRUPTS-BODY-639 KEYWORD::IN SB-THREAD::CALL-WITH-MUTEX)
  41: SB-THREAD::CALL-WITH-MUTEX
  42: SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE
  43: Foreign function call_into_lisp, fp = 0x7fffddb6eef0, ra = 0x42166f
  44: Foreign function new_thread_trampoline, fp = 0x7fffddb6ef10, ra = 0x418ae7
ldb>
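
The "Heap exhausted, game over" crash above is the SBCL runtime running out of its fixed-size Lisp heap while buffering rows from MySQL, not a PostgreSQL-side problem. Note the "Dynamic-space-size bytes = 4294967296" line in the report: the allocator died right at the 4 GB ceiling. One possible workaround (a sketch, assuming a source checkout whose Makefile honors the DYNSIZE variable) is to rebuild the binary with a larger dynamic space:

# rebuild pgloader with an 8 GB SBCL heap instead of the 4 GB default;
# DYNSIZE is in megabytes and the value here is illustrative
make DYNSIZE=8192 pgloader
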
d4be4st commented 7 years ago

Have a similar problem with

pgloader version "3.3.1c927be"
compiled with SBCL 1.3.11

It just hangs at one point

I have gone back to

pgloader version "3.3.2"
compiled with SBCL 1.3.11

And it passed nicely

This is the line that never appeared with the 3.3.1c927be version; it hangs just before it:

2016-12-13T14:12:26.945000+01:00 NOTICE Reset sequences

EDIT: spoke too soon. The second run only passed because the target database did not have any indexes left over from the previous import, which I had interrupted.

Might be connected with https://github.com/dimitri/pgloader/issues/420
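
Given that observation, it may be worth checking the target database for leftover indexes from an interrupted run before retrying. A sketch, matching the idx_<oid>_<name> naming visible in the CREATE INDEX log lines earlier in this thread:

-- list indexes whose names follow pgloader's idx_<oid>_<name> pattern
SELECT schemaname, tablename, indexname
FROM pg_indexes
WHERE indexname ~ '^idx_[0-9]+_';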

mfn commented 7 years ago

FTR, my migration got stuck too on Postgres 9.6 with

pgloader version "3.3.1"
compiled with SBCL 1.2.4.debian

I checked in postgres that it was stuck in the COMMIT phase:

=> SELECT pid, age(query_start, clock_timestamp()), usename, query
FROM pg_stat_activity
WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%'
ORDER BY query_start desc;
 pid  |       age        | usename |          query
------+------------------+---------+--------------------------
 4257 |                  | monitor | <insufficient privilege>
 4256 |                  | monitor | <insufficient privilege>
 4258 |                  | monitor | <insufficient privilege>
 5292 | -17:03:27.903784 | project  | COMMIT
(4 rows)
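
For a hang like this, a follow-up query can show whether the stuck COMMIT is waiting on a lock or on some other event (a sketch; pid 5292 is taken from the output above, and the wait_event columns require PostgreSQL 9.6 or later):

SELECT pid, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE pid = 5292;

-- and whether that backend holds or waits on any locks
SELECT locktype, relation::regclass, mode, granted
FROM pg_locks
WHERE pid = 5292;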

Hitting ^C didn't give me much information, only

$  time pgloader pgloader-project.load
2016-12-27T15:57:33.042000Z LOG Main logs in '/tmp/pgloader/pgloader.log'
2016-12-27T15:57:33.044000Z LOG Data errors in '/tmp/pgloader/'
2016-12-27T15:57:33.045000Z LOG Parsing commands from file #P"/appl/project/pgloader-project.load"
2016-12-27T16:01:40.807000Z WARNING PostgreSQL warning: identifier "idx_107511181_idx_unique_client_member_channels_channel_id_client_member_id" will be truncated to "idx_107511181_idx_unique_client_member_channels_channel_id_clie"

2016-12-27T16:59:25.794000Z ERROR Database error 22021: invalid byte sequence for encoding "UTF8": 0x00
CONTEXT: COPY comments, line 381
2016-12-27T16:59:25.821000Z ERROR Database error 22021: invalid byte sequence for encoding "UTF8": 0x00
CONTEXT: COPY comments, line 120
2016-12-27T16:59:25.822000Z ERROR Database error 22021: invalid byte sequence for encoding "UTF8": 0x00
CONTEXT: COPY comments, line 183
2016-12-27T16:59:26.022000Z ERROR Database error 22021: invalid byte sequence for encoding "UTF8": 0x00
CONTEXT: COPY comments, line 1
2016-12-27T17:11:11.085000Z WARNING PostgreSQL warning: identifier "idx_107511224_idx_comment_post_id_parent_id_number_is_or_has_unseen" will be truncated to "idx_107511224_idx_comment_post_id_parent_id_number_is_or_has_un"

^CAn unhandled error condition has been signalled:
   Interactive interrupt at #x7FFFF71C8707.

An unhandled error condition has been signalled:
   Interactive interrupt at #x7FFFF71C8707.

real    1125m47.541s
user    750m39.538s
sys     396m38.940s

The file /tmp/pgloader/pgloader.log didn't reveal anything related to this, only some warnings about overly long identifiers.

I'm in the process of trying a newer version/compiling the current one. I may also try different strategies with the current version and will report back.
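
Side note on the repeated "invalid byte sequence for encoding "UTF8": 0x00" errors above: PostgreSQL text columns cannot store NUL bytes, so those rows are rejected regardless of the hang. A possible pre-migration cleanup on the MySQL side (a sketch; the table name comes from the COPY context above, but the column name body is an assumption):

-- strip NUL bytes from the offending text column before migrating
-- (body is a hypothetical column name; adjust to the real schema)
UPDATE comments
SET body = REPLACE(body, CHAR(0), '')
WHERE INSTR(body, CHAR(0)) > 0;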

mfn commented 7 years ago

I tried the offending table (a very big table: 75 GB in MySQL and ~60 million rows) standalone and it did not hang.

I looked into the respective postgres process this time with strace -p ... (using the PID from the SQL query above) and only got:

=> SELECT pid, age(query_start, clock_timestamp()), usename, query
FROM pg_stat_activity
WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%'
ORDER BY query_start desc;
  pid  |       age        | usename |          query
-------+------------------+---------+--------------------------
 13165 |                  | monitor | <insufficient privilege>
 13164 |                  | monitor | <insufficient privilege>
 13166 |                  | monitor | <insufficient privilege>
 22792 | -00:29:16.337757 | project  | COMMIT
(4 rows)
strace -p 22792
Process 22792 attached
epoll_wait(3,

fd 3 just refers to an epoll instance:

lrwx------ 1 postgres postgres 64 Dec 28 17:06 /proc/22792/fd/3 -> anon_inode:[eventpoll]

One of the pgloader threads, however, was busy-polling in a tight loop (I identified two entries consuming a lot of CPU via htop):

strace -p 20021
Process 20021 attached
futex(0x1012d8bbb8, FUTEX_WAIT_PRIVATE, 157279619, NULL

Nothing happening here. And the other one:

strace -p 20025 2>&1 | head
Process 20025 attached
read(8, "", 4096)                       = 0
poll([{fd=8, events=POLLIN|POLLPRI}], 1, 0) = 1 ([{fd=8, revents=POLLIN}])
read(8, "", 4096)                       = 0
poll([{fd=8, events=POLLIN|POLLPRI}], 1, 0) = 1 ([{fd=8, revents=POLLIN}])
read(8, "", 4096)                       = 0
poll([{fd=8, events=POLLIN|POLLPRI}], 1, 0) = 1 ([{fd=8, revents=POLLIN}])
read(8, "", 4096)                       = 0
poll([{fd=8, events=POLLIN|POLLPRI}], 1, 0) = 1 ([{fd=8, revents=POLLIN}])
read(8, "", 4096)                       = 0

Basically an endless loop from what I saw. fd 8 refers to:

lrwx------ 1 mfischer mfischer 64 Dec 28 17:08 /proc/20025/fd/8 -> socket:[179374]

And:

lsof|grep 179374
pgloader  20021         user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20022   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20024   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20025   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20026   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20027   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20028   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20029   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20030   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20031   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20032   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20033   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20034   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20035   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20036   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20037   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20038   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20039   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20040   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20041   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20042   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20043   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20044   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20045   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)
pgloader  20021 20046   user    8u     IPv4             179374       0t0        TCP project-pgsql02:51304->a.b.c.d:mysql (CLOSE_WAIT)

Btw, compared to my other comment https://github.com/dimitri/pgloader/issues/337#issuecomment-269461637 , this one was tested with 3.3.2
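
The CLOSE_WAIT state on all of those threads means the MySQL server had already closed its side of connection 51304 while pgloader kept polling the dead socket, which matches the read(8, "", 4096) = 0 loop above. One thing worth checking (a diagnostic sketch, not a confirmed root cause) is whether a server-side timeout cuts off long-running transfers:

-- on the MySQL server
SHOW VARIABLES LIKE '%timeout%';
-- if net_write_timeout / wait_timeout look low, raise them for the
-- duration of the migration (values here are illustrative)
SET GLOBAL net_write_timeout = 3600;
SET GLOBAL wait_timeout = 86400;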

mfn commented 7 years ago

Here's the config I used this time:

load database
 from mysql://...
 into postgresql://...

WITH
 batch rows = 1000

CAST
 type varchar to varchar,
 type bigint with extra auto_increment to bigserial drop typemod,
 type bigint to bigint drop typemod
;

I had to use batch rows = 1000 because otherwise I would run into https://github.com/dimitri/pgloader/issues/487.
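
Besides batch rows, the batch size setting can bound how much memory a single batch may use (a sketch; option names are from the pgloader reference, and availability may differ between versions):

WITH
 batch rows = 1000,
 batch size = 16MB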

mfn commented 7 years ago

As others have said, index creation is not involved in this hang. In my tests I still disable it, because that gets me into the hung state faster.

When performing the migration on multiple tables, I always see it hang on the first "large" table. When I iteratively remove those large tables, it hangs on the next one, and so on.

The "large" tables have rows like 11 Mio, 8 Mio, 60 Mio, 27 Mio.

Tables which don't exhibit this problem are "smaller", around 4 million or 1 million rows.

Also worth mentioning explicitly: when pgloader is hanging and I stop it with ^C, the table itself appears to be properly imported.

poppingtonic commented 7 years ago

I see the same problem when loading a public dataset, RxNorm. The zip file contains MySQL load scripts and a wrapper shell script. To support faster writes through COPY, I use the following settings:

work_mem = 256MB
maintenance_work_mem = 1GB

The largest table is copied successfully, but a smaller one, rxnsty, hangs:

SELECT pid, age(query_start, clock_timestamp()), usename, query
FROM pg_stat_activity
WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%'
ORDER BY query_start desc;
 pid  |       age        |   usename   | query
------+------------------+-------------+--------
 5326 | -20:16:32.850028 | terminology | COMMIT
(1 row)

Edit: Loading the problematic table individually using:

LOAD DATABASE
     FROM mysql://root:password@localhost:3306/rxnorm
     INTO postgres://terminology:terminology@localhost:5432/rxnorm_loinc

 WITH data only,
      workers = 8, concurrency = 1,
      batch rows = 1000,
      include no drop,
      create no tables

  SET maintenance_work_mem to '128MB',
      work_mem to '12MB'

 INCLUDING ONLY TABLE NAMES MATCHING 'RXNSTY';

worked out quite well, but I still don't know where the problem was.
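
If loading the problematic table on its own works, the inverse run for the rest of the database can use an EXCLUDING clause (a sketch, reusing the connection strings and options from the command above):

LOAD DATABASE
     FROM mysql://root:password@localhost:3306/rxnorm
     INTO postgres://terminology:terminology@localhost:5432/rxnorm_loinc

 WITH data only,
      workers = 8, concurrency = 1,
      batch rows = 1000,
      include no drop,
      create no tables

  SET maintenance_work_mem to '128MB',
      work_mem to '12MB'

 EXCLUDING TABLE NAMES MATCHING 'RXNSTY';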