Fix migration startup race condition

vncoelho commented 6 years ago

Hi, @adrienmo,

Sometime, unexpectedly, the neo-scan api and sync dockers are crashing here in our experiments. I believe it is because we are making several requests and docker were not fully ready (in general it happens when no block was yet synced). What do you think? One of the logs is:

2018-10-16 15:07:02.794 [info] [neoscan@127.0.0.1] 0.2 blocks/s, 7280.0 download avg time, 8226.0 insert avg time
2018-10-16 15:07:07.795 [info] [neoscan@127.0.0.1] 0.2 blocks/s, 4328.0 download avg time, 3143.0 insert avg time
2018-10-16 15:07:12.796 [info] [neoscan@127.0.0.1] 0.2 blocks/s, 4166.0 download avg time, 11281.0 insert avg time
2018-10-16 15:07:13.324 [error] [neoscan@127.0.0.1] Postgrex.Protocol (#PID<0.1868.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
2018-10-16 15:07:13.324 [error] [neoscan@127.0.0.1] Postgrex.Protocol #PID<0.1868.0> could not cancel backend: tcp connect: connection refused - :econnrefused
2018-10-16 15:07:13.325 [error] [neoscan@127.0.0.1] #PID<0.20541.21> running NeoscanWeb.Endpoint (cowboy_protocol) terminated
Server: neoscan.neocompiler.io:80 (http)
Request: GET /api/main_net/v1/get_height
** (exit) an exception was raised:
    ** (DBConnection.ConnectionError) tcp recv: closed
        (ecto) lib/ecto/adapters/postgres/connection.ex:108: Ecto.Adapters.Postgres.Connection.execute/4
        (ecto) lib/ecto/adapters/sql.ex:256: Ecto.Adapters.SQL.sql_call/6
        (ecto) lib/ecto/adapters/sql.ex:436: Ecto.Adapters.SQL.execute_or_reset/7
        (ecto) lib/ecto/repo/queryable.ex:133: Ecto.Repo.Queryable.execute/5
        (ecto) lib/ecto/repo/queryable.ex:37: Ecto.Repo.Queryable.all/4
        (ecto) lib/ecto/repo/queryable.ex:70: Ecto.Repo.Queryable.one/4
        (neoscan_web) lib/neoscan_web/api.ex:210: NeoscanWeb.Api.get_height/0
        (neoscan_web) lib/neoscan_web/controllers/api_controller.ex:120: NeoscanWeb.ApiController.get_height/2
2018-10-16 15:07:13.330 [error] [neoscan@127.0.0.1] Postgrex.Protocol (#PID<0.1868.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (postgres:5432): connection refused - :econnrefused
2018-10-16 15:07:13.424 [error] [neoscan@127.0.0.1] GenServer NeoscanSync.Syncer terminating
** (DBConnection.ConnectionError) tcp recv: closed
    (ecto) lib/ecto/adapters/postgres/connection.ex:108: Ecto.Adapters.Postgres.Connection.execute/4
    (ecto) lib/ecto/adapters/sql.ex:256: Ecto.Adapters.SQL.sql_call/6
    (ecto) lib/ecto/adapters/sql.ex:436: Ecto.Adapters.SQL.execute_or_reset/7
    (ecto) lib/ecto/repo/queryable.ex:133: Ecto.Repo.Queryable.execute/5
    (ecto) lib/ecto/repo/queryable.ex:37: Ecto.Repo.Queryable.all/4
    (ecto) lib/ecto/repo/queryable.ex:70: Ecto.Repo.Queryable.one/4
    (neoscan) lib/neoscan/blocks/blocks.ex:93: Neoscan.Blocks.get_max_index/0
    (neoscan_sync) lib/neoscan_sync/syncer.ex:29: NeoscanSync.Syncer.get_available_block_index_range/0
Last message: :sync
2018-10-16 15:07:13.425 [error] [neoscan@127.0.0.1] Postgrex.Protocol (#PID<0.1851.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
2018-10-16 15:07:13.425 [error] [neoscan@127.0.0.1] Postgrex.Protocol #PID<0.1851.0> could not cancel backend: tcp connect: connection refused - :econnrefused
2018-10-16 15:07:13.426 [error] [neoscan@127.0.0.1] Postgrex.Protocol (#PID<0.1870.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
2018-10-16 15:07:13.427 [error] [neoscan@127.0.0.1] Postgrex.Protocol #PID<0.1870.0> could not cancel backend: tcp connect: connection refused - :econnrefused
2018-10-16 15:07:13.427 [error] [neoscan@127.0.0.1] Postgrex.Protocol (#PID<0.1876.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
2018-10-16 15:07:13.428 [error] [neoscan@127.0.0.1] Postgrex.Protocol #PID<0.1876.0> could not cancel backend: tcp connect: connection refused - :econnrefused
2018-10-16 15:07:13.429 [error] [neoscan@127.0.0.1] Postgrex.Protocol (#PID<0.1877.0>) disconnected: ** (DBConnection.ConnectionError) tcp recv: closed
2018-10-16 15:07:13.429 [error] [neoscan@127.0.0.1] Postgrex.Protocol #PID<0.1877.0> could not cancel backend: tcp connect: connection refused - :econnrefused
2018-10-16 15:07:13.429 [info] [neoscan@127.0.0.1] Application neoscan_sync exited: shutdown
{"Kernel pid terminated",application_controller,"{application_terminated,neoscan_sync,shutdown}"}
Kernel pid terminated (application_controller) ({application_terminated,neoscan_sync,shutdown})

Crash dump is being written to: erl_crash.dump...done

adrienmo commented 6 years ago

Hi @vncoelho are both of containers crashing at the same time? It seems postgres becomes not available. Is it just at startup or can it happen after a long period of running well?

vncoelho commented 6 years ago

API container

Loading neoscan..
Starting dependencies..
Starting repos..
Running migrations for neoscan
2018-10-27 13:35:29.907 [info] [nonode@nohost] == Running Neoscan.Repo.Migrations.Counters.change/0 forward
2018-10-27 13:35:29.909 [info] [nonode@nohost] create table counters_cached
{"init terminating in do_boot",{#{'__exception__'=>true,'__struct__'=>'Elixir.Postgrex.Error',connection_id=>80,message=>nil,postgres=>#{code=>duplicate_table,file=><<"heap.c">>,line=><<"1067">>,message=><<"relation \"counters_cached\" already exists">>,pg_code=><<"42P07">>,routine=><<"heap_create_with_catalog">>,severity=><<"ERROR">>,unknown=><<"ERROR">>}},[{'Elixir.Ecto.Adapters.SQL','query!',5,[{file,"lib/ecto/adapters/sql.ex"},{line,200}]},{'Elixir.Ecto.Adapters.Postgres','-execute_ddl/3-fun-0-',4,[{file,"lib/ecto/adapters/postgres.ex"},{line,96}]},{'Elixir.Enum','-reduce/3-lists^foldl/2-0-',3,[{file,"lib/enum.ex"},{line,1925}]},{'Elixir.Ecto.Adapters.Postgres',execute_ddl,3,[{file,"lib/ecto/adapters/postgres.ex"},{line,96}]},{'Elixir.Ecto.Migration.Runner','-flush/0-fun-1-',2,[{file,"lib/ecto/migration/runner.ex"},{line,104}]},{'Elixir.Enum','-reduce/3-lists^foldl/2-0-',3,[{file,"lib/enum.ex"},{line,1925}]},{'Elixir.Ecto.Migration.Runner',flush,0,[{file,"lib/ecto/migration/runner.ex"},{line,102}]},{timer,tc,2,[{file,"timer.erl"},{line,181}]}]}}
init terminating in do_boot ({,[{Elixir.Ecto.Adapters.SQL,query!,5,[{_},{_}]},{Elixir.Ecto.Adapters.Postgres,-execute_ddl/3-fun-0-,4,[{_},{_}]},{Elixir.Enum,-reduce/3-lists^foldl/2-0-,3,[{_},{_}]},{El

Postgres:

/usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/*

waiting for server to shut down....2018-10-27 13:35:21.180 UTC [43] LOG:  received fast shutdown request
2018-10-27 13:35:21.182 UTC [43] LOG:  aborting any active transactions
2018-10-27 13:35:21.185 UTC [43] LOG:  worker process: logical replication launcher (PID 50) exited with exit code 1
2018-10-27 13:35:21.185 UTC [45] LOG:  shutting down
2018-10-27 13:35:21.221 UTC [43] LOG:  database system is shut down
 done
server stopped

PostgreSQL init process complete; ready for start up.

2018-10-27 13:35:21.320 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2018-10-27 13:35:21.320 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2018-10-27 13:35:21.330 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2018-10-27 13:35:21.355 UTC [61] LOG:  database system was shut down at 2018-10-27 13:35:21 UTC
2018-10-27 13:35:21.365 UTC [1] LOG:  database system is ready to accept connections
2018-10-27 13:35:29.909 UTC [80] ERROR:  relation "counters_cached" already exists
2018-10-27 13:35:29.909 UTC [80] STATEMENT:  CREATE TABLE "counters_cached" ("name" varchar(255), "ref" bytea, "value" integer NOT NULL, PRIMARY KEY ("name", "ref"))

adrienmo commented 6 years ago

@vncoelho there is a known issue, if postgres is empty and both container start at the same time, both will try to run the migrations, it is possible it might put the db in a weird shape and be unusable. But this is for the starting case. It seems it is what shows your latest post.

For the initial error you posted, it seems the system was running fine, so I expect migration were done properly and suddenly db connection dropped?

vncoelho commented 6 years ago

I think that you are right, @adrienmo. In general, this error happens in the start-up of a Private Network, as you imagined.

Maybe we could insert a delay in the start up of Both Containers (api and sync) for until postgres has, at least, one block. What do you think?

adrienmo commented 6 years ago

@vncoelho I think best would be to make the migration safe for race condition, or maybe only make one container running the migration, I will check what are the options

vncoelho commented 6 years ago

Hey, @adrienmo

Thanks, no hurry on this anyway. It is just a minnor incident in these specific conditions. In addition, it seams that there is also a problem on Docker sometimes when it creates a lot of containers with docker-compose. I reported on docker-compose but they refused the problem and asked me to direct to the Docker repository...

Let's see how it flows.

Sometimes we are needing to use service docker reestart in order to really get a connection between nodes.

We keep in touch,

CityOfZion / neo-scan

Fix migration startup race condition #358