HazyResearch / deepdive

DeepDive
deepdive.stanford.edu

Greenplum occasionally hangs with DD8 #514

Closed raphaelhoffmann closed 8 years ago

raphaelhoffmann commented 8 years ago

We are running DD8 with Greenplum on Ubuntu. Over the last few weeks, @zifeishan and I observed that in a small number of cases (maybe 1 out of 50) Greenplum gets stuck when running a DD8 application.

When that happens, pg_stat_activity shows a running query that is not in a waiting state, yet there is no I/O and no CPU activity. The query cannot be killed with pg_cancel_backend or pg_terminate_backend; one has to kill -9 the query process. Sometimes a second process also holds an exclusive lock, as shown by:

SELECT locktype, DATABASE, pid, MODE, GRANTED, relname, gp_segment_id
    FROM pg_locks l
    LEFT OUTER JOIN pg_class t ON l.relation = t.oid
    ORDER BY MODE;

We need to run kill -9 on the process of that second query as well. Usually that returns the database to a working state; in a few rare cases, the database was left corrupted.
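For anyone who wants to script the "find the second process" step, here is a minimal sketch in Python. It assumes the rows of the pg_locks query above have already been fetched as dicts keyed by lower-case column name. The helper name exclusive_lock_holders, the choice of which lock modes count as "exclusive", and the sample rows are all made up for illustration and are not from our setup.

```python
# Hypothetical helper: given rows from the pg_locks query (as dicts),
# list the pids that hold a granted exclusive lock, grouped by relation.
# Assumption: "exclusive" means one of the two strongest PostgreSQL modes.
STRONG_MODES = {"ExclusiveLock", "AccessExclusiveLock"}


def exclusive_lock_holders(rows, exclude_pids=()):
    """Return {pid: set of relnames} for granted strong locks."""
    holders = {}
    for row in rows:
        if not row["granted"]:
            continue  # lock is only requested, not held
        if row["mode"] not in STRONG_MODES:
            continue  # weaker modes are not the ones we are after
        if row["pid"] in exclude_pids:
            continue  # e.g. the stuck query that was already killed
        holders.setdefault(row["pid"], set()).add(row["relname"])
    return holders


# Made-up sample rows, shaped like the query output.
sample_rows = [
    {"pid": 101, "mode": "RowExclusiveLock", "granted": True, "relname": "my_table"},
    {"pid": 202, "mode": "AccessExclusiveLock", "granted": True, "relname": "my_table"},
    {"pid": 303, "mode": "ExclusiveLock", "granted": False, "relname": "other_table"},
]
print(exclusive_lock_holders(sample_rows))
```

Passing the pid of the already killed query in exclude_pids leaves only the other lock holders.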

We believe that this situation usually happens with COPY FROM STDIN queries; it looks like Greenplum continues to wait for the sender of the data, but the sender is idle.
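To illustrate the kind of wait we suspect, here is a small Python sketch using a plain POSIX pipe. This is not Greenplum or psql code, only an analogy: a reader on a pipe sees neither data nor end-of-file while the writer is idle but still has its end open, so it has no way to know that nothing more is coming. The function names read_state and demo are made up for this example.

```python
import os
import select


def read_state(fd, timeout=0.2):
    """Classify the read end of a pipe as 'data', 'eof', or 'waiting'."""
    ready, _, _ = select.select([fd], [], [], timeout)
    if not ready:
        # Nothing to read and no EOF: the writer is idle but still open.
        return "waiting"
    chunk = os.read(fd, 4096)
    return "data" if chunk else "eof"


def demo():
    r, w = os.pipe()
    states = []
    os.write(w, b"1\tfoo\n")        # sender delivers one row
    states.append(read_state(r))    # the row is available
    states.append(read_state(r))    # sender is idle, pipe still open
    os.close(w)                     # sender finally closes its end
    states.append(read_state(r))    # only now does the reader see EOF
    os.close(r)
    return states


print(demo())
```

The middle state is the one that matches what we see: the receiver is neither blocked on a lock nor doing work, it is simply waiting for input that never arrives.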

A recent commit on the Greenplum master branch appears to make it easier to kill queries in such a state: https://github.com/greenplum-db/gpdb/commit/63dd5a6c7202d3458773d200074d1edeaf1b15b7. However, the actual bug must be on the sender's side. One hypothesis (by @netj) is that mkmimo is not handling some error conditions correctly.

alldefector commented 8 years ago

@netj: we are still living with this heisenbug... Does it look like it could be mkmimo?

netj commented 8 years ago

@alldefector Nope. Back then @zifeishan and I confirmed this was happening on a data path that involves just gpdb's psql.

netj commented 8 years ago

This no longer seems to be the issue.