cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

stability: client hangs after one node hits disk errors #7882

Closed ramanala closed 5 years ago

ramanala commented 8 years ago

I start a 3-node cluster using a small script like this:

cockroach start --store=data0 --log-dir=log0 &
cockroach start --store=data1 --log-dir=log1 --port=26258 --http-port=8081 --join=localhost:26257 --join=localhost:26259 &
cockroach start --store=data2 --log-dir=log2 --port=26259 --http-port=8082 --join=localhost:26257 --join=localhost:26258 &
sleep 5

At this point, I see that the cluster is set up correctly and I can start inserting and reading data. So far so good.
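For reference, here is a sketch of the kind of setup implied here (not the exact statements I ran; it assumes the database mydb used by the client code below has already been created, for example with the cockroach sql shell, and that the SQL dialect accepts these statements):

import psycopg2

# Hypothetical smoke test: create a table, insert a row, read it back.
# Assumes the database "mydb" referenced by the client code below exists.
conn = psycopg2.connect(host="localhost", port="26257", database="mydb",
                        user="xxx", connect_timeout=10)
conn.set_session(autocommit=True)
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS mytable (id INT PRIMARY KEY, v STRING);")
cur.execute("INSERT INTO mytable VALUES (1, 'hello');")
cur.execute("SELECT * FROM mytable;")
print(cur.fetchall())
cur.close()
conn.close()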

Now, during this startup, node2 (the node with data2 as its store) hits a disk-full error (ENOSPC) or an EIO error when trying to append to an SSTable file and fails to start. At this point, the expectation is that the cluster continues to accept client connections and make progress on transactions (since a majority of nodes is still available). This is my client code:

import time

import psycopg2

server_ports = ["26257", "26258", "26259"]
for port in server_ports:
    try:
        # Try each node in turn.
        conn = psycopg2.connect(host="localhost", port=port, database="mydb",
                                user="xxx", connect_timeout=10)
        conn.set_session(autocommit=True)
        cur = conn.cursor()
        cur.execute("SELECT * FROM mytable;")
        record = cur.fetchall()
        print record
        cur.close()
        conn.close()
    except Exception as e:
        print 'Exception: ' + str(e)
        time.sleep(3)

I see that the client successfully connects to node0 but then hangs (in the execute statement) for at least 30 seconds, after which I abort the thread running the code above.
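(For what it is worth, connect_timeout only bounds establishing the connection; the execute call itself has no client-side deadline. Below is a sketch of one way I could bound the blocking call from the client side, using psycopg2's connection.cancel() from a timer thread; the 30-second figure just mirrors the wait mentioned above.)

import threading
import psycopg2

conn = psycopg2.connect(host="localhost", port="26257", database="mydb",
                        user="xxx", connect_timeout=10)
conn.set_session(autocommit=True)
cur = conn.cursor()

# connection.cancel() may be called from another thread; use it as a watchdog
# so the execute below cannot block for much more than 30 seconds.
watchdog = threading.Timer(30, conn.cancel)
watchdog.start()
try:
    cur.execute("SELECT * FROM mytable;")
    print(cur.fetchall())
except psycopg2.Error as e:
    print('query failed or was cancelled: ' + str(e))
finally:
    watchdog.cancel()
    conn.close()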

I am not sure what is going wrong here. Shouldn't the client be able to talk to node0 or node1 regardless of node2's failure? I do not know which node was the leader for this table before node2 crashed, but even if node2 was the old leader, shouldn't the cluster automatically elect a new leader and continue to make progress?

Moreover, this is not a one-off case. It happens when one node hits a disk error (EIO, EDQUOT, ENOSPC) at various points in time and while accessing different files, such as the MANIFEST, sst, and dbtmp files.

Expectation: Whatever error a single node encounters, as long as a majority of nodes is available, the cluster should continue to make progress (agreeing on a new leader if the failed node was the old leader).

Observed: When one node hits a disk error, sometimes (depending on the error type and the file being accessed when the error happened) the cluster becomes unusable, as clients block after connecting to one of the remaining nodes that did not encounter any errors.

I have attached the server logs. I would be happy to provide any further details.

logs.tar.gz

ramanala commented 8 years ago

Please note that subsequent requests go through fine if retried after waiting a few tens of seconds, but the original request blocks forever. So this might have something to do with how the client handles connections and NOT be a server-side problem. Updating the title to reflect this.

petermattis commented 7 years ago

A lot of changes have been made to lease handling (which is likely what caused the tens-of-seconds wait before success), but handling of disk-full errors deserves further attention in 1.1.

petermattis commented 7 years ago

@dianasaur323 We might want to schedule time to look at this in 1.2. Enhancements in this area are now beyond reach for 1.1 given the current point in the cycle.

dianasaur323 commented 7 years ago

thanks for the heads up, @petermattis. Do you think this merits a t-shirt size in Airtable, or should I just allocate time for this to be addressed as part of bugfixing?

petermattis commented 7 years ago

@dianasaur323 This is more than just a bug fix, as there are many code paths to inspect and a bit of design work to be done. At least an M-size project. Probably an L.

dianasaur323 commented 7 years ago

Great, thanks @petermattis. I added a placeholder in Airtable. That being said, this squarely falls into @awoods187's space, correct, @nstewart?

nstewart commented 7 years ago

That's correct.

dianasaur323 commented 7 years ago

kk, just assigned him in Airtable as well.

tbg commented 6 years ago

Closing this in favor of https://github.com/cockroachdb/cockroach/issues/19656#issuecomment-390660438, which proposes a test plan that includes the scenario outlined here.

tbg commented 6 years ago

Decided to reopen this for better visibility.

tbg commented 5 years ago

In addressing this, we should also look at the case in which the log directory is on a separate partition and fills up/becomes otherwise unavailable. See https://github.com/cockroachdb/cockroach/issues/7646

tbg commented 5 years ago

Some more discussion in https://github.com/cockroachdb/cockroach/issues/8473.

awoods187 commented 5 years ago

I think we should not die when we run out of disk space. Instead, we should stop writing to disk and allow users to add more disk. Thoughts?

petermattis commented 5 years ago

@awoods187 That behavior is clearly desirable, though it is much more difficult to achieve than it is to express. To give some sense of the difficulty, removing a replica from a node involves writing a small amount of data to disk to indicate that the replica was removed before the replica's data is actually deleted. In other words, we need to write to disk in order to free up space on disk.
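To make that chicken-and-egg concrete, here is a toy sketch (not CockroachDB code; the names and the reserve figure are made up) of a store that refuses ordinary writes once free space dips below a small reserve, precisely so that the tiny tombstone write needed to remove a replica can still succeed:

import errno
import os
import shutil

RESERVE_BYTES = 64 << 20  # toy headroom figure, not a CockroachDB constant

def free_bytes(path):
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

def put(store_dir, key, value):
    # Ordinary writes are refused once the store dips into its reserve, so the
    # node degrades to read-only instead of dying on a hard ENOSPC.
    if free_bytes(store_dir) < RESERVE_BYTES:
        raise OSError(errno.ENOSPC, "store out of space for normal writes")
    with open(os.path.join(store_dir, key), "w") as f:
        f.write(value)

def remove_replica(store_dir, replica_dir):
    # Even "freeing space" starts with a write: a small tombstone recording
    # that the replica was removed must be durable before its data is deleted,
    # or a crash mid-removal could resurrect a half-deleted replica. The
    # reserve maintained by put() is what guarantees this tiny write fits.
    tombstone = os.path.join(store_dir, os.path.basename(replica_dir) + ".tombstone")
    with open(tombstone, "w") as f:
        f.write("removed\n")
        f.flush()
        os.fsync(f.fileno())
    shutil.rmtree(replica_dir)  # only now is the bulk of the space reclaimed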