discoproject / disco

a Map/Reduce framework for distributed computing
http://discoproject.org
BSD 3-Clause "New" or "Revised" License

GC failure after removing a dead node #638

Closed: gilessbrown closed this issue 7 years ago

gilessbrown commented 8 years ago

As described in a disco-dev [Google group](https://groups.google.com/forum/#!topic/disco-dev/yrrSwexLWkQ) post, I had a problem where GC was failing with an "unable to get tag" message in the logs.

The tag for which this message was produced contained a URL for a blob with a replica on a node that had died and been removed from the cluster.

I wrote a script to manually remove the blob references to the dead node and the GC was able to run again.
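
The original script is not attached here. As a rough illustration of the idea only, the filtering step could look something like the sketch below. It assumes a tag's `urls` field is a list of replica sets, each a list of `disco://host/...` URLs for the same blob; the function name, the example hostnames, and the blob paths are all hypothetical, and reading or writing the tag itself is deliberately left out.

```python
from urllib.parse import urlsplit


def drop_dead_replicas(replica_sets, dead_host):
    """Remove replica URLs that point at dead_host.

    replica_sets is assumed to be a list of lists of URLs, one inner list
    per blob. Returns (cleaned, lost): cleaned keeps every blob that still
    has at least one replica elsewhere; lost holds the original replica
    lists of blobs whose only replicas were on the dead node.
    """
    dead = dead_host.lower()
    cleaned, lost = [], []
    for replicas in replica_sets:
        # urlsplit().hostname is lowercased, so compare in lowercase.
        alive = [u for u in replicas if urlsplit(u).hostname != dead]
        if alive:
            cleaned.append(alive)
        else:
            # No surviving replica: report it instead of silently dropping.
            lost.append(list(replicas))
    return cleaned, lost


# Hypothetical example data, not taken from a real cluster.
urls = [
    ["disco://node1/ddfs/vol0/blob/aa/x", "disco://node3/ddfs/vol0/blob/aa/x"],
    ["disco://node3/ddfs/vol0/blob/bb/y"],
]
cleaned, lost = drop_dead_replicas(urls, "node3")
```

Blobs that end up in `lost` had no replica outside the dead node, so they need a separate decision rather than an automatic rewrite of the tag.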

The GC should not require this manual step to recover from the removal of a node, right?

gilessbrown commented 7 years ago

OK, so the right answer here is that I should have blacklisted the node from DDFS and then waited until the node was clear before removing it, as described here: http://disco.readthedocs.io/en/latest/howto/administer.html#blacklisting-a-ddfs-node