GiraffaFS / giraffa

Giraffa FileSystem (Slack: giraffa-fs.slack.com)
https://giraffa.ci.cloudbees.com
Apache License 2.0

Failure of underlying HDFS NameNode #185

Open dsalychev opened 9 years ago

dsalychev commented 9 years ago

I'm wondering whether a failure of the underlying HDFS NameNode could crash the underlying HBase, and whether together they would cause Giraffa to stop working. Am I right? Or is there a special trick?

pjeli commented 9 years ago

Today that is an issue. In the future we will remove the NameNode entirely.

In Giraffa, the role of the NameNode is purely that of a "Block Manager". It only needs to know the id, length, and genStamp of the blocks reported to it by the DataNodes. It aggregates this block meta info and acts as a central authority for the RegionServer(s) to talk to. In the end, "Block Managers" store no persistent data, since all of their block metadata is restored from DataNode block reports; the file system metadata itself is stored in HBase.
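For illustration, a minimal sketch of the per-block record such a "Block Manager" might keep (class and field names are assumptions, not the actual Giraffa code); note that everything in it can be rebuilt from block reports:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;

/** Hypothetical per-block record a "Block Manager" keeps in memory only. */
public class BlockMeta {
  private final long blockId;   // block id reported by the DataNodes
  private long numBytes;        // current block length
  private long genStamp;        // generation stamp guarding replica consistency
  private final Set<String> replicaNodes = new ConcurrentSkipListSet<>();

  public BlockMeta(long blockId, long numBytes, long genStamp) {
    this.blockId = blockId;
    this.numBytes = numBytes;
    this.genStamp = genStamp;
  }

  /** Apply one entry of a DataNode block report; no persistent state needed. */
  public synchronized void updateFromReport(String dataNode, long len, long gs) {
    if (gs >= genStamp) {       // a newer generation stamp wins
      genStamp = gs;
      numBytes = len;
    }
    replicaNodes.add(dataNode);
  }
}
```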

We have several ideas about what should happen in the event of a "Block Manager" death; a more complex solution is to allow multiple active-active "Block Managers". The basic idea is that RegionServers should be able to map "Block Managers" to their respective regions.

As for HBase's own data, it will be handled specially through a new notion of "Super Blocks": blocks that must be loaded first to ensure the file system can start. "Super Blocks" will need to exist on all Block Managers, as they comprise all the data that used to live under "/hbase" in HDFS.

dsalychev commented 9 years ago

I guess I now understand a few of the "hot points" in development.

I think that it's a good place to repeat Konstantin's answer to the same question (I asked him via email):

Good question. Giraffa's design does not assume any SPOF at all, even though the current implementation does rely on the single NameNode.
In the current implementation we use the NameNode as a block manager. The idea is to replace it with a block management layer: each RegionServer will have a local block manager (running on the same node), so everything will be partitioned.
We also plan to eliminate HMaster, which is a SPOF of HBase.

I'd like to keep all information about the SPOF problem in Giraffa's context in one place. It might be a wiki or something similar. Would that be a good idea?

shvachko commented 9 years ago

Yes, this is one of the "hot points" for Giraffa. But we first need to finish bootstrapping (issue #16), which will make HBase stop relying on HDFS for storing its files. Then the NameNode's role in Giraffa will be reduced to block management only, and we can replace it with the distributed layer. The two efforts can go in parallel, of course, but the dependency still remains.

dsalychev commented 9 years ago

My idea is to build a layer of distributed BMs on top of a distributed hash table. Each BM server, hosted alongside an HBase RegionServer, represents a node in the DHT. Right now, BlockManagementAgent relies on DistributedFileSystem to work with blocks. I think it's possible to update BlockManagementAgent to use some DHT client to achieve this:

When the BlockManagementAgent needs to access a block b1, it contacts the respective BMServer referring to a file named “gblk_b1”. (c) "Outdated design doc"

substitute with

When the BlockManagementAgent needs to access a block b1, it asks the DHT client to retrieve the block using dhtClient.get(hash("gblk_b1")).
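A hedged sketch of that substitution (DhtClient is a hypothetical interface, not an existing library; the point is only that the block's file-style name is hashed into a DHT key that routes to the owning BM server):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DhtBlockLookup {
  /** Hypothetical DHT client a BlockManagementAgent could use. */
  public interface DhtClient {
    byte[] get(byte[] key);   // routes the key to the responsible BM server
  }

  private final DhtClient dhtClient;

  public DhtBlockLookup(DhtClient dhtClient) {
    this.dhtClient = dhtClient;
  }

  /** Fetch block b1 by hashing its file-style name "gblk_b1". */
  public byte[] getBlock(long blockId) throws NoSuchAlgorithmException {
    byte[] key = MessageDigest.getInstance("SHA-1")
        .digest(("gblk_" + blockId).getBytes(StandardCharsets.UTF_8));
    return dhtClient.get(key);
  }
}
```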

Pros:

Cons:

Any thoughts?

shvachko commented 9 years ago

Good you asked, Dmitry.

We actually looked at a few hashing approaches for a distributed implementation of the flat block id namespace, like

We decided to start with the NameNode and gradually evolve it into the Giraffa BlockManager (GBM) by adding the new functionality that is needed and removing what Giraffa does not need. Here are some thoughts:

Block Management Functions

There are three groups of functions that GBMs need to support:

  1. API to create, delete, and update blocks from a GBM client
  2. Internal block maintenance: block replication, enforcing the block placement policy, and handling corrupt or missing blocks.
  3. DataNode maintenance: processing DataNode block reports, heartbeats, and registrations.

The current NameNode does not expose a block creation and deletion API (1), but it does the other two, (2) and (3), quite well. So we implemented (1), currently for a single NameNode.
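For discussion, a sketch of what the group-(1) client API could look like (the interface and signatures are assumptions, not the actual Giraffa or HDFS API):

```java
import java.io.IOException;

/** Hypothetical group-(1) client API of a Giraffa Block Manager (GBM). */
public interface GbmClientProtocol {
  /** Atomically allocate a new empty block and return its id. */
  long createBlock(short replication) throws IOException;

  /** Delete a block; triggers replica removal on the DataNodes. */
  void deleteBlock(long blockId) throws IOException;

  /** Update mutable block metadata, e.g. its length and generation stamp. */
  void updateBlock(long blockId, long newLength, long newGenStamp)
      throws IOException;
}
```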

Current Implementation of Block Creation

Currently, block creation is done in two steps:

  1. BlockManagementAgent creates a temporary file and allocates one empty block in it.
  2. Then it immediately renames the temp file to a new name, which contains the block id.

Giraffa block deletion is thus implemented as removal of the corresponding file, which triggers the actual block replica removal on the DataNodes.

One of the immediate improvements would be adding a new API that merges the two block creation steps into one atomic operation, as sketched below.
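A sketch of the difference, with BlockManagerApi standing in for the NN-side calls (all names here are assumptions for illustration):

```java
import java.io.IOException;

/** Hypothetical NN-side calls; names are made up for this sketch. */
interface BlockManagerApi {
  long createFileWithBlock(String path) throws IOException; // returns block id
  void rename(String src, String dst) throws IOException;
  long allocateBlockAtomic() throws IOException;            // the proposed merge
}

public class BlockCreationFlows {
  /** Today's flow: two calls, with a window of inconsistency between them. */
  static long twoStepCreate(BlockManagerApi nn) throws IOException {
    String tmp = "/grfa/tmp_" + System.nanoTime();
    long blockId = nn.createFileWithBlock(tmp);  // step 1: temp file + one block
    nn.rename(tmp, "/grfa/gblk_" + blockId);     // step 2: encode id in the name
    return blockId;
  }

  /** Proposed flow: both steps performed by the NN in a single atomic call. */
  static long atomicCreate(BlockManagerApi nn) throws IOException {
    return nn.allocateBlockAtomic();
  }
}
```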

Block Creation with Multiple NameNodes as GBM Servers

We can use multiple NameNodes so that each of them maintains its own subset of blocks. This is like HDFS federation, where several NameNodes share the same pool of DataNodes. We thought of using the Coordination Engine of HADOOP-10641 for block creation in the following way:

That way a BlockManagementAgent can obtain new blocks from a local GBM, represented by a local NN. It also provides locality for the blocks belonging to the files stored in the local region. After a restart some blocks may become remote; we may need to build a relocation utility for such blocks in the future.

So I see two immediate tasks in the direction of evolving NameNode into GBM:

  1. Define the GBM client API, and implement atomic block creation on the NameNode.
  2. Implement a multiple-NameNode setup to support disjoint collections of blocks. The initial implementation can be as naive as using large fixed ranges of block IDs per NN (see the sketch below), until we introduce the CE into Giraffa.
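A sketch of that naive partitioning (class names and range size are assumptions for illustration):

```java
import java.util.List;

/** Hypothetical router: each NameNode owns one large fixed range of block ids. */
public class FixedRangeBlockRouter {
  private static final long RANGE_SIZE = 1L << 40;  // assumed ids per NN
  private final List<String> nameNodeAddrs;         // index i owns range i

  public FixedRangeBlockRouter(List<String> nameNodeAddrs) {
    this.nameNodeAddrs = nameNodeAddrs;
  }

  /** NN i allocates block ids from [i * RANGE_SIZE, (i + 1) * RANGE_SIZE). */
  public String ownerOf(long blockId) {
    return nameNodeAddrs.get((int) (blockId / RANGE_SIZE));
  }
}
```

With fixed ranges the block id alone identifies the responsible NN, so the NNs need no coordination until a range is exhausted.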

Hope this makes sense?

dsalychev commented 9 years ago

Sure, it makes sense :)

Is there a way to implement an RPC server and plug it into the NameNode easily? Right now I see only one way: extend the existing NameNode and override the protected void initialize(Configuration conf) method to set up and run the RPC server.

My idea is to define GiraffaBlockManagementRpcServer (gbmRpcServer) and provide a method called allocateBlock() for task (1). An early implementation will use the NameNode for block management operations, but I think it will be possible to replace it later.
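A rough sketch of that plumbing (the protocol, config key, and default port are assumptions, and Hadoop RPC's protocol versioning and engine setup are glossed over):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.server.namenode.NameNode;
import org.apache.hadoop.ipc.RPC;

public class GiraffaNameNode extends NameNode {
  /** Hypothetical GBM protocol carrying the allocateBlock() call. */
  public interface GiraffaBlockManagementProtocol {
    long allocateBlock() throws IOException;
  }

  /** Hypothetical server-side implementation, to delegate to NN internals. */
  public static class GiraffaBlockManagementRpcServer
      implements GiraffaBlockManagementProtocol {
    @Override
    public long allocateBlock() throws IOException {
      throw new UnsupportedOperationException("left out of this sketch");
    }
  }

  private RPC.Server gbmRpcServer;

  public GiraffaNameNode(Configuration conf) throws IOException {
    super(conf);
  }

  @Override
  protected void initialize(Configuration conf) throws IOException {
    super.initialize(conf);  // bring up the standard NameNode services first
    gbmRpcServer = new RPC.Builder(conf)  // Hadoop's standard RPC builder
        .setProtocol(GiraffaBlockManagementProtocol.class)
        .setInstance(new GiraffaBlockManagementRpcServer())
        .setBindAddress("0.0.0.0")
        .setPort(conf.getInt("grfa.gbm.rpc.port", 50700))  // assumed config key
        .build();
    gbmRpcServer.start();
  }
}
```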

Do I understand atomic block creation correctly? Should a block be allocated via one RPC request to an NN/GBM? Or do I have to dig deeper to allocate blocks without creating files?

shvachko commented 9 years ago

I was thinking about how we can introduce new calls into the NameNode. In ConsensusNode we replaced the NN rpcServer similarly to what you propose, but it was somewhat tricky; maybe we can avoid that here. Let me open a new issue for task (1) so that we can focus the discussion on just that there. Atomic block creation means a single call to the NN: during this call the NN creates a file, allocates a block, and assigns the blockId as the file name.