Open dsalychev opened 9 years ago
Today that is an issue. In the future we will remove the NameNode entirely.
In Giraffa, the role of the NameNode is purely that of a "Block Manager". It only needs to know the block id, length, and genStamp of blocks reported to it by the DataNodes. It aggregates this block meta info and acts as a central authority for RegionServer(s) to talk to. In the end, "Block Managers" store no persistent data, as all their meta data is restored from DataNode block reports. Again, meta data is stored in HBase.
We have several ideas about what we want to happen in the event of "Block Manager" death, but a more complex solution is to allow multiple active-active "Block Managers". The basic idea is that RegionServers should be able to map "Block Managers" to their respective regions.
As for HBase data, that will be handled specially by a new notion of "Super Blocks" (blocks that must be loaded first to ensure the file system can start). "Super Blocks" will need to exist on all Block Managers, as they comprise all the data that used to be under "/hbase" in HDFS.
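The "Block Manager" role described above can be sketched roughly as follows. This is a hypothetical illustration, not actual Giraffa or HDFS code: `BlockManager`, `BlockMeta`, and `report` are made-up names standing in for the idea that the manager keeps only soft state (block id, length, genStamp) rebuilt entirely from DataNode block reports.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a Block Manager that holds only soft state.
// All names (BlockMeta, BlockManager, report) are illustrative, not Giraffa APIs.
public class BlockManager {

    // The only facts the manager needs per block: id, length, generation stamp.
    static final class BlockMeta {
        final long id;
        final long length;
        final long genStamp;
        BlockMeta(long id, long length, long genStamp) {
            this.id = id;
            this.length = length;
            this.genStamp = genStamp;
        }
    }

    private final Map<Long, BlockMeta> blocks = new HashMap<>();

    // Called for each block a DataNode reports; the map is rebuilt from
    // scratch after a restart, so nothing needs to be persisted here.
    public void report(long id, long length, long genStamp) {
        BlockMeta known = blocks.get(id);
        // Keep the newest generation stamp if the block was already reported.
        if (known == null || genStamp > known.genStamp) {
            blocks.put(id, new BlockMeta(id, length, genStamp));
        }
    }

    public BlockMeta lookup(long id) { return blocks.get(id); }
    public int size() { return blocks.size(); }

    public static void main(String[] args) {
        BlockManager bm = new BlockManager();
        bm.report(1L, 64L * 1024 * 1024, 1001L); // from DataNode A
        bm.report(1L, 64L * 1024 * 1024, 1002L); // re-reported, newer genStamp
        bm.report(2L, 1024L, 1001L);             // from DataNode B
        System.out.println(bm.size() + " blocks, b1 genStamp="
            + bm.lookup(1L).genStamp);
    }
}
```

Since the map is reconstructed purely from reports, losing a Block Manager loses nothing durable, which is what makes the active-active idea plausible.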
I think I've understood a few of the "hot points" in the development, then.
I think that it's a good place to repeat Konstantin's answer to the same question (I asked him via email):
Good question. Giraffa's design does not assume any SPOF at all, even though the current implementation does rely on a single NameNode.
In the current implementation we use the NameNode as a block manager. The idea is to replace it with a block management layer: each RegionServer will have a local block manager (running on the same node), so everything will be partitioned.
We also plan to eliminate HMaster, which is a SPOF of HBase.
I'd like to keep all information about the SPOF problem in Giraffa's context in one place. It might be a wiki page or something similar. Is that a good idea?
Yes, this is one of the "hot points" for Giraffa. But we first need to finish bootstrapping (issue #16). This will make HBase not rely on HDFS for storing its files. Then NameNode's role for Giraffa will be reduced to block management only, and we can replace it with the distributed layer. The two efforts can go in parallel of course, but the dependency still remains.
My idea is to build a layer of distributed BMs on top of a distributed hash table. Each BM server, hosted alongside an HBase RegionServer, represents a node in the DHT. Currently `BlockManagementAgent` relies upon `DistributedFileSystem` to work with blocks. I think it's possible to update `BlockManagementAgent` to use a DHT client instead:

> When the `BlockManagementAgent` needs to access a block b1, it contacts the respective BMServer referring to a file named "gblk_b1". (c) "Outdated design doc"

substitute with:

> When the `BlockManagementAgent` needs to access a block b1, it asks the DHT client to retrieve the block via `dhtClient.get(hash("gblk_b1"))`.
Pros:
Cons:
`BlockManagementAgent` needs;

Any thoughts?
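The proposed lookup path could be sketched roughly like this. The `DhtClient` interface, the `hash` function, and the in-memory store are assumptions for illustration only, not an existing Giraffa or DHT-library API:

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch of the proposed DHT-based lookup. DhtClient, hash(), and
// the in-memory store below are illustrative stand-ins, not real code.
public class DhtLookup {

    // Stable hash of a block key such as "gblk_b1".
    static long hash(String key) {
        long h = 1125899906842597L; // arbitrary prime seed
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i);
        }
        return h;
    }

    // Minimal stand-in for a DHT client: key hash -> value.
    interface DhtClient {
        byte[] get(long keyHash);
        void put(long keyHash, byte[] value);
    }

    static final class InMemoryDht implements DhtClient {
        private final Map<Long, byte[]> store = new HashMap<>();
        public byte[] get(long keyHash) { return store.get(keyHash); }
        public void put(long keyHash, byte[] value) { store.put(keyHash, value); }
    }

    public static void main(String[] args) {
        DhtClient dhtClient = new InMemoryDht();
        dhtClient.put(hash("gblk_b1"), new byte[]{42});

        // What BlockManagementAgent would do instead of contacting a BMServer:
        byte[] block = dhtClient.get(hash("gblk_b1"));
        System.out.println("got block b1, first byte = " + block[0]);
    }
}
```

In a real DHT the `get` would route to whichever BM server owns that key's hash range, so no central lookup is needed.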
Good you asked, Dmitry.
We actually looked at a few hashing approaches for a distributed implementation of the flat block id namespace, like:
We decided to start with the NameNode and gradually evolve it into the Giraffa BlockManager (GBM) by adding the new functionality we need and removing what Giraffa does not need. Here are some thoughts:
There are three groups of functions that GBMs need to support:
Current NameNode does not expose block creation and deletion API (1), but does the other two (2) and (3) quite well. So we implemented (1), currently for a single NameNode.
Currently block creation is done in two steps: `BlockManagementAgent` creates a temporary file, then allocates one empty block in it. Giraffa block deletion thus is implemented as removal of the corresponding file, which triggers the actual block replica removal on the DataNodes.
One of the immediate improvements would be a new API that merges the two block creation steps into one atomic operation.
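Merging the two steps could look roughly like this. This is a hypothetical sketch with an in-memory namespace; `allocateBlock`, `deleteBlock`, and the `gblk_` naming are illustrative, not the actual NameNode API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the proposed atomic API: one call that creates the file,
// allocates the block, and names the file after the block id.
// All names here are illustrative, not actual HDFS/Giraffa code.
public class AtomicBlockCreation {

    static final class Block {
        final long id;
        Block(long id) { this.id = id; }
    }

    private final AtomicLong nextBlockId = new AtomicLong(1);
    private final Map<String, Block> namespace = new HashMap<>();

    // Atomic replacement for the current two-step sequence
    // (create temporary file, then allocate an empty block in it).
    public synchronized Block allocateBlock() {
        long id = nextBlockId.getAndIncrement();
        Block b = new Block(id);
        namespace.put("gblk_" + id, b); // file is named after the block id
        return b;
    }

    // Block deletion stays as file removal, which would trigger
    // replica removal on the DataNodes.
    public synchronized boolean deleteBlock(long id) {
        return namespace.remove("gblk_" + id) != null;
    }

    public static void main(String[] args) {
        AtomicBlockCreation nn = new AtomicBlockCreation();
        Block b = nn.allocateBlock();
        System.out.println("allocated block " + b.id
            + ", deleted=" + nn.deleteBlock(b.id));
    }
}
```

The point of atomicity is that a crash between the two current steps can leave a temporary file with no usable block; a single call removes that window.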
We can use multiple NameNodes so that each of them maintains its own subset of blocks. It is like HDFS federation, where several NameNodes share the same pool of DataNodes. We thought to use a Coordination Engine of HADOOP-10641 for block creation in the following way:
That way a `BlockManagementAgent` can obtain new blocks from a local GBM, represented by a local NN. It also provides locality of the blocks belonging to the files stored in the local region. After a restart some blocks may become remote; we may need to build a relocation utility for such blocks in the future.
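One way to partition the flat block id space across several GBMs is a deterministic ownership function. The sketch below uses assumed names and a plain modulo; the coordination via HADOOP-10641 is not modeled here:

```java
import java.util.List;

// Sketch: deterministic ownership of block ids across several GBMs.
// Allocation happens at the local GBM; lookups route by this function.
// Names are illustrative, not real Giraffa code.
public class GbmPartitioning {

    // All GBMs, e.g. one per RegionServer node.
    private final List<String> gbms;

    public GbmPartitioning(List<String> gbms) { this.gbms = gbms; }

    // Every GBM computes the same owner for a given block id,
    // so no central lookup table is needed.
    public String ownerOf(long blockId) {
        int idx = (int) Math.floorMod(blockId, (long) gbms.size());
        return gbms.get(idx);
    }

    public static void main(String[] args) {
        GbmPartitioning p = new GbmPartitioning(List.of("gbm-a", "gbm-b", "gbm-c"));
        System.out.println("block 7 owned by " + p.ownerOf(7L));
    }
}
```

A fixed modulo reshuffles ownership whenever the GBM set changes, which is exactly the "blocks may become remote" concern above; consistent hashing would shrink the set of blocks needing relocation.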
So I see two immediate tasks in the direction of evolving NameNode into GBM:
Hope this makes sense?
Sure, it makes sense :)
Is there a way to implement an RPC server and plug it into the NameNode easily? Right now I see only one way: extend the existing NameNode and provide an implementation of the protected `initialize(Configuration conf)` method to set up and run the RPC server.
My idea is to define a GiraffaBlockManagementRpcServer (gbmRpcServer) and provide a method called `allocateBlock()` for task (1). An early implementation will use the NameNode for block management operations, but I think it will be possible to replace it later.
Do I understand atomic block creation correctly: a block should be allocated via one RPC request to a NN/GBM? Or do I have to dig deeper into allocating blocks without creating files?
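The extension point described above might look roughly like this. The `NameNode` base class here is a minimal stand-in for the real `org.apache.hadoop.hdfs.server.namenode.NameNode` (whose actual `initialize` contract is more involved), and `gbmRpcServer`/`allocateBlock()` are the hypothetical additions under discussion:

```java
// Sketch of plugging an extra RPC server into the NameNode by subclassing.
// NameNode and Config below are minimal stand-ins so the example is
// self-contained; gbmRpcServer and allocateBlock() are hypothetical.
public class GiraffaNameNodeSketch {

    static final class Config { }

    // Stand-in for org.apache.hadoop.hdfs.server.namenode.NameNode.
    static class NameNode {
        protected void initialize(Config conf) { /* starts standard NN services */ }
    }

    // Hypothetical extra RPC surface for task (1).
    static final class GiraffaBlockManagementRpcServer {
        private long nextBlockId = 1;
        private boolean running;
        void start() { running = true; }
        boolean isRunning() { return running; }
        // Single atomic call: create the file and allocate the block.
        synchronized long allocateBlock() { return nextBlockId++; }
    }

    static class GiraffaNameNode extends NameNode {
        final GiraffaBlockManagementRpcServer gbmRpcServer =
            new GiraffaBlockManagementRpcServer();

        @Override
        protected void initialize(Config conf) {
            super.initialize(conf);   // regular NameNode startup first
            gbmRpcServer.start();     // then bring up the Giraffa RPC server
        }
    }

    // Helper so the startup sequence can be exercised from outside.
    public static String demo() {
        GiraffaNameNode nn = new GiraffaNameNode();
        nn.initialize(new Config());
        return "running=" + nn.gbmRpcServer.isRunning()
            + ", first=" + nn.gbmRpcServer.allocateBlock();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

This mirrors the extend-and-override route mentioned above; as the next comment notes, replacing the NN's rpcServer this way worked in ConsensusNode but was somewhat tricky.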
I was thinking about how we can introduce new calls into the NameNode. In ConsensusNode we did replace the NN rpcServer similarly to what you propose, but it was somewhat tricky; maybe we can avoid it. Let me open a new issue for task (1) so that we can focus the discussion just on that there. Atomic block creation means a single call to the NN: during this call the NN creates a file, allocates a block, and assigns the blockId as the file name.
I'm wondering whether a failure of the underlying HDFS NameNode can crash the HBase layer on top of it, and whether together they would cause Giraffa to stop working. Am I right? Or is there a special trick?