DaviesX opened this issue 7 years ago
Quantcast uses QFS, a C/C++-based distributed file system that is an almost drop-in replacement for HDFS. They seem to be managing a few petabytes of data with QFS. You can refer to their Git wiki for a comparison with HDFS. QFS is based on the Kosmos file system.
Would it be worthwhile to rewrite HDFS, Hadoop/Spark, or HBase in C/C++ to improve computational efficiency?
Java and Scala are generally slower than C/C++ for workloads like these. I wonder whether it would be worthwhile to rewrite the low-level implementation in C/C++ while keeping the user-facing interface in a high-level language.
Yes.
It is very significant. At MapR we have done exactly this with fairly impressive results. We used C++, but in a very Google-ish subset style. Cutting to the final results, the benefits are:
1) much higher performance in terms of bandwidth and throughput
2) much lower and more predictable latency
3) the re-architecture involved in re-implementing HDFS allowed us to rethink many of HDFS's limitations, such as the name-node's inherent single point of failure, the inability to update files at random offsets, and the very limited file-commit rate.
It is impossible to completely disentangle the question of how much benefit the language choice made versus how much benefit the ground up re-architecture effort had, but here are some key points that relate to language:
MapR FS uses a custom threading executive to support tens of thousands of micro-threads, or more, on a single core. This micro-thread idiom allows many locks to be avoided entirely, which massively improves RPC performance. Doing this in a language without complete control over the machine would be incredibly difficult. Doing this in a language with complete control allows the system to work with the memory hierarchy to further enhance performance.
Explicit and tight control over memory allocation allows a smaller memory footprint for the same amount of active memory. This translates either into better performance with equal memory or equal performance with less memory. Either way, this is a big win. This helps cache friendliness as well.
Using C made it much easier to use shared memory for communication within a single node. This relates to the memory allocation, of course. You could do some of this with Java's Unsafe class, but it is far more painful.
C allows arrays of structs. This can be huge in a few narrow spots.
C allows you to know and understand the memory layout of structs. This can help make building disk structures easier.
C allows much easier access to platform-specific features such as libaio or the use of O_DIRECT.
On the other hand, coding at a lower level is more expensive and can be more error-prone if you don't have the right coders. We have the right ones at MapR, but most places don't.
The overall design of the system is that the core file system (which includes the distributed file system and the noSQL database implementation) is written in C++. Wire protocols are built using protobufs uniformly and clients are available in C or Java. In some cases, Java client code uses JNI to access shared memory efficiently.
The results are very impressive. For instance, HDFS can commit meta-data changes a few hundred times per second. In a large minute-sort benchmark, a MapR system was clocked at about 15 million meta-data updates per second. These included file creation, deletion and so on.
Another area of improvement is in overall scalability. Whereas HDFS limits you to around 100 million files for a large cluster and less for smaller clusters, MapR clusters have run with billions of files and no performance degradation.
In production, these benefits typically manifest in the form of smaller clusters required to meet a throughput deliverable with noticeably more predictable and small latencies.
One primary goal for the development was to present the cluster as a fully functional conventional storage system so that legacy code can work interchangeably with Hadoop code. This is done first by exposing a fully distributed NFS interface as well as the standard HDFS API. In addition, MapR systems implement completely atomic snapshots and mirrors. Snapshots capture the instantaneous state of the file system while mirrors propagate that state to a remote replica.
This answer is getting kind of long, but if you have specific questions, I can expand it as needed.