
SOSP'03 | The Google File System #38


ganler commented 3 years ago

https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf

From MIT 6.824 Lecture 3, plus Shivaram's Big Data Systems paper-reading course (CS 744): http://pages.cs.wisc.edu/~shivaram/cs744-fa20/

ganler commented 3 years ago

Google's big three: MapReduce + BigTable + GFS


Motivation: why big storage matters

Why the file abstraction is universal: a file is nothing but a byte string identified by a path.

Insights

Master + Chunk server

Master (one big index table) => metadata: [file name] --> the handles/indices of its chunks --> the chunk servers storing each chunk.
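A minimal sketch of this lookup chain in Go (type and field names are illustrative, not GFS's actual identifiers): the master resolves a file name to chunk handles, and each handle to the servers holding replicas.

```go
package main

import "fmt"

type ChunkHandle uint64

// Master holds the in-memory metadata described above.
type Master struct {
	// file name -> ordered list of chunk handles (one per chunk)
	files map[string][]ChunkHandle
	// chunk handle -> addresses of chunk servers holding a replica
	locations map[ChunkHandle][]string
}

// Lookup maps (file, chunk index) to the servers storing that chunk.
func (m *Master) Lookup(file string, chunkIndex int) (ChunkHandle, []string, error) {
	handles, ok := m.files[file]
	if !ok || chunkIndex >= len(handles) {
		return 0, nil, fmt.Errorf("no chunk %d for %s", chunkIndex, file)
	}
	h := handles[chunkIndex]
	return h, m.locations[h], nil
}

func main() {
	m := &Master{
		files:     map[string][]ChunkHandle{"/logs/web.log": {42, 43}},
		locations: map[ChunkHandle][]string{42: {"cs1:7001", "cs2:7001", "cs3:7001"}},
	}
	h, servers, _ := m.Lookup("/logs/web.log", 0)
	fmt.Println(h, servers) // -> 42 [cs1:7001 cs2:7001 cs3:7001]
}
```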

Challenge 0

Legacy design: the metadata stores [chunk server id + the chunk's index inside that server]. Thus:

any change/write on a chunk server must notify the master so it can update the metadata.

Solution: Locality!

The legacy master metadata: [1. chunk server id, and 2. the index inside that server]

What seldom changes: the server id. What often changes: the index.

Improvement: move what changes frequently to the local (chunk) server; i.e., the master's metadata records only the relevant chunk server ids, not the per-server index (sketched below).
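A sketch contrasting the two layouts, with illustrative Go struct names (these are assumptions for exposition, not the paper's types): in the legacy layout every local relocation forces a master update, while in the improved layout the master keeps only the rarely-changing server list and each chunk server privately maps handles to its own local files.

```go
package layout

type ChunkHandle uint64

// Legacy: the master tracks where the chunk lives *inside* each
// server, so every local write/relocation must notify the master.
type LegacyChunkMeta struct {
	ServerID   string
	LocalIndex int // position within that server -- changes often
}

// Improved: the master only remembers which servers hold the chunk.
type ChunkMeta struct {
	ServerIDs []string // changes rarely (only on server add/remove/failure)
}

// Each chunk server keeps the frequently-changing part locally,
// e.g. handle -> path of the chunk file on its own disk.
type ChunkServer struct {
	localChunks map[ChunkHandle]string
}
```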

Fault Tolerance

How to detect data corruption => checksums.

A chunk consists of many 64 KB blocks; each block gets its own 32-bit checksum.

1 TB file => checksum size: (1 TB / 64 KB) × 32 bits = 2^24 blocks × 4 B = 64 MB

Verify the data against its checksums when reading.
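A minimal sketch of per-block verification on read, assuming 64 KB blocks and 32-bit checksums as above; CRC32 is an illustrative choice here, not necessarily the exact checksum GFS uses.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

const blockSize = 64 * 1024 // 64 KB per checksummed block

// checksums returns one 32-bit checksum per 64 KB block of the chunk.
func checksums(chunk []byte) []uint32 {
	var sums []uint32
	for off := 0; off < len(chunk); off += blockSize {
		end := off + blockSize
		if end > len(chunk) {
			end = len(chunk)
		}
		sums = append(sums, crc32.ChecksumIEEE(chunk[off:end]))
	}
	return sums
}

// verify re-computes checksums at read time and flags corrupted blocks.
func verify(chunk []byte, stored []uint32) error {
	for i, sum := range checksums(chunk) {
		if sum != stored[i] {
			return fmt.Errorf("block %d corrupted", i)
		}
	}
	return nil
}

func main() {
	data := make([]byte, 3*blockSize)
	sums := checksums(data)
	data[blockSize] ^= 0xFF            // simulate one flipped byte in block 1
	fmt.Println(verify(data, sums))    // -> block 1 corrupted
}
```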

What if a server fails?

Replication;

Each chunk is stored on multiple chunk servers as replicas.

Q:

How many replicas?

How to select a chunk server?

How to know whether a chunk server is alive?

Heartbeat;

If a server is gone, how do we recover its data?

Start a recovery process that records every chunk missing replicas and how many replicas each has lost;

then prioritize re-replicating the chunk with the fewest remaining replicas (the one considered most endangered).
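A minimal sketch of this recovery ordering in Go, assuming heartbeat loss has already marked the dead server's chunks as under-replicated and that the replication target is 3 (GFS's default); names are illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

const targetReplicas = 3

type chunkState struct {
	handle       uint64
	liveReplicas int
}

// recoveryOrder selects under-replicated chunks and sorts them so the
// most endangered (fewest remaining replicas) are re-replicated first.
func recoveryOrder(chunks []chunkState) []chunkState {
	var under []chunkState
	for _, c := range chunks {
		if c.liveReplicas < targetReplicas {
			under = append(under, c)
		}
	}
	sort.Slice(under, func(i, j int) bool {
		return under[i].liveReplicas < under[j].liveReplicas
	})
	return under
}

func main() {
	order := recoveryOrder([]chunkState{
		{handle: 1, liveReplicas: 2},
		{handle: 2, liveReplicas: 1}, // most dangerous: recovered first
		{handle: 3, liveReplicas: 3}, // fully replicated: skipped
	})
	fmt.Println(order) // -> chunk 2, then chunk 1
}
```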


With this we have a reasonably safe file system; the next step is to make it even safer.


HotSpot

When lots of reads/writes land on the same chunk, its server becomes a hotspot. Too many reads for one machine! The fix is to place extra replicas of the hot chunk on other chunk servers to spread the load.

Q: How to select the new chunk server? Based on disk utilization, bandwidth, etc. (a scoring sketch follows).
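A minimal sketch of that selection in Go; the additive scoring formula is an illustrative assumption, not the paper's policy, which only says to consider factors such as disk utilization.

```go
package main

import "fmt"

type server struct {
	addr     string
	diskUtil float64 // fraction of disk in use, 0..1
	netUtil  float64 // fraction of bandwidth in use, 0..1
}

// pick returns the candidate with the lowest combined utilization.
func pick(servers []server) server {
	best := servers[0]
	for _, s := range servers[1:] {
		if s.diskUtil+s.netUtil < best.diskUtil+best.netUtil {
			best = s
		}
	}
	return best
}

func main() {
	fmt.Println(pick([]server{
		{"cs1:7001", 0.9, 0.2},
		{"cs2:7001", 0.4, 0.3}, // chosen: least loaded overall
		{"cs3:7001", 0.5, 0.8},
	}))
}
```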


Case Study

Read

Client -> REQ: file name & chunk index -> GFS master -> finds the related chunk servers -> RETURN: chunk handle (the client's authority for that chunk) + chunk locations.

Client -> chunk server: chunk handle + byte range ....... gets the bytes!
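A minimal end-to-end sketch of this read path in Go, with both RPCs stubbed out as plain functions; all names and addresses are illustrative.

```go
package main

import "fmt"

// Step 1: the client asks the master for the chunk handle and locations.
func masterLookup(file string, chunkIndex int) (handle uint64, servers []string) {
	return 42, []string{"cs1:7001", "cs2:7001", "cs3:7001"} // stubbed reply
}

// Step 2: the client reads the byte range from one of the chunk servers.
func chunkRead(server string, handle uint64, offset, length int64) []byte {
	return make([]byte, length) // stubbed reply
}

func read(file string, offset, length int64) []byte {
	const chunkSize = 64 << 20 // 64 MB chunks
	handle, servers := masterLookup(file, int(offset/chunkSize))
	// The client may pick the closest replica; here we just take the first.
	return chunkRead(servers[0], handle, offset%chunkSize, length)
}

func main() {
	fmt.Println(len(read("/logs/web.log", 100, 4096))) // -> 4096
}
```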

Writes

Client -> REQ: file name & chunk index -> GFS master -> RETURN: handles/locations of the primary + the secondary replicas.

1. Client sends the contents to the nearest/fastest server.
2. Once one server starts receiving, it forwards the data to the next replica over the company's internal network (usually much faster than the client's link), and the transfers are pipelined.
3. The servers do not write the data directly to disk; it is staged in a cache (memory or disk).
4. After every replica has cached the data, the client notifies the primary chunk server to commit.
5. The primary commits and then tells the other servers to commit.
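A minimal sketch of this flow in Go, with the RPCs stubbed out and all names/addresses illustrative; the point is the separation of the pipelined data push from the commit through the primary.

```go
package main

import "fmt"

// pushData pipelines the bytes along the replica chain; each server
// buffers the data in a cache, not yet applied to the chunk.
func pushData(chain []string, data []byte) (cacheID string) {
	for _, s := range chain {
		fmt.Println("buffered on", s) // forwarded while still receiving
	}
	return "cache-123"
}

// commitAtPrimary asks the primary to apply the buffered data; the
// primary picks a serial order and forwards it to the secondaries.
func commitAtPrimary(primary, cacheID string) error {
	fmt.Println("primary", primary, "commits", cacheID, "then orders secondaries")
	return nil
}

func write(data []byte) error {
	// Replica chain from the master (stubbed), nearest server first.
	chain := []string{"cs2:7001", "cs1:7001", "cs3:7001"}
	id := pushData(chain, data)
	if err := commitAtPrimary("cs1:7001", id); err != nil {
		// On failure: just report to the client, which re-requests.
		return fmt.Errorf("write failed, please retry: %w", err)
	}
	return nil
}

func main() { _ = write([]byte("hello")) }
```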

Q: What if a write fails?

Just tell the user it failed and ask them to re-request. (The user handles the error, so we avoid introducing new fault-tolerance machinery; distributed systems are complex enough that new fault-tolerance techniques usually bring new troubles...)

Q: Why not do direct writes (i.e., apply the data immediately instead of staging it first)?