
SOSP'03 | The Google File System #38


ganler commented 3 years ago

https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf

From MIT 6.824 Lecture 3, plus Shivaram's Big Data Systems paper-reading course (CS 744): http://pages.cs.wisc.edu/~shivaram/cs744-fa20/

ganler commented 3 years ago

Google's big three: MapReduce + BigTable + GFS


Motivation: why big storage matters

Why the file abstraction is universal: a file is nothing but a byte string identified by a path.

Insights

Master + Chunk server

Master (one big index table) => metadata: [file name] --> the handles/indices of its chunks --> the chunk servers storing each chunk.
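A minimal sketch of this lookup chain in Go (type and field names are illustrative, not GFS's actual identifiers): the master resolves a file name to chunk handles, and each handle to the servers holding replicas.

```go
package main

import "fmt"

type ChunkHandle uint64

// Master holds the in-memory metadata described above.
type Master struct {
	// file name -> ordered list of chunk handles (one per chunk)
	files map[string][]ChunkHandle
	// chunk handle -> addresses of chunk servers holding a replica
	locations map[ChunkHandle][]string
}

// Lookup maps (file, chunk index) to the servers storing that chunk.
func (m *Master) Lookup(file string, chunkIndex int) (ChunkHandle, []string, error) {
	handles, ok := m.files[file]
	if !ok || chunkIndex >= len(handles) {
		return 0, nil, fmt.Errorf("no chunk %d for %s", chunkIndex, file)
	}
	h := handles[chunkIndex]
	return h, m.locations[h], nil
}

func main() {
	m := &Master{
		files:     map[string][]ChunkHandle{"/logs/web.log": {42, 43}},
		locations: map[ChunkHandle][]string{42: {"cs1:7001", "cs2:7001", "cs3:7001"}},
	}
	h, servers, _ := m.Lookup("/logs/web.log", 0)
	fmt.Println(h, servers) // -> 42 [cs1:7001 cs2:7001 cs3:7001]
}
```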

Challenge 0

Legacy design: the metadata stores [chunk server id + the chunk's index inside that server]. Thus:

any change/write on a chunk server must notify the master so it can update the metadata.

Solution: Locality!

The legacy master metadata: [1. chunk server id, and 2. the index inside that server]

What seldom changes: the server id. What often changes: the index.

Improvement: move what changes frequently to the local (chunk) server; i.e., the master's metadata records only the relevant chunk server ids, not the per-server index (sketched below).
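A sketch contrasting the two layouts, with illustrative Go struct names (these are assumptions for exposition, not the paper's types): in the legacy layout every local relocation forces a master update, while in the improved layout the master keeps only the rarely-changing server list and each chunk server privately maps handles to its own local files.

```go
package layout

type ChunkHandle uint64

// Legacy: the master tracks where the chunk lives *inside* each
// server, so every local write/relocation must notify the master.
type LegacyChunkMeta struct {
	ServerID   string
	LocalIndex int // position within that server -- changes often
}

// Improved: the master only remembers which servers hold the chunk.
type ChunkMeta struct {
	ServerIDs []string // changes rarely (only on server add/remove/failure)
}

// Each chunk server keeps the frequently-changing part locally,
// e.g. handle -> path of the chunk file on its own disk.
type ChunkServer struct {
	localChunks map[ChunkHandle]string
}
```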

Fault Tolerance

How to detect data corruption => checksums.

A chunk consists of many 64 KB blocks; each block gets its own 32-bit checksum.

1 TB file => checksum size: (1 TB / 64 KB) × 32 bits = 2^24 blocks × 4 B = 64 MB

Verify the data against its checksums when reading.
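A minimal sketch of per-block verification on read, assuming 64 KB blocks and 32-bit checksums as above; CRC32 is an illustrative choice here, not necessarily the exact checksum GFS uses.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

const blockSize = 64 * 1024 // 64 KB per checksummed block

// checksums returns one 32-bit checksum per 64 KB block of the chunk.
func checksums(chunk []byte) []uint32 {
	var sums []uint32
	for off := 0; off < len(chunk); off += blockSize {
		end := off + blockSize
		if end > len(chunk) {
			end = len(chunk)
		}
		sums = append(sums, crc32.ChecksumIEEE(chunk[off:end]))
	}
	return sums
}

// verify re-computes checksums at read time and flags corrupted blocks.
func verify(chunk []byte, stored []uint32) error {
	for i, sum := range checksums(chunk) {
		if sum != stored[i] {
			return fmt.Errorf("block %d corrupted", i)
		}
	}
	return nil
}

func main() {
	data := make([]byte, 3*blockSize)
	sums := checksums(data)
	data[blockSize] ^= 0xFF            // simulate one flipped byte in block 1
	fmt.Println(verify(data, sums))    // -> block 1 corrupted
}
```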

What if a server fails?

Replication;

Each chunk is stored on multiple chunk servers as replicas.

Q:

How many replicas?

How to select a chunk server?

How to know whether a chunk server is alive?

Heartbeat;

If a server is gone, how do we recover its data?

Start a recovery process that records every chunk missing replicas and how many replicas each has lost;

then prioritize re-replicating the chunk with the fewest remaining replicas (the one considered most endangered).
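A minimal sketch of this recovery ordering in Go, assuming heartbeat loss has already marked the dead server's chunks as under-replicated and that the replication target is 3 (GFS's default); names are illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

const targetReplicas = 3

type chunkState struct {
	handle       uint64
	liveReplicas int
}

// recoveryOrder selects under-replicated chunks and sorts them so the
// most endangered (fewest remaining replicas) are re-replicated first.
func recoveryOrder(chunks []chunkState) []chunkState {
	var under []chunkState
	for _, c := range chunks {
		if c.liveReplicas < targetReplicas {
			under = append(under, c)
		}
	}
	sort.Slice(under, func(i, j int) bool {
		return under[i].liveReplicas < under[j].liveReplicas
	})
	return under
}

func main() {
	order := recoveryOrder([]chunkState{
		{handle: 1, liveReplicas: 2},
		{handle: 2, liveReplicas: 1}, // most dangerous: recovered first
		{handle: 3, liveReplicas: 3}, // fully replicated: skipped
	})
	fmt.Println(order) // -> chunk 2, then chunk 1
}
```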


With this we have a reasonably safe file system; the next step is to make it even safer.


HotSpot

When lots of reads/writes land on the same chunk, its server becomes a hotspot. Too many reads for one machine! The fix is to place extra replicas of the hot chunk on other chunk servers to spread the load.

Q: How to select the new chunk server? Based on disk utilization, bandwidth, etc. (a scoring sketch follows).
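A minimal sketch of that selection in Go; the additive scoring formula is an illustrative assumption, not the paper's policy, which only says to consider factors such as disk utilization.

```go
package main

import "fmt"

type server struct {
	addr     string
	diskUtil float64 // fraction of disk in use, 0..1
	netUtil  float64 // fraction of bandwidth in use, 0..1
}

// pick returns the candidate with the lowest combined utilization.
func pick(servers []server) server {
	best := servers[0]
	for _, s := range servers[1:] {
		if s.diskUtil+s.netUtil < best.diskUtil+best.netUtil {
			best = s
		}
	}
	return best
}

func main() {
	fmt.Println(pick([]server{
		{"cs1:7001", 0.9, 0.2},
		{"cs2:7001", 0.4, 0.3}, // chosen: least loaded overall
		{"cs3:7001", 0.5, 0.8},
	}))
}
```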


Case Study

Read

Client -> REQ: file name & chunk index -> GFS master -> finds the related chunk servers -> RETURN: chunk handle (the client's authority for that chunk) + chunk locations.

Client -> chunk server: chunk handle + byte range ....... gets the bytes!
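A minimal end-to-end sketch of this read path in Go, with both RPCs stubbed out as plain functions; all names and addresses are illustrative.

```go
package main

import "fmt"

// Step 1: the client asks the master for the chunk handle and locations.
func masterLookup(file string, chunkIndex int) (handle uint64, servers []string) {
	return 42, []string{"cs1:7001", "cs2:7001", "cs3:7001"} // stubbed reply
}

// Step 2: the client reads the byte range from one of the chunk servers.
func chunkRead(server string, handle uint64, offset, length int64) []byte {
	return make([]byte, length) // stubbed reply
}

func read(file string, offset, length int64) []byte {
	const chunkSize = 64 << 20 // 64 MB chunks
	handle, servers := masterLookup(file, int(offset/chunkSize))
	// The client may pick the closest replica; here we just take the first.
	return chunkRead(servers[0], handle, offset%chunkSize, length)
}

func main() {
	fmt.Println(len(read("/logs/web.log", 100, 4096))) // -> 4096
}
```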

Writes

Client -> REQ: file name & chunk index -> GFS master -> RETURN: handles/locations of the primary + the secondary replicas.

1. Client sends the contents to the nearest/fastest server.
2. Once one server starts receiving, it forwards the data to the next replica over the company's internal network (usually much faster than the client's link), and the transfers are pipelined.
3. The servers do not write the data directly to disk; it is staged in a cache (memory or disk).
4. After every replica has cached the data, the client notifies the primary chunk server to commit.
5. The primary commits and then tells the other servers to commit.
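A minimal sketch of this flow in Go, with the RPCs stubbed out and all names/addresses illustrative; the point is the separation of the pipelined data push from the commit through the primary.

```go
package main

import "fmt"

// pushData pipelines the bytes along the replica chain; each server
// buffers the data in a cache, not yet applied to the chunk.
func pushData(chain []string, data []byte) (cacheID string) {
	for _, s := range chain {
		fmt.Println("buffered on", s) // forwarded while still receiving
	}
	return "cache-123"
}

// commitAtPrimary asks the primary to apply the buffered data; the
// primary picks a serial order and forwards it to the secondaries.
func commitAtPrimary(primary, cacheID string) error {
	fmt.Println("primary", primary, "commits", cacheID, "then orders secondaries")
	return nil
}

func write(data []byte) error {
	// Replica chain from the master (stubbed), nearest server first.
	chain := []string{"cs2:7001", "cs1:7001", "cs3:7001"}
	id := pushData(chain, data)
	if err := commitAtPrimary("cs1:7001", id); err != nil {
		// On failure: just report to the client, which re-requests.
		return fmt.Errorf("write failed, please retry: %w", err)
	}
	return nil
}

func main() { _ = write([]byte("hello")) }
```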

Q: What if a write fails?

Just tell the user it failed and ask them to re-request. (The user handles the error, so we avoid introducing new fault-tolerance machinery; distributed systems are complex enough that new fault-tolerance techniques usually bring new troubles...)

Q: Why not do direct writes (i.e., apply the data immediately instead of staging it first)?