dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0
13.23k stars 2.99k forks source link

[GraphBolt] Refactor S3-FIFO and add SIEVE, LRU and CLOCK. #7508

Closed mfbalin closed 1 week ago

mfbalin commented 1 week ago

Description

Refactored the code and added the SIEVE, LRU and CLOCK cache policy algorithm.

CLOCK: https://people.csail.mit.edu/saltzer/Multics/MHP-Saltzer-060508/bookcases/M00s/M0104%20074-12).PDF

The S3-FIFO implementation was behaving a bit differently from the official implementation (upto 2% higher miss rate). So I wrote a short script to compare against the official implementation on several datasets as can be seen below. Moreover, I got rid of garbage collection if the ghost map gets too big and instead used a queue for it so that we never have to do extra work sometimes. This is because when using parallelism, we will wait for the slowest partition and when we do overlapping in the dataloader, these delays will cause a bubble in the pipeline.

Now, the behaviors consistently match across different datasets for S3-FIFO:

My miss rates (s3-fifo: 0.115895, lru: 0.165391, sieve: 0.168564, clock: 0.168460), miss rate original: (s3-fifo: 0.119878, lru: 0.165391), fiu/fiu_casa-110108-112108.oracleGeneral.zst: : 8973201it [20:06, 7437.01it/s]
My miss rates (s3-fifo: 0.316334, lru: 0.350072, sieve: 0.409565, clock: 0.349661), miss rate original: (s3-fifo: 0.316330, lru: 0.350072), fiu/fiu_homes-110108-112108.oracleGeneral.zst: : 17836701it [44:14, 6718.70it/s]
My miss rates (s3-fifo: 0.372138, lru: 0.414476, sieve: 0.384086, clock: 0.414683), miss rate original: (s3-fifo: 0.372138, lru: 0.414476), alibabaBlock/io_traces.ns0.oracleGeneral.zst: : 13800000it [35:29, 6481.64it/s]
My miss rates (s3-fifo: 0.220733, lru: 0.223484, sieve: 0.232164, clock: 0.224612), miss rate original: (s3-fifo: 0.220650, lru: 0.223484), tencentBlock/tencentBlock.ns1182.oracleGeneral.zst: : 16024596it [38:18, 6972.53it/s]
My miss rates (s3-fifo: 0.841126, lru: 0.870302, sieve: 0.850204, clock: 0.868771), miss rate original: (s3-fifo: 0.841126, lru: 0.870302), msr/msr_proj_2.oracleGeneral.zst: : 29266482it [1:30:12, 5407.19it/s]
import requests
import zstandard as zstd
from struct import iter_unpack
import torch
from dgl.graphbolt.impl import FeatureCache
from cachemonCache import S3FIFO, LRU
from tqdm import tqdm
import os

dtype = torch.int64
num_partitions = 1

base_URL = "https://ftp.pdl.cmu.edu/pub/datasets/twemcacheWorkload/cacheDatasets/"
datasets = [
    (50000, "fiu/fiu_casa-110108-112108.oracleGeneral.zst"),
    (100000, "fiu/fiu_homes-110108-112108.oracleGeneral.zst"),
    (100000, "alibabaBlock/io_traces.ns0.oracleGeneral.zst"),
    (100000, "tencentBlock/tencentBlock.ns1182.oracleGeneral.zst"),
    (1000000, "msr/msr_proj_2.oracleGeneral.zst"),
]
for capacity, dataset in datasets:
    data = requests.get(os.path.join(base_URL, dataset), stream=False).content
    pbar = tqdm(iter_unpack("=IQIq", zstd.ZstdDecompressor().decompress(data)))
    my_caches = {policy: FeatureCache([capacity, 1], dtype, num_partitions, policy) for policy in ["s3-fifo", "lru", "sieve", "clock"]}
    other_caches = {name: cache(capacity) for name, cache in zip(["s3-fifo", "lru"], [S3FIFO, LRU])}
    for i, (timestamp, obj_id, obj_size, next_access_vtime) in enumerate(pbar):
        for cache in my_caches.values():
            values, missing_index, missing_keys = cache.query(torch.tensor([obj_id]))
            if missing_keys.numel() > 0:
                cache.replace(missing_keys, torch.tensor([obj_size]))
        for cache in other_caches.values():
            val = cache.get(obj_id)
            if val is None:
                cache.put(obj_id, obj_size)
        if i % 1000 == 999:
            pbar.set_description(f"My miss rates ({", ".join([f"{name}: {cache.miss_rate:.6f}" for name, cache in my_caches.items()])}), miss rate original: ({", ".join([f"{name}: {1.0 - cache.n_hit / (i + 1):.6f}" for name, cache in other_caches.items()])}), {dataset}")
dgl-bot commented 1 week ago

To trigger regression tests:

dgl-bot commented 1 week ago

Commit ID: 6df5c68f8ff7a83bad134bc4d24bb6938a8b7d44

Build ID: 1

Status: ❌ CI test failed in Stage [Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: f30814b2dc208bc26c18cf1e3a7a0112932f3fbd

Build ID: 2

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: 955b55a7362ca871c5bcdd3980fd62535b1c3d94

Build ID: 3

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: 18ff8306587dc523e3dffb981beb4c4ee40c21c8

Build ID: 4

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: f0f27ba17e2f8fdf021888f16390a858d08af008

Build ID: 5

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: fb8eaf803d5062f473e0198f30ab87348e1d2499

Build ID: 6

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: 5247dab1bc45691876b0c7ef3784e0683c22f291

Build ID: 7

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: 237268e80d1d97b9d98f2d7bd8d6b1061c7576cc

Build ID: 8

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: f70a3a236d9f46f2dfbb2405bdcf7fe005f562ae

Build ID: 9

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: 1f1332c5a03c5ee0686017f3c5ee448652a29bf5

Build ID: 10

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: c312794b3593cd58a2ced2414bec0ba8581808ec

Build ID: 11

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented 1 week ago

Commit ID: d73e3381fdcf22c59a8994b8b87e9da6f8a641f7

Build ID: 12

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link