The S3-FIFO implementation was behaving a bit differently from the official implementation (up to 2% higher miss rate), so I wrote a short script, included below, to compare it against the official implementation on several datasets. I also removed the garbage-collection pass that ran whenever the ghost map grew too large and instead track ghost entries with a FIFO queue, so no single query ever has to do a burst of extra work. This matters because with multiple partitions we always wait for the slowest partition, and when the dataloader overlaps cache operations with other work, such occasional delays turn into bubbles in the pipeline. A sketch of the idea follows.
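To illustrate the change, here is a minimal Python sketch under my own naming; it is illustrative only and not the actual graphbolt code. The ghost set keeps its keys in a FIFO queue, and an insertion that would exceed the capacity evicts exactly one stale ghost key, so the per-operation cost stays constant instead of spiking on an occasional garbage-collection sweep.

from collections import deque

class GhostQueue:
    """Bounded ghost set: FIFO eviction instead of periodic garbage collection.

    Illustrative sketch only; the real implementation uses different data structures.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()   # insertion order of ghost keys
        self.ghosts = set()    # membership test for recently evicted keys

    def add(self, key):
        # Constant work per insertion: evict exactly one stale ghost when full,
        # instead of an occasional full sweep that stalls a single query.
        if key in self.ghosts:
            return
        if len(self.queue) >= self.capacity:
            self.ghosts.discard(self.queue.popleft())
        self.queue.append(key)
        self.ghosts.add(key)

    def __contains__(self, key):
        return key in self.ghosts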
Now, the behaviors consistently match across different datasets for S3-FIFO:
My miss rates (s3-fifo: 0.115895, lru: 0.165391, sieve: 0.168564, clock: 0.168460), miss rate original: (s3-fifo: 0.119878, lru: 0.165391), fiu/fiu_casa-110108-112108.oracleGeneral.zst: : 8973201it [20:06, 7437.01it/s]
My miss rates (s3-fifo: 0.316334, lru: 0.350072, sieve: 0.409565, clock: 0.349661), miss rate original: (s3-fifo: 0.316330, lru: 0.350072), fiu/fiu_homes-110108-112108.oracleGeneral.zst: : 17836701it [44:14, 6718.70it/s]
My miss rates (s3-fifo: 0.372138, lru: 0.414476, sieve: 0.384086, clock: 0.414683), miss rate original: (s3-fifo: 0.372138, lru: 0.414476), alibabaBlock/io_traces.ns0.oracleGeneral.zst: : 13800000it [35:29, 6481.64it/s]
My miss rates (s3-fifo: 0.220733, lru: 0.223484, sieve: 0.232164, clock: 0.224612), miss rate original: (s3-fifo: 0.220650, lru: 0.223484), tencentBlock/tencentBlock.ns1182.oracleGeneral.zst: : 16024596it [38:18, 6972.53it/s]
My miss rates (s3-fifo: 0.841126, lru: 0.870302, sieve: 0.850204, clock: 0.868771), miss rate original: (s3-fifo: 0.841126, lru: 0.870302), msr/msr_proj_2.oracleGeneral.zst: : 29266482it [1:30:12, 5407.19it/s]
import requests
import zstandard as zstd
from struct import iter_unpack
import torch
from dgl.graphbolt.impl import FeatureCache
from cachemonCache import S3FIFO, LRU
from tqdm import tqdm
import os
dtype = torch.int64
num_partitions = 1
base_URL = "https://ftp.pdl.cmu.edu/pub/datasets/twemcacheWorkload/cacheDatasets/"
datasets = [
(50000, "fiu/fiu_casa-110108-112108.oracleGeneral.zst"),
(100000, "fiu/fiu_homes-110108-112108.oracleGeneral.zst"),
(100000, "alibabaBlock/io_traces.ns0.oracleGeneral.zst"),
(100000, "tencentBlock/tencentBlock.ns1182.oracleGeneral.zst"),
(1000000, "msr/msr_proj_2.oracleGeneral.zst"),
]
for capacity, dataset in datasets:
    # Download the zstd-compressed oracleGeneral trace and iterate over its fixed-size
    # records: uint32 timestamp, uint64 obj_id, uint32 obj_size, int64 next_access_vtime.
    data = requests.get(os.path.join(base_URL, dataset), stream=False).content
    pbar = tqdm(iter_unpack("=IQIq", zstd.ZstdDecompressor().decompress(data)))
    # The FeatureCache policies from this PR vs. the reference cachemonCache implementations.
    my_caches = {policy: FeatureCache([capacity, 1], dtype, num_partitions, policy) for policy in ["s3-fifo", "lru", "sieve", "clock"]}
    other_caches = {name: cache(capacity) for name, cache in zip(["s3-fifo", "lru"], [S3FIFO, LRU])}
    for i, (timestamp, obj_id, obj_size, next_access_vtime) in enumerate(pbar):
        # Replay the access through each FeatureCache: query first, then insert the misses.
        for cache in my_caches.values():
            values, missing_index, missing_keys = cache.query(torch.tensor([obj_id]))
            if missing_keys.numel() > 0:
                cache.replace(missing_keys, torch.tensor([obj_size]))
        # Replay the same access through the reference caches.
        for cache in other_caches.values():
            val = cache.get(obj_id)
            if val is None:
                cache.put(obj_id, obj_size)
        if i % 1000 == 999:
            mine = ", ".join(f"{name}: {cache.miss_rate:.6f}" for name, cache in my_caches.items())
            original = ", ".join(f"{name}: {1.0 - cache.n_hit / (i + 1):.6f}" for name, cache in other_caches.items())
            pbar.set_description(f"My miss rates ({mine}), miss rate original: ({original}), {dataset}")
Description
Refactored the code and added the SIEVE, LRU, and CLOCK cache eviction policies; a short sketch of CLOCK follows the reference below.
CLOCK: https://people.csail.mit.edu/saltzer/Multics/MHP-Saltzer-060508/bookcases/M00s/M0104%20074-12).PDF
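For reference, here is a minimal Python sketch of the CLOCK (second-chance) policy described in the paper above; the names are my own and this is illustrative only, not the graphbolt implementation. On a hit the reference bit is set; on eviction the hand sweeps the buffer, clearing bits and evicting the first unreferenced entry, which approximates LRU with constant bookkeeping per access.

class ClockCache:
    """Minimal CLOCK (second-chance) eviction sketch; illustrative only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = []     # circular buffer of cached keys
        self.ref = {}      # key -> reference bit
        self.values = {}   # key -> cached value
        self.hand = 0      # clock hand position

    def get(self, key):
        if key in self.values:
            self.ref[key] = 1   # mark as recently used on a hit
            return self.values[key]
        return None

    def put(self, key, value):
        if key in self.values:
            self.values[key] = value
            self.ref[key] = 1
            return
        if len(self.keys) < self.capacity:
            self.keys.append(key)
        else:
            # Sweep the hand, giving referenced entries a second chance by
            # clearing their bit; evict the first unreferenced entry found.
            while self.ref[self.keys[self.hand]]:
                self.ref[self.keys[self.hand]] = 0
                self.hand = (self.hand + 1) % self.capacity
            victim = self.keys[self.hand]
            del self.values[victim], self.ref[victim]
            self.keys[self.hand] = key
            self.hand = (self.hand + 1) % self.capacity
        self.values[key] = value
        self.ref[key] = 0   # new entries start unreferenced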