jermainewang opened 2 years ago
We use the table below to track the current practice of cache versioning for the existing datasets and the cases it fails to handle.
Dataset | Current versioning mechanism | Missing versioning behavior | Other issues
---|---|---|---
YelpDataset | | Whether the graph is reordered |
WikiCSDataset | | | The graph is always reordered, which should be optional and default to `False`.
LegacyTUDataset | Append a hash value to the cache file name, i.e. `f"legacy_tu_{dataset_name}_{hash_value}.bin"`, which encodes `name`, `use_pandas`, `hidden_size`, `max_allow_node` | | Since multiple datasets can be loaded through this interface, it makes more sense to keep one cache file per dataset rather than a single cache file that gets overwritten whenever a different dataset is loaded.
TUDataset | Use a different file name for each dataset loaded through this interface, i.e. `f"tu_{dataset_name}.bin"` | |
SST/SSTDataset | Use one file per mode, where mode is one of `["train", "dev", "test", "tiny"]` | |
BAShapeDataset | | Handle different pre-processing options, including `num_base_nodes`, `num_base_edges_per_node`, `num_motifs`, `perturb_ratio`, `seed` |
BACommunityDataset | | Handle different pre-processing options, including `num_base_nodes`, `num_base_edges_per_node`, `num_motifs`, `perturb_ratio`, `num_inter_edges`, `seed` |
TreeCycleDataset | | Handle different pre-processing options, including `tree_height`, `num_motifs`, `cycle_size`, `perturb_ratio`, `seed` |
TreeGridDataset | | Handle different pre-processing options, including `tree_height`, `num_motifs`, `grid_size`, `perturb_ratio`, `seed` |
BA2MotifDataset | | |
SBMMixture/SBMMixtureDataset | Append a hash value to the cache file name, i.e. `f"graphs_{hash_value}.bin"`, which encodes `n_graphs`, `n_nodes`, `n_communities`, `k`, `avg_deg`, `pq`, `rng` | |
RedditDataset | Use two separate directories to cache the variant with self-loops and the variant without | |
AIFBDataset | | Handle different pre-processing options, including `insert_reverse` |
MUTAGDataset | | Handle different pre-processing options, including `insert_reverse` |
BGSDataset | | Handle different pre-processing options, including `insert_reverse` |
AMDataset | | Handle different pre-processing options, including `insert_reverse` |
QM9EdgeDataset/QM9Edge | | |
QM9Dataset/QM9 | | |
QM7bDataset/QM7b | | |
PPIDataset/LegacyPPIDataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | |
PATTERNDataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | |
MiniGCDataset | Append a hash value to the cache file name, i.e. `f"dgl_graph_{hash_value}.bin"`, which encodes `num_graphs`, `min_num_v`, `max_num_v`, `seed` | |
FB15k237Dataset | | Handle different pre-processing options, including `reverse` |
FB15kDataset | | Handle different pre-processing options, including `reverse` |
WN18Dataset | | Handle different pre-processing options, including `reverse` |
KarateClub/KarateClubDataset | | |
ICEWS18/ICEWS18Dataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | |
All datasets that inherit GNNBenchmarkDataset | | | The graph is always reordered, which should be optional and default to `False`.
GINDataset | Append a hash value to the cache file name, i.e. `f"gin_{data_name}_{hash_value}.bin"`, which encodes `name`, `self_loop`, `degree_as_nlabel` | |
GDELT/GDELTDataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | |
FraudYelpDataset | Append a hash value to the cache file name, i.e. `f"_dgl_graph_{hash_value}.bin"`, which encodes `random_seed`, `train_size`, `val_size` | |
FraudAmazonDataset | Append a hash value to the cache file name, i.e. `f"_dgl_graph_{hash_value}.bin"`, which encodes `random_seed`, `train_size`, `val_size` | |
FlickrDataset | | Whether the graph is reordered |
FakeNewsDataset | | Handle different pre-processing options, including `feature_name` |
CLUSTERDataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | |
BitcoinOTC/BitcoinOTCDataset | | |
CiteseerGraphDataset | | Handle different pre-processing options, including `reverse_edge`, `reorder` |
CoraGraphDataset | | Handle different pre-processing options, including `reverse_edge`, `reorder` |
PubmedGraphDataset | | Handle different pre-processing options, including `reverse_edge`, `reorder` |
CoraBinary | | |
AsNodePredDataset | Append a hash value to the cache file name, i.e. `f"graph_{hash_value}.bin"`, which encodes `split_ratio`, `target_ntype`, `dataset.name` | |
AsLinkPredDataset | Append a hash value to the cache file name, i.e. `f"graph_{hash_value}.bin"`, which encodes `neg_ratio`, `split_ratio`, `dataset.name` | |
AsGraphPredDataset | Append a hash value to the cache file name, i.e. `f"graph_{hash_value}.bin"`, which encodes `split_ratio`, `dataset.name` | |
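Several rows above use the hash-based scheme. A minimal sketch of how such a scheme might work (illustrative only; the function name and hashing details are assumptions, not DGL's actual implementation):

```python
import hashlib

def cache_filename(prefix, **options):
    # Serialize the options deterministically so that identical settings
    # always produce the same hash, regardless of keyword order.
    key = repr(sorted(options.items())).encode("utf-8")
    digest = hashlib.sha1(key).hexdigest()[:8]
    return f"{prefix}_{digest}.bin"

# Same options -> same file; any changed option -> a different file.
a = cache_filename("gin_MUTAG", self_loop=True, degree_as_nlabel=False)
b = cache_filename("gin_MUTAG", degree_as_nlabel=False, self_loop=True)
c = cache_filename("gin_MUTAG", self_loop=False, degree_as_nlabel=False)
print(a == b, a == c)  # True False
```

Because every option is folded into the file name, loading with a different option combination automatically misses the stale cache and triggers re-processing.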
In addition, the versioning mechanism should detect:

- Changes in the pre-processing options used to build the cache.
- Changes in DGL's own pre-processing logic across releases.

In both cases, the cache files need to be re-generated.
In general, hashing is an effective way to prevent loading an undesired cached file. Its downside is that many cache files can accumulate when there is a huge number of possible combinations of preprocessing options. One solution is to instead save a small file storing only the hash code (or the preprocessing settings themselves) and use it as a sanity check whenever data loading is attempted. If the check fails, the data is re-processed from scratch.
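The sanity-check idea could be sketched as follows. This is a hypothetical illustration, not DGL code; the file name `info.json` and both function names are assumptions:

```python
import json
import os
import tempfile

def save_cache_info(cache_dir, options):
    # Record the settings used to build the cache next to the cached data.
    with open(os.path.join(cache_dir, "info.json"), "w") as f:
        json.dump(options, f, sort_keys=True)

def cache_is_valid(cache_dir, options):
    # A missing or mismatching record means the cache was built with
    # different settings (or by older code) and must be regenerated.
    path = os.path.join(cache_dir, "info.json")
    if not os.path.exists(path):
        return False
    with open(path) as f:
        return json.load(f) == options

with tempfile.TemporaryDirectory() as d:
    save_cache_info(d, {"reverse": True, "reorder": False})
    print(cache_is_valid(d, {"reverse": True, "reorder": False}))   # True
    print(cache_is_valid(d, {"reverse": False, "reorder": False}))  # False
```

With this layout there is only ever one cache file per dataset, no matter how many option combinations exist; the trade-off is that switching options back and forth re-processes the data each time.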
🚀 Feature
Add versioning to all DGLDatasets to detect:

- Changes in the pre-processing options used to build the cache.
- Changes in DGL's own pre-processing logic across releases.
Motivation
Brought up by #3987, which asks to revert the default reordering behavior of DGL's built-in datasets. The issue is that even after we implement the request, users may still load cached datasets from local disk that do not reflect the latest change. Therefore, we need a versioning mechanism to detect such changes.
cc @mufeili
Alternatives
Use a different dataset folder whenever DGL updates. This could cause excessive disk storage use.
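This alternative could be as simple as keying the cache directory on the release string (a hypothetical sketch; the directory naming is an assumption):

```python
import os

def versioned_cache_dir(root, dgl_version):
    # Each DGL release writes to its own cache folder, so caches produced
    # by older pre-processing code are never reused -- at the cost of
    # keeping one copy of the processed data per release on disk.
    return os.path.join(root, f"dataset_cache_{dgl_version}")

print(versioned_cache_dir("~/.dgl", "0.9.0"))
```

Note this only versions the processing code, not the per-dataset preprocessing options, so it would complement rather than replace the option-level mechanisms tracked in the table.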