jermainewang opened 2 years ago
We use the table below to track the current practice of cache versioning for the existing datasets and the cases it fails to handle.
Dataset | Current versioning mechanism | Missing versioning behavior | Other issues
---|---|---|---
YelpDataset | | Whether the graph is reordered |
WikiCSDataset | | | The graph is always reordered, which should be optional and default to `False`.
LegacyTUDataset | Append a hash value to the cache file name, i.e. `f"legacy_tu_{dataset_name}_{hash_value}.bin"`, which encodes `name`, `use_pandas`, `hidden_size`, `max_allow_node` | | Since multiple datasets can be loaded through this interface, it makes more sense to keep one cache file per dataset rather than a single cache file that gets overwritten whenever a different dataset is loaded.
TUDataset | Use a different file name for each dataset loaded through this interface, i.e. `f"tu_{dataset_name}.bin"` | |
SST/SSTDataset | Use one file per mode, where mode is one of `["train", "dev", "test", "tiny"]` | |
BAShapeDataset | | Handle different pre-processing options, including `num_base_nodes`, `num_base_edges_per_node`, `num_motifs`, `perturb_ratio`, `seed` |
BACommunityDataset | | Handle different pre-processing options, including `num_base_nodes`, `num_base_edges_per_node`, `num_motifs`, `perturb_ratio`, `num_inter_edges`, `seed` |
TreeCycleDataset | | Handle different pre-processing options, including `tree_height`, `num_motifs`, `cycle_size`, `perturb_ratio`, `seed` |
TreeGridDataset | | Handle different pre-processing options, including `tree_height`, `num_motifs`, `grid_size`, `perturb_ratio`, `seed` |
BA2MotifDataset | | |
SBMMixture/SBMMixtureDataset | Append a hash value to the cache file name, i.e. `f"graphs_{hash_value}.bin"`, which encodes `n_graphs`, `n_nodes`, `n_communities`, `k`, `avg_deg`, `pq`, `rng` | |
RedditDataset | Use two separate directories to cache the variant with self-loops and the variant without | |
AIFBDataset | | Handle different pre-processing options, including `insert_reverse` |
MUTAGDataset | | Handle different pre-processing options, including `insert_reverse` |
BGSDataset | | Handle different pre-processing options, including `insert_reverse` |
AMDataset | | Handle different pre-processing options, including `insert_reverse` |
QM9EdgeDataset/QM9Edge | | |
QM9Dataset/QM9 | | |
QM7bDataset/QM7b | | |
PPIDataset/LegacyPPIDataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | |
PATTERNDataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | |
MiniGCDataset | Append a hash value to the cache file name, i.e. `f"dgl_graph_{hash_value}.bin"`, which encodes `num_graphs`, `min_num_v`, `max_num_v`, `seed` | |
FB15k237Dataset | | Handle different pre-processing options, including `reverse` |
FB15kDataset | | Handle different pre-processing options, including `reverse` |
WN18Dataset | | Handle different pre-processing options, including `reverse` |
KarateClub/KarateClubDataset | | |
ICEWS18/ICEWS18Dataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | |
All datasets that inherit GNNBenchmarkDataset | | | The graph is always reordered, which should be optional and default to `False`.
GINDataset | Append a hash value to the cache file name, i.e. `f"gin_{data_name}_{hash_value}.bin"`, which encodes `name`, `self_loop`, `degree_as_nlabel` | |
GDELT/GDELTDataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | |
FraudYelpDataset | Append a hash value to the cache file name, i.e. `f"_dgl_graph_{hash_value}.bin"`, which encodes `random_seed`, `train_size`, `val_size` | |
FraudAmazonDataset | Append a hash value to the cache file name, i.e. `f"_dgl_graph_{hash_value}.bin"`, which encodes `random_seed`, `train_size`, `val_size` | |
FlickrDataset | | Whether the graph is reordered |
FakeNewsDataset | | Handle different pre-processing options, including `feature_name` |
CLUSTERDataset | Use separate files to cache the data for each mode (`"train"`, `"valid"`, `"test"`) | |
BitcoinOTC/BitcoinOTCDataset | | |
CiteseerGraphDataset | | Handle different pre-processing options, including `reverse_edge`, `reorder` |
CoraGraphDataset | | Handle different pre-processing options, including `reverse_edge`, `reorder` |
PubmedGraphDataset | | Handle different pre-processing options, including `reverse_edge`, `reorder` |
CoraBinary | | |
AsNodePredDataset | Append a hash value to the cache file name, i.e. `f"graph_{hash_value}.bin"`, which encodes `split_ratio`, `target_ntype`, `dataset.name` | |
AsLinkPredDataset | Append a hash value to the cache file name, i.e. `f"graph_{hash_value}.bin"`, which encodes `neg_ratio`, `split_ratio`, `dataset.name` | |
AsGraphPredDataset | Append a hash value to the cache file name, i.e. `f"graph_{hash_value}.bin"`, which encodes `split_ratio`, `dataset.name` | |
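Several rows above use the hash-based scheme. A minimal sketch of how such a scheme might work (illustrative only; the function name and hashing details are assumptions, not DGL's actual implementation):

```python
import hashlib

def cache_filename(prefix, **options):
    # Serialize the options deterministically so that identical settings
    # always produce the same hash, regardless of keyword order.
    key = repr(sorted(options.items())).encode("utf-8")
    digest = hashlib.sha1(key).hexdigest()[:8]
    return f"{prefix}_{digest}.bin"

# Same options -> same file; any changed option -> a different file.
a = cache_filename("gin_MUTAG", self_loop=True, degree_as_nlabel=False)
b = cache_filename("gin_MUTAG", degree_as_nlabel=False, self_loop=True)
c = cache_filename("gin_MUTAG", self_loop=False, degree_as_nlabel=False)
print(a == b, a == c)  # True False
```

Because every option is folded into the file name, loading with a different option combination automatically misses the stale cache and triggers re-processing.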
In addition, the versioning mechanism should detect:

- Changes in the pre-processing options used to build the cache.
- Changes in DGL's own pre-processing logic across releases.

In both cases, the cache files need to be re-generated.
In general, hashing is an effective way to prevent loading an undesired cached file. Its downside is that many cache files can accumulate when there is a huge number of possible combinations of preprocessing options. One solution is to instead save a small file storing only the hash code (or the preprocessing settings themselves) and use it as a sanity check whenever data loading is attempted. If the check fails, the data is re-processed from scratch.
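The sanity-check idea could be sketched as follows. This is a hypothetical illustration, not DGL code; the file name `info.json` and both function names are assumptions:

```python
import json
import os
import tempfile

def save_cache_info(cache_dir, options):
    # Record the settings used to build the cache next to the cached data.
    with open(os.path.join(cache_dir, "info.json"), "w") as f:
        json.dump(options, f, sort_keys=True)

def cache_is_valid(cache_dir, options):
    # A missing or mismatching record means the cache was built with
    # different settings (or by older code) and must be regenerated.
    path = os.path.join(cache_dir, "info.json")
    if not os.path.exists(path):
        return False
    with open(path) as f:
        return json.load(f) == options

with tempfile.TemporaryDirectory() as d:
    save_cache_info(d, {"reverse": True, "reorder": False})
    print(cache_is_valid(d, {"reverse": True, "reorder": False}))   # True
    print(cache_is_valid(d, {"reverse": False, "reorder": False}))  # False
```

With this layout there is only ever one cache file per dataset, no matter how many option combinations exist; the trade-off is that switching options back and forth re-processes the data each time.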
🚀 Feature
Add versioning to all DGLDatasets to detect:

- Changes in the pre-processing options used to build the cache.
- Changes in DGL's own pre-processing logic across releases.
Motivation
Brought up by #3987, which asks to revert the default reordering behavior of DGL's built-in datasets. The issue is that even after we implement the request, users may still load cached datasets from local disk that do not reflect the latest change. Therefore, we need a versioning mechanism to detect such changes.
cc @mufeili
Alternatives
Use a different dataset folder whenever DGL updates. This could cause excessive disk storage use.
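This alternative could be as simple as keying the cache directory on the release string (a hypothetical sketch; the directory naming is an assumption):

```python
import os

def versioned_cache_dir(root, dgl_version):
    # Each DGL release writes to its own cache folder, so caches produced
    # by older pre-processing code are never reused -- at the cost of
    # keeping one copy of the processed data per release on disk.
    return os.path.join(root, f"dataset_cache_{dgl_version}")

print(versioned_cache_dir("~/.dgl", "0.9.0"))
```

Note this only versions the processing code, not the per-dataset preprocessing options, so it would complement rather than replace the option-level mechanisms tracked in the table.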