dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.

http://dgl.ai

Apache License 2.0

[Roadmap][Dataset] Dataset related API #849

Closed VoVAllen closed 2 years ago

VoVAllen commented 5 years ago

🚀 Feature

The goal is to use the recently merged DGL data format #728 to host popular datasets.

Dataset

[x] KarateClub
[x] TUDataset
[x] Planetoid
[x] CoraFull
[x] Coauthor
[x] Amazon
[x] PPI
[x] Reddit
[x] QM7b
[ ] QM9
[ ] Entities
[ ] GEDDataset
[x] BitcoinOTC
[x] ICEWS18
[x] GDELT

CV/Mesh

[ ] MNISTSuperpixels
[ ] FAUST
[ ] DynamicFAUST
[ ] ShapeNet
[ ] ModelNet
[ ] CoMA
[ ] SHREC2016
[ ] TOSCA
[ ] PCPNet dataset
[ ] S3DIS
[ ] GeometricShapes
[ ] WILLOWObjectClass
[ ] PascalVOCKeypoints

NLP

[ ] DBP15K

I will not include the mesh/CV/point cloud datasets in this batch (there seem to be too many, and I don't know which ones to include). If anyone needs any of them, please vote in the comments and I'll try to implement them.

Other Transform

[x] AddSelfLoops
[x] RemoveSelfLoops
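
For readers unfamiliar with the two transforms, here is a minimal sketch of what they do on a plain edge list. It is not the DGL implementation: the function names, the `(src, dst)` tuple representation, and the choice to drop existing self-loops before adding new ones are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical edge representation: (source node id, destination node id).
Edge = Tuple[int, int]


def remove_self_loops(edges: List[Edge]) -> List[Edge]:
    """Return the edges with every (n, n) entry dropped."""
    return [(u, v) for u, v in edges if u != v]


def add_self_loops(edges: List[Edge], num_nodes: int) -> List[Edge]:
    """Return the edges plus exactly one self-loop per node.

    Assumption: existing self-loops are removed first so that no node
    ends up with a duplicated loop.
    """
    return remove_self_loops(edges) + [(n, n) for n in range(num_nodes)]


if __name__ == "__main__":
    toy_edges = [(0, 1), (1, 1), (2, 0)]
    print(remove_self_loops(toy_edges))
    print(add_self_loops(toy_edges, num_nodes=3))
```

Whether an "add" transform should first remove existing loops is a design choice; doing so keeps the result idempotent when the transform is applied twice.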

jermainewang commented 5 years ago

The list is pretty long. Maybe we could prioritize some of them?

VoVAllen commented 5 years ago

I plan to implement most of the non-CV datasets. I'm not sure how far we should go with the CV/point cloud datasets, because many of them are used only in the corresponding paper.

zheng-da commented 5 years ago

I think we should include KG datasets.

[ ] FB15k
[ ] WN18
[ ] FB15k-237
[ ] WN18RR

Another very important thing is documentation of the datasets. I think we should describe each dataset: how it was constructed, what the nodes and edges represent, what features the nodes and edges carry, and some simple statistics (#nodes, #edges, #node types, #edge types, etc.).

jermainewang commented 5 years ago

A summary table like this would be nice:

| Dataset | Example Usage | #Samples | Avg #Nodes | Avg #Edges | Avg #NType | Avg #EType |
| --- | --- | --- | --- | --- | --- | --- |
| CoraFull | `data = dgl.data.CoraFull()` | | | | | |

The dataset name would link to the docstring, which gives more details, such as how the dataset is constructed (what the nodes and edges represent) and what the features are. I think Stanford SNAP has quite a nice organization that we could borrow from.
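
One way such a summary row could be generated is sketched below. It avoids any DGL calls: `GraphSample`, `summarize`, and `to_markdown_row` are hypothetical helpers, and the default type names `"_N"` and `"_E"` are placeholders chosen for this example. A real version would read the same counts from the loaded dataset objects.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class GraphSample:
    """Hypothetical stand-in for one graph in a dataset (not a DGL class)."""
    num_nodes: int
    edges: Sequence[Tuple[int, int]]
    node_types: Sequence[str] = ("_N",)  # placeholder type name
    edge_types: Sequence[str] = ("_E",)  # placeholder type name


def _avg(values: List[int]) -> float:
    """Arithmetic mean as a float."""
    return sum(values) / len(values)


def summarize(name: str, usage: str, samples: Sequence[GraphSample]) -> Dict[str, object]:
    """Compute one row of the summary table proposed in the thread."""
    if not samples:
        raise ValueError("cannot summarize an empty dataset")
    return {
        "Dataset": name,
        "Example Usage": f"`{usage}`",
        "#Samples": len(samples),
        "Avg #Nodes": _avg([g.num_nodes for g in samples]),
        "Avg #Edges": _avg([len(g.edges) for g in samples]),
        "Avg #NType": _avg([len(set(g.node_types)) for g in samples]),
        "Avg #EType": _avg([len(set(g.edge_types)) for g in samples]),
    }


def to_markdown_row(row: Dict[str, object]) -> str:
    """Render a summary row as one line of a pipe-delimited table."""
    return "| " + " | ".join(str(value) for value in row.values()) + " |"


if __name__ == "__main__":
    toy = [
        GraphSample(3, [(0, 1), (1, 2)]),
        GraphSample(5, [(0, 1), (1, 2), (2, 3), (3, 4)]),
    ]
    print(to_markdown_row(summarize("Toy", "data = load_toy()", toy)))
```

Generating the rows from the data rather than writing them by hand would keep the table in sync with the datasets as they are added.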

VoVAllen commented 5 years ago

I cannot find the raw GEDDataset. QM7b and QM9 need gdata support.

mufeili commented 5 years ago

Maybe also add a bullet point for a tutorial on using the data format?

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you