VoVAllen closed 2 years ago
The list is pretty long. Maybe we could prioritize some of them?
I plan to implement most non-CV datasets. I'm not sure how far we should go for CV/point cloud datasets, because many of them are only used in the paper that introduced them.
I think we should include KG datasets.
Another very important thing is documentation of the datasets. I think we should describe each dataset: how it was constructed, what the nodes and edges represent, what features the nodes and edges carry, and some simple statistics (#nodes, #edges, #node types, #edge types, etc.).
A summary table like this would be nice:
Dataset | Example Usage | #Samples | Avg #Nodes | Avg #Edges | Avg #NType | Avg #EType |
---|---|---|---|---|---|---|
CoraFull | `data = dgl.data.CoraFull()` | | | | | |
The dataset name would link to the docstring, which gives more detail: how the dataset was constructed (what the nodes and edges are), what the features are, and so on. I think Stanford SNAP has a nice organization that we could borrow from.
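To make the proposed table concrete, here is a minimal sketch of how one row could be computed. It does not use the DGL API: `GraphSample`, `summarize`, and `load_toy` are hypothetical names invented for illustration, and each graph is assumed to be a plain list of node types plus typed edges.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Sequence, Tuple


@dataclass
class GraphSample:
    """Hypothetical stand-in for one graph in a dataset (not a DGL class)."""
    node_types: Sequence[str]               # one entry per node
    edges: Sequence[Tuple[int, int, str]]   # (src, dst, edge_type)


def summarize(name: str, usage: str, samples: List[GraphSample]) -> str:
    """Return one row of the summary table for the given samples."""
    avg_nodes = mean(len(g.node_types) for g in samples)
    avg_edges = mean(len(g.edges) for g in samples)
    # Distinct node/edge types are counted per graph, then averaged.
    avg_ntypes = mean(len(set(g.node_types)) for g in samples)
    avg_etypes = mean(len({etype for _, _, etype in g.edges}) for g in samples)
    cells = [
        name, usage, str(len(samples)),
        f"{avg_nodes:.1f}", f"{avg_edges:.1f}",
        f"{avg_ntypes:.1f}", f"{avg_etypes:.1f}",
    ]
    return " | ".join(cells) + " |"


# Toy data only; these are not statistics of any real dataset.
toy = [
    GraphSample(["paper", "paper", "author"], [(0, 1, "cites"), (2, 0, "writes")]),
    GraphSample(["paper", "paper"], [(0, 1, "cites")]),
]
row = summarize("ToyDataset", "data = load_toy()", toy)
print(row)
```

A real version would read these counts from the loaded dataset objects instead of a hand-built list, so the table could be regenerated whenever a dataset is added.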
Cannot find raw GEDDataset. QM7 and QM9b need gdata support.
Maybe also add a bullet point for a tutorial on using the data format?
🚀 Feature
The goal is to use the recently merged DGL data format #728 to host popular datasets.
Dataset
CV/Mesh
NLP
I will not include mesh/CV/point cloud datasets in this batch (there seem to be too many, and I don't know which to include). If anyone needs any of those, please vote in the comments, and I'll try to implement them.
Other Transform