dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

ogbn-arxiv text features don't match #7270

Open devinbost opened 6 months ago

devinbost commented 6 months ago

I noticed that the number of node abstracts in https://snap.stanford.edu/ogb/data/misc/ogbn_arxiv/titleabs.tsv.gz (179,719) is close to, but does not match, the number of nodes in the DGL graph obtained from DglNodePropPredDataset(name='ogbn-arxiv') (169,343). It's not yet clear to me why there is a discrepancy, and it makes it difficult to map the text features to the nodes in the DGL graph. Any explanation would be helpful.

TristonNV commented 6 months ago

This issue has been addressed in https://github.com/snap-stanford/ogb/issues/222. The reason is that not every paper listed in the node abstracts file is mapped to a node in the graph; one example is paper id 200971, titled "ontology as a source for rule generation". The actual number of nodes can be found in the "nodeidx2paperid.csv" file, which is in the arxiv/mapping folder after unzipping the dataset.
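
As a rough illustration of aligning the text with the graph nodes, here is a minimal sketch using only the Python standard library. It assumes that the mapping file is a CSV with a header row and two columns (node index, paper id), and that the abstracts file is a headerless TSV with three columns (paper id, title, abstract); neither layout is confirmed in this thread, so check the actual files first. The function names are hypothetical, and the sample rows are made up for the demo, apart from paper id 200971 and its title, which are quoted above.

```python
import csv
import io


def load_node_to_paper(mapping_file):
    """Read node index -> paper id from a CSV with a header row (assumed layout)."""
    reader = csv.reader(mapping_file)
    next(reader)  # skip the header row
    return {int(node_idx): int(paper_id) for node_idx, paper_id in reader}


def load_paper_text(text_file):
    """Read paper id -> (title, abstract) from a headerless TSV (assumed layout)."""
    reader = csv.reader(text_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    texts = {}
    for row in reader:
        if len(row) != 3 or not row[0].isdigit():
            continue  # skip rows that do not fit the assumed 3-column layout
        texts[int(row[0])] = (row[1], row[2])
    return texts


def align_text_to_nodes(node_to_paper, paper_text):
    """Return a list whose i-th entry is the (title, abstract) of the i-th node.

    Entries are None for nodes whose paper id has no text row.
    """
    return [paper_text.get(node_to_paper[i]) for i in sorted(node_to_paper)]


# Made-up sample data standing in for the real files.
mapping_csv = io.StringIO("node idx,paper id\n0,111\n1,222\n")
text_tsv = io.StringIO(
    "111\ttitle a\tabstract a\n"
    "200971\tontology as a source for rule generation\tplaceholder abstract\n"
    "222\ttitle b\tabstract b\n"
)

node_to_paper = load_node_to_paper(mapping_csv)
paper_text = load_paper_text(text_tsv)
aligned = align_text_to_nodes(node_to_paper, paper_text)

# Papers that have text but no node in the graph account for the count gap.
unmapped = set(paper_text) - set(node_to_paper.values())
```

For the real data, the same functions could be given file handles opened with `gzip.open(path, "rt")` if the files are gzip-compressed. Iterating over the mapping rather than the abstracts file means the extra abstracts are simply ignored.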

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you.