Open Jn-Huang opened 9 months ago
Thanks @Jn-Huang for the inputs!
Why do we want to save the optional raw texts into a dictionary? Imo, a more consistent way is to treat every kind of raw text as a normal (node/edge/graph) attribute. For example, if the nodes are papers, then titles are one node attribute, and abstracts are another node attribute.
What do you think?
@xingjian-zhang Thank you for your comment! Yes, I think this is a good point; saving the node raw texts as extra node attributes will be better. I will update this PR.
@xingjian-zhang Hi! I have updated the implementation as we discussed. Could you please take a look when you are available? Thanks!
The metadata is now updated as
{
    "description": "CORA dataset.",
    "data": {
        "Node": {
            "NodeFeature": {
                "description": "Node features of Cora dataset, 1/0-valued vectors.",
                "type": "int",
                "format": "SparseTensor",
                "file": "cora__graph__Node_NodeFeature__7032c9c380d1889061dcbbcd76b8c427.sparse.npz"
            },
            <------------ New ------------ >
            "NodeRawTextTitle": {
                "description": "Raw text of title of each node in Cora dataset, list of strings.",
                "type": "str",
                "format": "List[str]",
                "optional file": "cora__graph__Node_NodeRawTextTitle__4a9ad6575f5acfe3b828fe66f072bd5c.optional.npz",
                "key": "Node_NodeRawTextTitle"
            },
            "NodeRawTextAbstract": {
                "description": "Raw text of abstract of each node in Cora dataset, list of strings.",
                "type": "str",
                "format": "List[str]",
                "optional file": "cora__graph__Node_NodeRawTextAbstract__d0e5436087314624c74a9f040d6f394f.optional.npz",
                "key": "Node_NodeRawTextAbstract"
            },
            "NodeRawTextLabel": {
                "description": "Raw text of label of each node in Cora dataset, list of strings.",
                "type": "str",
                "format": "List[str]",
                "optional file": "cora__graph__Node_NodeRawTextLabel__06d184316789acc0902db2b8c1472f95.optional.npz",
                "key": "Node_NodeRawTextLabel"
            }
            <------------ New ------------ >
        },
        # Other Attributes
    }
}
where the raw texts related to nodes are saved as node attributes.
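In the metadata above, regular attributes carry a "file" entry while raw-text attributes carry "optional file" plus "key". A minimal sketch of a consistency check over such entries is given below. The function name `check_attribute_entry`, the rules it enforces, and the shortened file names are assumptions for illustration, not the checks implemented in test_metadata.py.

```python
def check_attribute_entry(name, entry):
    """Return a list of problems found in one attribute entry (empty if none)."""
    problems = []
    has_file = "file" in entry
    has_optional = "optional file" in entry
    # Assumed rule: an attribute points at exactly one of the two file kinds.
    if has_file == has_optional:
        problems.append(f"{name}: expected exactly one of 'file' or 'optional file'")
    # Assumed rule: an optional file needs a key to locate the array inside it.
    if has_optional and "key" not in entry:
        problems.append(f"{name}: 'optional file' requires a 'key'")
    for field in ("description", "type", "format"):
        if field not in entry:
            problems.append(f"{name}: missing '{field}'")
    return problems


node_section = {
    "NodeFeature": {
        "description": "Node features of Cora dataset, 1/0-valued vectors.",
        "type": "int",
        "format": "SparseTensor",
        "file": "features.sparse.npz",  # placeholder file name
    },
    "NodeRawTextTitle": {
        "description": "Raw text of title of each node in Cora dataset, list of strings.",
        "type": "str",
        "format": "List[str]",
        "optional file": "titles.optional.npz",  # placeholder file name
        "key": "Node_NodeRawTextTitle",
    },
}

all_problems = [
    problem
    for name, entry in node_section.items()
    for problem in check_attribute_entry(name, entry)
]
print(all_problems)
```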
Dataset contributors can save such raw text by defining extra node attributes:
node_attrs = [
    Attribute(
        "NodeFeature",
        node_feats,
        "Node features of Cora dataset, 1/0-valued vectors.",
        "int",
        "SparseTensor",
    ),
    # <------------ New ------------ >
    Attribute(
        "NodeRawTextTitle",
        raw_text_dict["title"],
        "Raw text of title of each node in Cora dataset, list of strings.",
        "str",
        "List[str]"
    ),
    Attribute(
        "NodeRawTextAbstract",
        raw_text_dict["abs"],
        "Raw text of abstract of each node in Cora dataset, list of strings.",
        "str",
        "List[str]"
    ),
    Attribute(
        "NodeRawTextLabel",
        raw_text_dict["label"],
        "Raw text of label of each node in Cora dataset, list of strings.",
        "str",
        "List[str]"
    )
    # <------------ New ------------ >
]
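How the strings end up on disk is not shown in this comment. As a rough illustration only, a list of strings can be stored in an .npz file under a key and read back with NumPy. Whether GLI's saving code does it this way is an assumption, and the helper names, path, and toy data below are made up for the example.

```python
import os
import tempfile

import numpy as np


def save_text_attribute(path, key, texts):
    """Store a list of strings in an .npz file under the given key."""
    # A fixed-width unicode array can be saved without pickling.
    np.savez_compressed(path, **{key: np.array(texts, dtype=str)})


def load_text_attribute(path, key):
    """Read the strings back as a plain Python list."""
    with np.load(path) as archive:
        return archive[key].tolist()


titles = ["Title: first paper", "Title: second paper"]  # toy data
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "titles.optional.npz")
    save_text_attribute(path, "Node_NodeRawTextTitle", titles)
    restored = load_text_attribute(path, "Node_NodeRawTextTitle")

print(restored == titles)
```

The key plays the same role as the "key" field in the metadata: it names the array inside the archive.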
Users who want to load a dataset with raw text can simply do
dataset = get_gli_dataset("cora",
                          "NodeClassification",
                          load_raw_text=True,  # <--- New
                          verbose=True)
data = dataset[0]
And the raw texts are stored in
data.NodeRawTextTitle[0], data.NodeRawTextAbstract[0], data.NodeRawTextLabel[0]
Output:
('Title: The megaprior heuristic for discovering protein sequence patterns ',
'Abstract: Several computer algorithms for discovering patterns in groups of protein sequences are in use that are based on fitting the parameters of a statistical model to a group of related sequences. These include hidden Markov model (HMM) algorithms for multiple sequence alignment, and the MEME and Gibbs sampler algorithms for discovering motifs. These algorithms are sometimes prone to producing models that are incorrect because two or more patterns have been combined. The statistical model produced in this situation is a convex combination (weighted average) of two or more different models. This paper presents a solution to the problem of convex combinations in the form of a heuristic based on using extremely low variance Dirichlet mixture priors as part of the statistical model. This heuristic, which we call the megaprior heuristic, increases the strength (i.e., decreases the variance) of the prior in proportion to the size of the sequence dataset. This causes each column in the final model to strongly resemble the mean of a single component of the prior, regardless of the size of the dataset. We describe the cause of the convex combination problem, analyze it mathematically, motivate and describe the implementation of the megaprior heuristic, and show how it can effectively eliminate the problem of convex combinations in protein sequence pattern discovery. ',
'Neural Networks')
Note that we cannot save raw texts in data.ndata here, because dgl enforces that each element in ndata is a tensor, and it is not good practice to save lists of strings as tensors.
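To make the last point concrete: strings have no numeric form of their own, so packing them into a tensor would mean something like padding their byte encodings to a common length. The pure-Python sketch below involves no dgl, and its names and toy data are illustrative. It shows how much of such a rectangular layout is padding when texts differ in length, as labels and abstracts do.

```python
def encode_padded(texts, pad=0):
    """Encode strings as equal-length rows of byte values, padding the short ones."""
    rows = [list(t.encode("utf-8")) for t in texts]
    width = max(len(r) for r in rows)
    return [r + [pad] * (width - len(r)) for r in rows]


texts = ["Neural Networks", "A much longer abstract " * 20]  # toy data
matrix = encode_padded(texts)

total_cells = len(matrix) * len(matrix[0])
used_cells = sum(len(t.encode("utf-8")) for t in texts)
print(f"{total_cells - used_cells} of {total_cells} cells are padding")
```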
Similar tests were conducted for this version of the implementation.
In this PR, we
- add RawText in parallel with Node, Edge and Graph.
- add gli/raw_text_utils.py. This file will help the saving of datasets with raw texts.
- update Cora dataset's metadata to accommodate the raw texts.
- update test_metadata.py to accommodate the cases where we use optional file instead of file in metadata.json.

Subsequently, after this PR is merged, we will update other datasets with raw text: ogbn-arxiv, pubmed, ogbn-product, arxiv-2023.

Description (Outdated, see below comments for the newest version)
For dataset contributors with raw text.
The updated Cora metadata comes with an extra entry RawText:
The generation of this extra entry is simple: dataset contributors can save a dictionary of raw texts by passing raw_text_attrs to save_graph:
For generality, dataset contributors can also define other raw text dictionaries, such as EdgeRawText.
For users who want to load a dataset with raw text.
Users can load the dataset with raw text by passing an optional argument load_raw_text to get_gli_dataset. This argument will download the .npz file for raw text, if not downloaded yet.
Then the raw text will be returned in the dictionary data.NodeRawText['RawText_NodeRawText']:
Output:
Related Issue
See issue.
Motivation and Context
Support the loading of raw text in the GLI framework.
How Has This Been Tested?
Get a dataset, but do not load raw text. Raw text file should not be downloaded.
At the same time the raw text file is not downloaded.
Get a dataset and load raw text. Raw text file should be downloaded.
Load a dataset that has no raw text. The dataset should simply load without raw text.
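The cases above amount to a simple rule: regular files are always needed, optional raw-text files are needed only when requested, and nothing is fetched twice. A sketch of that rule follows. The helper `files_to_download` and the placeholder file names are hypothetical and stand in for whatever get_gli_dataset does internally.

```python
import tempfile
from pathlib import Path


def files_to_download(attributes, cache_dir, load_raw_text=False):
    """List the files that are needed but not yet present in cache_dir."""
    needed = []
    for entry in attributes.values():
        if "file" in entry:
            needed.append(entry["file"])
        elif load_raw_text and "optional file" in entry:
            needed.append(entry["optional file"])
    return [name for name in needed if not (Path(cache_dir) / name).exists()]


attributes = {
    "NodeFeature": {"file": "features.sparse.npz"},
    "NodeRawTextTitle": {
        "optional file": "titles.optional.npz",
        "key": "Node_NodeRawTextTitle",
    },
}

with tempfile.TemporaryDirectory() as cache:
    without_text = files_to_download(attributes, cache)
    with_text = files_to_download(attributes, cache, load_raw_text=True)
    # Simulate a finished download of the raw-text file, then ask again.
    (Path(cache) / "titles.optional.npz").touch()
    after_download = files_to_download(attributes, cache, load_raw_text=True)

print(without_text, with_text, after_download)
```

A dataset whose metadata has no "optional file" entries falls through the same code path, so requesting raw text for it changes nothing.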