[Feature Update] Optional raw text support

Jn-Huang commented 9 months ago

In this PR, we

Support the saving and loading of raw text in GLI datasets. In metadata, raw text will occupy an entry RawText in parallel with Node, Edge and Graph.
Add helper functions in gli/raw_text_utils.py to process raw text from the source. This file helps with saving datasets that have raw text.
Update the notebook that generates the Cora dataset's metadata to accommodate raw text.
Update test_metadata.py to accommodate cases where metadata.json uses optional file instead of file.
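
The file / optional file distinction that the metadata test has to accept can be sketched as a small check. This is only an illustration and not the code in test_metadata.py: the function name check_attribute_entry is hypothetical, and the rule that an entry carries exactly one of the two keys is an assumption based on the metadata examples in this PR.

```python
def check_attribute_entry(name, entry):
    """Return a list of problems found in one attribute entry of metadata.json.

    Assumption (not confirmed by the GLI code base): an entry points to its
    data through exactly one of "file" or "optional file".
    """
    problems = []
    has_file = "file" in entry
    has_optional = "optional file" in entry
    if has_file == has_optional:  # both present or both missing
        problems.append(
            f"{name}: expected exactly one of 'file' or 'optional file'"
        )
    return problems


# Entries copied from the Cora metadata shown in this PR.
node_feature = {
    "description": "Node features of Cora dataset, 1/0-valued vectors.",
    "type": "int",
    "format": "SparseTensor",
    "file": "cora__graph__Node_NodeFeature__7032c9c380d1889061dcbbcd76b8c427.sparse.npz",
}
raw_text = {
    "description": "Raw text of title, abstract and label of each node in Cora dataset, dict of list of strings.",
    "type": "Dict",
    "format": "Dict[str, list[str]]",
    "optional file": "cora__graph__835178b65ba8cfdfb9c91f33c6260506.optional.npz",
    "key": "RawText_NodeRawText",
}

print(check_attribute_entry("NodeFeature", node_feature))
print(check_attribute_entry("NodeRawText", raw_text))
```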

After this PR is merged, we will update the other datasets that have raw text: ogbn-arxiv, pubmed, ogbn-product, arxiv-2023.

Description (outdated; see the comments below for the newest version)

For dataset contributors with raw text.

Updated Cora metadata comes with an extra entry RawText:

{
    "description": "CORA dataset.",
    "data": {
        "Node": {
            "NodeFeature": {
                "description": "Node features of Cora dataset, 1/0-valued vectors.",
                "type": "int",
                "format": "SparseTensor",
                "file": "cora__graph__Node_NodeFeature__7032c9c380d1889061dcbbcd76b8c427.sparse.npz"
            },
        },
        # Edge, Graph entries..
        <------------ New ------------ >
        "RawText": {
            "NodeRawText": {
                "description": "Raw text of title, abstract and label of each node in Cora dataset, dict of list of strings.",
                "type": "Dict",
                "format": "Dict[str, list[str]]",
                "optional file": "cora__graph__835178b65ba8cfdfb9c91f33c6260506.optional.npz",
                "key": "RawText_NodeRawText"
            }
        }
        <------------ New ------------ >
    }
}

Generating this extra entry is simple. Dataset contributors can save a dictionary of raw texts by passing raw_text_attrs to save_graph:

# <------------ New ------------ >
raw_text_attrs = [
    Attribute(
        "NodeRawText",
        raw_text_dict,
        "Raw text of title, abstract and label of each node in Cora dataset, dict of list of strings.",
        "Dict",
        "Dict[str, list[str]]"
    )
]
# <------------ New ------------ >

metadata = save_graph(
    name="Cora",
    edge=edge,
    num_nodes=graph.num_nodes(),
    node_attrs=node_attrs,
    raw_text_attrs=raw_text_attrs, # <--- New
    description="CORA dataset."
)

More generally, dataset contributors can also define other raw text dictionaries, such as EdgeRawText.

For users who want to load a dataset with raw text.

Users can load the dataset with raw text by passing the optional argument load_raw_text to get_gli_dataset. When set, this argument also downloads the .npz file for raw text if it has not been downloaded yet.

dataset = get_gli_dataset("cora",
                          "NodeClassification",
                          load_raw_text=True, # <--- New
                          verbose=True)
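
The download behavior described above can be sketched roughly as follows. This is not the actual GLI implementation: the function files_to_download, the argument already_downloaded, and the short file names in demo_metadata are all hypothetical. Only the "file" and "optional file" keys come from the metadata in this PR.

```python
def files_to_download(metadata, already_downloaded, load_raw_text=False):
    """List the data files that still have to be fetched.

    Assumption: entries under "file" are always needed, while entries
    under "optional file" are needed only when load_raw_text is True.
    """
    needed = []
    for group in metadata["data"].values():  # e.g. "Node", "Edge", "Graph"
        for entry in group.values():
            if "file" in entry:
                name = entry["file"]
            elif load_raw_text and "optional file" in entry:
                name = entry["optional file"]
            else:
                continue  # optional entry that was not requested
            if name not in already_downloaded and name not in needed:
                needed.append(name)
    return needed


# Placeholder metadata with made-up file names, for illustration only.
demo_metadata = {
    "data": {
        "Node": {
            "NodeFeature": {"file": "features.sparse.npz"},
        },
        "RawText": {
            "NodeRawText": {
                "optional file": "raw_text.optional.npz",
                "key": "RawText_NodeRawText",
            },
        },
    }
}

cached = {"features.sparse.npz"}
print(files_to_download(demo_metadata, cached))
print(files_to_download(demo_metadata, cached, load_raw_text=True))
```

With this logic, a user who never asks for raw text never pays for the extra download, which matches the test cases listed later in this PR.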

The raw text will be returned in the dictionary data.NodeRawText['RawText_NodeRawText']:

data = dataset[0]
for key, item in data.NodeRawText['RawText_NodeRawText'].items():
    print(key, item[:1])

Output:

title ['Title: The megaprior heuristic for discovering protein sequence patterns  ']
abs ['Abstract: Several computer algorithms for discovering patterns in groups of protein sequences are ...']
label ['Neural Networks']

Related Issue

See issue.

Motivation and Context

Support the loading of raw text in GLI framework.

How Has This Been Tested?

Get a dataset, but do not load raw text. Raw text file should not be downloaded.

In [3]: dataset = get_gli_dataset("cora", "NodeClassification", verbose=True)
Saving to: ‘/Users/jinhuang/Documents/research/gli/datasets/Cora/cora__task_node_classification_1__41e167258678b585872679839ce9c40f.npz’

Saving to: ‘/Users/jinhuang/Documents/research/gli/datasets/cora/cora__graph__Node_NodeFeature__7032c9c380d1889061dcbbcd76b8c427.sparse.npz’

Saving to: ‘/Users/jinhuang/Documents/research/gli/datasets/Cora/cora__graph__6c912909fa18eff10797210ea5e485fe.npz’

Saving to: ‘/Users/jinhuang/Documents/research/gli/datasets/Cora/cora__graph__Graph_NodeList__23bbef862fd6037395412eb03b4e1d9c.sparse.npz’

CORA dataset.
All data files already exist. Skip downloading.
Node classification on CORA dataset. Planetoid split.

At the same time the raw text file is not downloaded.

Get a dataset and load raw text. Raw text file should be downloaded.

In [4]: dataset = get_gli_dataset("cora", "NodeClassification", load_raw_text=True, verbose=True)

Saving to: ‘/Users/jinhuang/Documents/research/gli/datasets/cora/cora__graph__835178b65ba8cfdfb9c91f33c6260506.optional.npz’

/Users/jinhuang/Documents/research/gli/datasets/cora/cora__task_node_classification_1__41e167258678b585872679839ce9c40f.npz already exists. Skip downloading.
/Users/jinhuang/Documents/research/gli/datasets/cora/cora__graph__Node_NodeFeature__7032c9c380d1889061dcbbcd76b8c427.sparse.npz already exists. Skip downloading.
/Users/jinhuang/Documents/research/gli/datasets/cora/cora__graph__6c912909fa18eff10797210ea5e485fe.npz already exists. Skip downloading.
/Users/jinhuang/Documents/research/gli/datasets/cora/cora__graph__Graph_NodeList__23bbef862fd6037395412eb03b4e1d9c.sparse.npz already exists. Skip downloading.
CORA dataset.
All data files already exist. Skip downloading.
Node classification on CORA dataset. Planetoid split.

In [6]: data = dataset[0]

In [7]: data.NodeRawText['RawText_NodeRawText'].keys()
Out[7]: dict_keys(['title', 'abs', 'label'])

Load a dataset that has no raw text. The dataset should simply load without raw text.

In [3]: dataset = get_gli_dataset("pubmed", "NodeClassification", load_raw_text=True)

xingjian-zhang commented 9 months ago

Thanks @Jn-Huang for the inputs!

Why do we want to save the optional raw texts into a dictionary? Imo, a more consistent way is to treat every kind of raw text as a normal (node/edge/graph) attribute. For example, if the nodes are papers, then titles are one node attribute, and abstracts are another node attribute.

What do you think?

Jn-Huang commented 9 months ago

@xingjian-zhang Thank you for your comment! Yes, I think this is a good point: saving the node raw texts as extra node attributes will be better. I will update this PR.

Jn-Huang commented 8 months ago

@xingjian-zhang Hi! I have updated the implementation as we discussed. Could you please take a look when you are available? Thanks!

The metadata is now updated as

{
    "description": "CORA dataset.",
    "data": {
        "Node": {
            "NodeFeature": {
                "description": "Node features of Cora dataset, 1/0-valued vectors.",
                "type": "int",
                "format": "SparseTensor",
                "file": "cora__graph__Node_NodeFeature__7032c9c380d1889061dcbbcd76b8c427.sparse.npz"
            },
            <------------ New ------------ >
            "NodeRawTextTitle": {
                "description": "Raw text of title of each node in Cora dataset, list of strings.",
                "type": "str",
                "format": "List[str]",
                "optional file": "cora__graph__Node_NodeRawTextTitle__4a9ad6575f5acfe3b828fe66f072bd5c.optional.npz",
                "key": "Node_NodeRawTextTitle"
            },
            "NodeRawTextAbstract": {
                "description": "Raw text of abstract of each node in Cora dataset, list of strings.",
                "type": "str",
                "format": "List[str]",
                "optional file": "cora__graph__Node_NodeRawTextAbstract__d0e5436087314624c74a9f040d6f394f.optional.npz",
                "key": "Node_NodeRawTextAbstract"
            },
            "NodeRawTextLabel": {
                "description": "Raw text of label of each node in Cora dataset, list of strings.",
                "type": "str",
                "format": "List[str]",
                "optional file": "cora__graph__Node_NodeRawTextLabel__06d184316789acc0902db2b8c1472f95.optional.npz",
                "key": "Node_NodeRawTextLabel"
            }
            <------------ New ------------ >
        },
        # Other Attributes
}

where the raw texts related to nodes are saved as node attributes.

Dataset contributors can save such raw text by defining extra node attributes:

node_attrs = [
    Attribute(
        "NodeFeature",
        node_feats,
        "Node features of Cora dataset, 1/0-valued vectors.",
        "int",
        "SparseTensor",
    ),
    # <------------ New ------------ >
    Attribute(
        "NodeRawTextTitle",
        raw_text_dict["title"],
        "Raw text of title of each node in Cora dataset, list of strings.",
        "str",
        "List[str]"
    ),
    Attribute(
        "NodeRawTextAbstract",
        raw_text_dict["abs"],
        "Raw text of abstract of each node in Cora dataset, list of strings.",
        "str",
        "List[str]"
    ),
    Attribute(
        "NodeRawTextLabel",
        raw_text_dict["label"],
        "Raw text of label of each node in Cora dataset, list of strings.",
        "str",
        "List[str]"
    )
    # <------------ New ------------ >
]
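
For datasets with many raw text fields, the three near-identical entries above could be generated in a loop. The sketch below is only an illustration: AttributeSketch is a hypothetical stand-in for the Attribute helper so the snippet runs on its own, raw_text_attributes and FIELD_SUFFIXES are made-up names, and the length check assumes one string per node.

```python
from collections import namedtuple

# Hypothetical stand-in for the Attribute helper used above, with the same
# positional order: name, data, description, type, format.
AttributeSketch = namedtuple(
    "AttributeSketch", ["name", "data", "description", "type", "format"]
)

# Maps keys of raw_text_dict to the suffix used in the attribute name.
FIELD_SUFFIXES = {"title": "Title", "abs": "Abstract", "label": "Label"}


def raw_text_attributes(raw_text_dict, dataset_name, num_nodes):
    """Build one list-of-strings node attribute per raw text field."""
    attrs = []
    for field, suffix in FIELD_SUFFIXES.items():
        texts = raw_text_dict[field]
        # Assumption: every node has exactly one string per field.
        if len(texts) != num_nodes:
            raise ValueError(
                f"{field}: expected {num_nodes} strings, got {len(texts)}"
            )
        attrs.append(
            AttributeSketch(
                f"NodeRawText{suffix}",
                texts,
                f"Raw text of {suffix.lower()} of each node in "
                f"{dataset_name} dataset, list of strings.",
                "str",
                "List[str]",
            )
        )
    return attrs


# Tiny placeholder input, not real Cora data.
toy_raw_text = {
    "title": ["Title: A", "Title: B"],
    "abs": ["Abstract: a", "Abstract: b"],
    "label": ["Theory", "Neural Networks"],
}
for attr in raw_text_attributes(toy_raw_text, "Cora", num_nodes=2):
    print(attr.name, "-", attr.description)
```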

Users who want to load a dataset with raw text can simply do

dataset = get_gli_dataset("cora",
                          "NodeClassification",
                          load_raw_text=True, # <--- New
                          verbose=True)
data = dataset[0]

And the raw texts are stored in

data.NodeRawTextTitle[0], data.NodeRawTextAbstract[0], data.NodeRawTextLabel[0]

Output:

('Title: The megaprior heuristic for discovering protein sequence patterns  ',
 'Abstract: Several computer algorithms for discovering patterns in groups of protein sequences are in use that are based on fitting the parameters of a statistical model to a group of related sequences. These include hidden Markov model (HMM) algorithms for multiple sequence alignment, and the MEME and Gibbs sampler algorithms for discovering motifs. These algorithms are sometimes prone to producing models that are incorrect because two or more patterns have been combined. The statistical model produced in this situation is a convex combination (weighted average) of two or more different models. This paper presents a solution to the problem of convex combinations in the form of a heuristic based on using extremely low variance Dirichlet mixture priors as part of the statistical model. This heuristic, which we call the megaprior heuristic, increases the strength (i.e., decreases the variance) of the prior in proportion to the size of the sequence dataset. This causes each column in the final model to strongly resemble the mean of a single component of the prior, regardless of the size of the dataset. We describe the cause of the convex combination problem, analyze it mathematically, motivate and describe the implementation of the megaprior heuristic, and show how it can effectively eliminate the problem of convex combinations in protein sequence pattern discovery. ',
 'Neural Networks')

Note that we cannot save raw texts in data.ndata here, because DGL requires each element of ndata to be a tensor, and it is not good practice to store lists of strings as tensors.
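
Since the raw texts live outside ndata as plain Python lists, they can be indexed by node id like any other list. The sketch below assumes that position i of each list corresponds to node i. The data object is a hypothetical stand-in built with SimpleNamespace, the strings are placeholders rather than real Cora entries, and texts_for_nodes is a made-up helper name.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the loaded graph; only raw text attributes
# are modeled, and the strings are placeholders.
data = SimpleNamespace(
    NodeRawTextTitle=["Title: A", "Title: B", "Title: C"],
    NodeRawTextLabel=["Neural Networks", "Theory", "Neural Networks"],
)


def texts_for_nodes(graph, attr_name, node_ids):
    """Gather raw text for the given node ids from a list-of-strings attribute.

    Assumption: the list is aligned with node ids, i.e. texts[i] belongs
    to node i.
    """
    texts = getattr(graph, attr_name)
    return [texts[i] for i in node_ids]


train_ids = [0, 2]
print(texts_for_nodes(data, "NodeRawTextTitle", train_ids))
print(texts_for_nodes(data, "NodeRawTextLabel", train_ids))
```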

Testing

Similar tests were conducted for this version of the implementation.
