CurryTang / Graph-LLM

Exploring the Potential of Large Language Models (LLMs) in Learning on Graphs

Questions about dataset. #7

Closed · Leirunlin closed this 10 months ago

Leirunlin commented 11 months ago

Thanks for sharing the code and datasets! I have some questions about the datasets:

  1. The number of processed edges does not match the PyG version. For Cora, the number of edges in PyG is 10556, but it appears to be 10858 here, which suggests there are duplicated edges that have not been removed. Are there any steps addressing this? If not, I suggest using torch_geometric.utils.coalesce() to remove them (see the sketch after this list).
  2. I tried to reproduce the results for shallow embeddings with the following steps. I first ran generate_pyg_data.py with embedding=['tfidf'] for Cora, CiteSeer, and PubMed and obtained the cora_fixed_tfidf.pt file. I then applied a 2-layer GCN with a standard training pipeline to the obtained .pt dataset. However, for PubMed, the GCN with the obtained embedding only achieves 54% accuracy, far behind the 79% reported. Did I miss any other preprocessing steps (such as normalization)? How can I reproduce the results in Table 1?
  3. Did you try experiments on CiteSeer? And why are there only 3186 nodes in CiteSeer?
  4. The default embeddings for Cora and CiteSeer are BoW. Did you try them as shallow embeddings?
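For item 1, a minimal sketch of the deduplication I have in mind (assuming the preprocessed .pt file stores a standard PyG Data object; the file name is the one from this thread):

```python
import torch
from torch_geometric.utils import coalesce

# Load the preprocessed graph (file name taken from this repo's Google Drive).
data = torch.load('cora_fixed_tfidf.pt')
print('edges before:', data.edge_index.size(1))   # 10858 in the current file

# coalesce() sorts the edge index and drops duplicate (row, col) pairs.
data.edge_index = coalesce(data.edge_index, num_nodes=data.num_nodes)
print('edges after:', data.edge_index.size(1))    # expected 10556, matching PyG's Planetoid Cora
```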
CurryTang commented 10 months ago

Thanks for your interest in our project.

  1. Thanks for the suggestion. We have uploaded a new cora2_fixed_sbert file after removing the duplicate edges and row-sorting the edge index.
  2. I've uploaded the pubmed_fixed_tfidf file to the Google Drive link. You may need to run a parameter sweep to fully reproduce the results (a sketch of such a sweep is included after this list).
  3. We've tried CiteSeer. If you want to use this dataset, we recommend the citeseer2_fixed_sbert file, where we use Graph Cleaner to fix wrong labels. It only has 3186 nodes because we use the raw text files from https://people.cs.ksu.edu/~ccaragea/russir14/lectures/citeseer.txt. This version can be viewed as CiteSeer-TAG, which has some discrepancies compared to the Planetoid one.
  4. For Cora and CiteSeer, we use TF-IDF for the shallow embedding.
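For reference, a minimal sketch of that kind of sweep (not the repo's training script; it assumes pubmed_fixed_tfidf.pt holds a PyG Data object with train/val/test masks, and the grid values are only examples):

```python
import itertools
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

data = torch.load('pubmed_fixed_tfidf.pt')

class GCN(torch.nn.Module):
    def __init__(self, hidden, dropout):
        super().__init__()
        self.conv1 = GCNConv(data.num_features, hidden)
        self.conv2 = GCNConv(hidden, int(data.y.max()) + 1)
        self.dropout = dropout

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=self.dropout, training=self.training)
        return self.conv2(x, edge_index)

def run(lr, hidden, dropout, weight_decay, epochs=200):
    model = GCN(hidden, dropout)
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    best_val, best_test = 0.0, 0.0
    for _ in range(epochs):
        model.train()
        opt.zero_grad()
        out = model(data.x, data.edge_index)
        F.cross_entropy(out[data.train_mask], data.y[data.train_mask]).backward()
        opt.step()
        model.eval()
        with torch.no_grad():
            pred = model(data.x, data.edge_index).argmax(dim=-1)
            val = (pred[data.val_mask] == data.y[data.val_mask]).float().mean().item()
            test = (pred[data.test_mask] == data.y[data.test_mask]).float().mean().item()
        if val > best_val:
            best_val, best_test = val, test
    return best_val, best_test

# Small grid over lr / hidden size / dropout / weight decay.
# Row-normalizing data.x beforehand is another knob worth trying.
for lr, hidden, dropout, wd in itertools.product([0.01, 0.05], [64, 256], [0.5, 0.8], [5e-4, 5e-3]):
    print(lr, hidden, dropout, wd, run(lr, hidden, dropout, wd))
```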
Leirunlin commented 10 months ago

Thanks for the suggestions and updates. I've found that the node IDs in the new preprocessed dataset do not match the PyG ones. By the way, are the node IDs in edge_index, the node labels, and raw_texts consistent within the preprocessed dataset, so that we can directly look up raw_texts by node ID without relabeling?

I also observe that there are duplicate edges in PubMed and CiteSeer. They don't lead to significant performance differences, but I believe it would be better to remove them (a quick check is sketched below).
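For reference, a quick sanity check for both points (it only verifies counts and duplicate edges, not the actual ID alignment; the file names are the ones mentioned in this thread, and I'm assuming each .pt file is a PyG Data object with a raw_texts list attached):

```python
import torch
from torch_geometric.utils import coalesce

for name in ['citeseer2_fixed_sbert.pt', 'pubmed_fixed_tfidf.pt']:
    data = torch.load(name)
    # Length check: x, y, and raw_texts should all be indexed by the same node IDs.
    print(name, 'counts consistent:', len(data.raw_texts) == data.num_nodes == data.y.size(0))
    # Duplicates: coalesce() keeps one copy of each (row, col) pair.
    dedup = coalesce(data.edge_index, num_nodes=data.num_nodes)
    print(name, 'duplicate edges:', data.edge_index.size(1) - dedup.size(1))
```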

JiazhengZhang commented 10 months ago

Hello @CurryTang, thanks again for sharing. I would like to ask some questions about the Cora dataset:

  1. The standard semi-supervised split for Cora is 140 (train) / 500 (val) / 1000 (test), but Cora in this repo uses 140/500/2068. Is there any mapping relationship between the two versions of the dataset?
  2. The node IDs in Cora (this repo) are not the same as in the PyG version, so it is hard to compare with other baselines. Could you suggest a solution?
CurryTang commented 10 months ago

Dear @Leirunlin and @JiazhengZhang,

Thank you for your interest in our project. I have deduplicated the edges in the datasets, and here are the stats:

| Dataset | # Nodes | # Edges | Dataset | # Nodes | # Edges |
| --- | --- | --- | --- | --- | --- |
| Cora | 2708 | 10556 | Cora-TAG | 2708 | 10556 |
| CiteSeer | 3327 | 9104 | CiteSeer-TAG | 3186 | 8450 |
| PubMed | 19717 | 88648 | PubMed-TAG | 19717 | 88648 |

Unfortunately, it's not feasible to align the current node IDs in the TAG version of Cora/CiteSeer with those in the PyG version, for a couple of key reasons:

  1. Inconsistencies in the raw data: for the CiteSeer dataset, we've noticed discrepancies in the raw texts, particularly in the number of edges/nodes, which do not precisely align.
  2. Anonymity in the original PyG version: the PyG version is anonymized, making it impossible to accurately match each sample on an individual basis.

However, in our benchmark studies, we observed the following:

  1. Cora benchmarking: the results for Cora on the TAG version are nearly identical to those obtained on the PyG version.
  2. CiteSeer variations: there are noticeable differences in the results when benchmarking the two versions of CiteSeer. For a thorough comparison, the classic models would need to be rerun on the TAG version.

For the training masks, we reassign them in get_dataset according to the different splits (a sketch of this kind of mask reassignment is included below).
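A minimal sketch of such a reassignment (not the repo's get_dataset; it uses 20 labels per class for train, 500 nodes for val, and all remaining nodes for test, which also explains the 140/500/2068 Cora split: 2708 - 140 - 500 = 2068):

```python
import torch

def reassign_masks(data, num_train_per_class=20, num_val=500, seed=0):
    g = torch.Generator().manual_seed(seed)
    num_nodes = data.num_nodes
    train_mask = torch.zeros(num_nodes, dtype=torch.bool)
    # Sample a fixed number of training nodes per class.
    for c in range(int(data.y.max()) + 1):
        idx = (data.y == c).nonzero(as_tuple=False).view(-1)
        idx = idx[torch.randperm(idx.size(0), generator=g)[:num_train_per_class]]
        train_mask[idx] = True
    # Split the remaining nodes into validation and test.
    remaining = (~train_mask).nonzero(as_tuple=False).view(-1)
    remaining = remaining[torch.randperm(remaining.size(0), generator=g)]
    val_mask = torch.zeros(num_nodes, dtype=torch.bool)
    test_mask = torch.zeros(num_nodes, dtype=torch.bool)
    val_mask[remaining[:num_val]] = True
    test_mask[remaining[num_val:]] = True   # all leftover nodes go to test
    data.train_mask, data.val_mask, data.test_mask = train_mask, val_mask, test_mask
    return data
```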

zhongjian-zhang commented 7 months ago

Hi @CurryTang, in the Cora dataset I found that the raw_texts of some nodes are very short, such as Node 1408: ' Learning logical definitions from relations. : ' and Node 2496: ' Alternative error bounds for the classifier chosen by early stopping, : '. So is there a missing raw_text issue here?

CurryTang commented 7 months ago

> Hi @CurryTang, in the Cora dataset I found that the raw_texts of some nodes are very short, such as Node 1408: ' Learning logical definitions from relations. : ' and Node 2496: ' Alternative error bounds for the classifier chosen by early stopping, : '. So is there a missing raw_text issue here?

In the raw texts, there's only a title but no content for these entries.
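A quick scan along these lines can list such entries (the file name and the 10-token threshold are just placeholders):

```python
import torch

data = torch.load('cora_fixed_tfidf.pt')
# Collect nodes whose raw text is suspiciously short (likely title-only entries).
short = [(i, t) for i, t in enumerate(data.raw_texts) if len(t.split()) < 10]
print(len(short), 'nodes with fewer than 10 tokens, e.g.', short[:3])
```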

Lukangkang123 commented 6 days ago

> Hello @CurryTang, thanks again for sharing. I would like to ask some questions about the Cora dataset:
>
>   1. The standard semi-supervised split for Cora is 140 (train) / 500 (val) / 1000 (test), but Cora in this repo uses 140/500/2068. Is there any mapping relationship between the two versions of the dataset?
>   2. The node IDs in Cora (this repo) are not the same as in the PyG version, so it is hard to compare with other baselines. Could you suggest a solution?

Although the node IDs in this repo are different from those in PyG, their degree distributions are the same, so for permutation-equivariant GNN models there should be no difference in performance (a quick check is sketched below).
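A quick way to verify that (the repo file name is an assumption, and the Planetoid root path is arbitrary):

```python
import torch
from torch_geometric.datasets import Planetoid
from torch_geometric.utils import degree

tag = torch.load('cora_fixed_tfidf.pt')                  # this repo's preprocessed Cora
pyg = Planetoid(root='/tmp/Planetoid', name='Cora')[0]   # standard PyG Planetoid Cora

# Compare the sorted degree sequences of the two graphs.
deg_tag = degree(tag.edge_index[0], num_nodes=tag.num_nodes)
deg_pyg = degree(pyg.edge_index[0], num_nodes=pyg.num_nodes)
print(torch.equal(deg_tag.sort().values, deg_pyg.sort().values))  # True if the degree multisets match
```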