Leirunlin closed this issue 10 months ago
Thanks for your interest in our projects.

- `cora2_fixed_sbert`: uploaded after removing the duplicate edges and row-sorting the edge index.
- `pubmed_fixed_tfidf`: uploaded the file to the Google Drive link. I think you may need to run some param sweeps to fully reproduce the results.
- `citeseer2_fixed_sbert`: for this file, we use Graph Cleaner to fix wrong labels. It only has 3186 nodes since we use the raw text files from https://people.cs.ksu.edu/~ccaragea/russir14/lectures/citeseer.txt. This version can be viewed as CiteSeer-TAG, which has some discrepancies compared to the Planetoid one.

Thanks for your suggestions and updates. I've found the bug: the node IDs in the new preprocessed dataset do not match the PyG ones. By the way, are the node IDs in `edge_index`, the node labels, and `raw_texts` consistent within the preprocessed dataset? If so, we could directly use `raw_texts` by querying the node ID without relabeling.
And I observe that there are also duplicate edges in PubMed and CiteSeer. It doesn't lead to significant performance differences, but I believe it would be better to remove them.
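The deduplication and row-sorting described above can be sketched in plain Python (this is an illustrative sketch, not the repo's actual preprocessing code, and it operates on a list-of-lists edge index rather than a tensor):

```python
# Sketch (not the repo's actual code): remove duplicate edges and
# row-sort an edge index stored as two parallel lists [[src], [dst]].
def dedup_and_sort(edge_index):
    """Return a deduplicated, row-sorted copy of a directed edge list."""
    src, dst = edge_index
    # A set keeps each directed edge (u, v) only once;
    # sorted() orders by source node first, then target node.
    unique_edges = sorted(set(zip(src, dst)))
    return [[u for u, _ in unique_edges], [v for _, v in unique_edges]]

# Toy example with a duplicated edge (0, 1):
ei = [[1, 0, 0, 2], [2, 1, 1, 0]]
print(dedup_and_sort(ei))  # [[0, 1, 2], [1, 2, 0]]
```

In practice, with a PyG tensor `edge_index`, `torch_geometric.utils.coalesce` performs the same deduplicate-and-sort operation in one call.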
Hello @CurryTang, thanks again for sharing. I would like to ask you some questions about the Cora dataset:
Dear @Leirunlin and @JiazhengZhang,
Thank you for your interest in our projects. I have deduplicated the edges in the datasets; here are the stats:
| Dataset | edge number | node number | Dataset | edge number | node number |
|---|---|---|---|---|---|
| Cora | 10556 | 2708 | Cora-TAG | 10556 | 2708 |
| CiteSeer | 9104 | 3327 | CiteSeer-TAG | 8450 | 3186 |
| Pubmed | 88648 | 19717 | Pubmed-TAG | 88648 | 19717 |
I regret to inform you that it's not feasible to align the current IDs in the TAG version of Cora/CiteSeer with those in the PyG version, for a couple of key reasons:

- Inconsistencies in the raw data: for the CiteSeer dataset, we've noticed discrepancies in the raw texts, particularly in the number of edges/nodes, which do not precisely align.
- Anonymity in the original PyG version: the PyG version is anonymized, making it impossible to accurately match each sample on an individual basis.
However, in our benchmark studies, we observed the following:
- Cora benchmarking: the results for Cora on the TAG version are nearly identical to those obtained on the PyG version.
- CiteSeer variations: there are noticeable differences in the results when benchmarking the two versions of CiteSeer. For a thorough comparison, the classic models should be rerun on the TAG version.
For the training masks, we reassign them in `get_dataset` according to the different splits.
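The reassignment of split masks can be illustrated with a minimal sketch (this is hypothetical code, not the repo's `get_dataset`; it only shows how boolean masks of the stated sizes could be built over consecutive node IDs):

```python
# Hypothetical sketch: build train/val/test boolean masks over node IDs.
# If num_test is None, all remaining nodes go to the test set,
# which is how a 140/500/2068 split arises for 2708 Cora nodes.
def make_masks(num_nodes, num_train, num_val, num_test=None):
    if num_test is None:
        num_test = num_nodes - num_train - num_val
    train = [i < num_train for i in range(num_nodes)]
    val = [num_train <= i < num_train + num_val for i in range(num_nodes)]
    test = [num_train + num_val <= i < num_train + num_val + num_test
            for i in range(num_nodes)]
    return train, val, test

train, val, test = make_masks(2708, 140, 500)
print(sum(train), sum(val), sum(test))  # 140 500 2068
```

Passing `num_test=1000` instead would reproduce the standard Planetoid-style 140/500/1000 split.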
Hi @CurryTang, in the Cora dataset I found that the raw_texts of some nodes are very short, for example: Node 1408: 'Learning logical definitions from relations. :' and Node 2496: 'Alternative error bounds for the classifier chosen by early stopping, :'. Is there a missing raw_text issue here?
In the raw texts, there's only title but no content for these entries.
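Title-only entries like these can be flagged programmatically. The sketch below is an assumption-laden illustration (the 100-character threshold is an arbitrary choice, and `short_text_nodes` is a hypothetical helper, not part of the repo):

```python
# Sketch: flag nodes whose raw text is suspiciously short (likely title-only).
# The 100-character threshold is an arbitrary illustrative choice.
def short_text_nodes(raw_texts, min_chars=100):
    return [i for i, t in enumerate(raw_texts) if len(t.strip()) < min_chars]

texts = [
    "Learning logical definitions from relations. : ",  # title only
    "A long abstract " * 20,                            # title plus content
]
print(short_text_nodes(texts))  # [0]
```

Running this over the full `raw_texts` list would give a quick census of how many nodes carry a title but no abstract.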
Hello @CurryTang, thanks again for sharing. I would like to ask you some questions about the Cora dataset:
- The standard Cora split in the semi-supervised setting is 140 (train) / 500 (val) / 1000 (test), but Cora in this repo uses 140/500/2068. Is there any mapping between the two versions of the dataset?
- The node IDs in Cora (this repo) are not the same as in the PyG version, so it is hard to compare with other baselines. Could you suggest a solution?
Although the node IDs in this repo differ from those in PyG, the degree distributions are the same, so for permutation-equivariant GNN models there should be no difference in performance.
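The degree-distribution claim can be checked with a small sketch: relabeling nodes permutes which node has which degree, but leaves the sorted degree sequence unchanged. (This is illustrative code under that assumption, not from the repo, and it reads a directed edge list as `[[src], [dst]]`.)

```python
from collections import Counter

# Sketch: two graphs with differently numbered nodes have the same
# degree distribution iff their sorted degree sequences match.
def degree_sequence(edge_index):
    """Sorted multiset of out-degrees from a directed edge list."""
    deg = Counter(edge_index[0])  # out-degree per source node
    return sorted(deg.values())

# The same graph under two different node numberings
# (relabeling 0 -> 2, 1 -> 0, 2 -> 1):
g1 = [[0, 0, 1, 2], [1, 2, 0, 0]]
g2 = [[2, 2, 0, 1], [0, 1, 2, 2]]
print(degree_sequence(g1) == degree_sequence(g2))  # True
```

Note that equal degree sequences are necessary but not sufficient for the two graphs to be the same up to relabeling, so this is a sanity check rather than a proof of isomorphism.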
Thanks for sharing the code and datasets! I have some questions about the datasets: