[Dataset] IGB-HOM dataset wrong number of edges

IllinoisGraphBenchmark / IGB-Datasets

Largest real-world open-source graph dataset. Work done under the IBM-Illinois Discovery Accelerator Institute and Amazon Research Awards, in collaboration with NVIDIA Research.

https://arxiv.org/abs/2302.13522

[Dataset] IGB-HOM dataset wrong number of edges #55

Closed BowenYao18 closed 1 month ago

BowenYao18 commented 1 month ago

Describe the bug: The paper reports 3995777033 edges, but the edge file I downloaded contains only 3727095830.

To Reproduce

Download the file with: wget https://igb-public-awsopen.s3.amazonaws.com/IGBH/processed/paper__cites__paper/edge_index.npy Then load it with "arr = np.load("/path/to/dataset", mmap_mode='r+')" and check "arr.shape".

Expected behavior: The loaded array has shape (3727095830, 2), which does not match the 3995777033 edges reported in the paper. This is the link to the paper: https://arxiv.org/pdf/2302.13522
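For reference, a minimal sketch of the check described above, assuming NumPy is installed. The helper name `edge_count` is a placeholder introduced here, not part of the IGB tooling. It opens the file read-only through a memory map, so the multi-billion-row array is not read into RAM just to inspect its shape.

```python
import numpy as np

def edge_count(path):
    """Return the number of rows (edges) in an edge_index .npy file."""
    # mmap_mode="r" maps the file read-only instead of loading it into memory.
    edges = np.load(path, mmap_mode="r")
    # The file is expected to hold one (src, dst) pair per row.
    if edges.ndim != 2 or edges.shape[1] != 2:
        raise ValueError(f"unexpected shape {edges.shape}")
    return edges.shape[0]
```

Calling `edge_count("/path/to/edge_index.npy")` on the downloaded file should reproduce the count from `arr.shape[0]`.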

Screenshots This is the IGB-HOM info table:

akhatua2 commented 1 month ago

I will update the edge file on the s3 bucket with the local copy soon. It should have the right number of edges (3995777033). Thanks for bringing this to our attention.

BowenYao18 commented 1 month ago

Thank you. Once the dataset is updated, will I be able to download it through the original link? wget https://igb-public-awsopen.s3.amazonaws.com/IGBH/processed/paper__cites__paper/edge_index.npy

akhatua2 commented 1 month ago

Hi, if it's urgent, please use this file as a temporary solution. It contains the last 268,681,203 edges. I will upload the edge_index.npy file to S3 as soon as I can.

BowenYao18 commented 1 month ago

Is the update you are making simply a removal of self edges followed by adding a self edge for each node? edges = add_self_edges(remove_self_edge(edges))

akhatua2 commented 1 month ago

No, these should be edges between different nodes. You can run edges = add_self_edges(remove_self_edge(edges)) as a preprocessing step for your use case.
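The thread does not show how `add_self_edges` and `remove_self_edge` are implemented, so the following is only a sketch of what such helpers might look like in NumPy. It assumes `edges` has shape (num_edges, 2) and that node IDs run from 0 to `num_nodes - 1`; the `num_nodes` parameter is an assumption introduced here, not something the thread specifies.

```python
import numpy as np

def remove_self_edge(edges):
    """Drop rows whose source and destination are the same node."""
    edges = np.asarray(edges)
    return edges[edges[:, 0] != edges[:, 1]]

def add_self_edges(edges, num_nodes):
    """Append one (i, i) edge for every node i in [0, num_nodes)."""
    ids = np.arange(num_nodes, dtype=edges.dtype)
    loops = np.stack([ids, ids], axis=1)
    return np.concatenate([edges, loops], axis=0)

# Toy example: 3 nodes, one existing self edge (1, 1).
edges = np.array([[0, 1], [1, 1], [2, 0]], dtype=np.int64)
edges = add_self_edges(remove_self_edge(edges), num_nodes=3)
print(edges.shape)  # (5, 2)
```

Removing first and then adding guarantees exactly one self edge per node, whether or not the input already contained some. On the full graph both steps materialize new arrays in memory, so a chunked or out-of-core variant may be needed.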

BowenYao18 commented 1 month ago

Thank you. Also, I assume the het dataset has different paper__cites__paper edges from hom? Will you also update the igb-het paper__cites__paper edges?

akhatua2 commented 1 month ago

Both the datasets have the same paper nodes and paper_edges. The het dataset just has more types of nodes and types of edges.

You can reuse the same edges for both datasets.

BowenYao18 commented 1 month ago

However, the repo states that there are 3995777033 edges for igb-hom and 3996442004 edges for igb-het. Are they supposed to be different sets of edges?

BowenYao18 commented 1 month ago

Also, I don't know if this is a coincidence, but if you run edges = add_self_edges(remove_self_edge(edges)) on the paper__cites__paper edges of igb-hom, the result has 3996442004 edges (the number of edges listed for igb-het above).
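A quick arithmetic check on the counts quoted in this thread, as a sketch; the variable names are illustrative only.

```python
paper_count = 3_995_777_033       # edges reported in the paper
downloaded_count = 3_727_095_830  # rows in the downloaded edge_index.npy
het_count = 3_996_442_004         # listed for igb-het; also the count after the self-edge step

# Gap between the paper and the downloaded file.
missing = paper_count - downloaded_count
print(missing)  # 268681203

# Net change from the self-edge step: self edges added minus self edges removed.
net_self_edges = het_count - downloaded_count
print(net_self_edges)  # 269346174
```

The first difference equals the 268,681,203 edges in the temporary file mentioned earlier. The second is the net number of edges the self-edge step contributes to the downloaded file.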

akhatua2 commented 1 month ago

I believe 3996442004 is the total number of edges we finally published (including the self edges). There are some inconsistencies between parts of the repo and the paper because of differences between internal versions of the full dataset.

The het and hom datasets should have the same number of paper edges. We used the edges + self loops count for our initial benchmark runs.

Thanks for pointing it out so I could take a second look. You shouldn't need the extra edges, as they shouldn't be part of the final dataset. Please use edges = add_self_edges(remove_self_edge(edges)); this will give the expected edges for the full dataset (homogeneous and heterogeneous).