Closed BowenYao18 closed 1 month ago
I will update the edge file on the s3 bucket with the local copy soon. It should have the right number of edges (3995777033). Thanks for bringing this to our attention.
Thank you. Once the dataset is updated, will I be able to download it through the original link? wget https://igb-public-awsopen.s3.amazonaws.com/IGBH/processed/paper__cites__paper/edge_index.npy
Hi, if it's urgent, please use this file as a temporary solution. It contains the last 268,681,203 edges. I will upload the edge_index.npy file to the s3 bucket as soon as I can.
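For anyone applying the temporary fix, here is a minimal sketch of the merge. It assumes both files load as NumPy arrays of shape (num_edges, 2), as reported in the bug description. The helper name append_missing_edges and the tiny arrays standing in for the real files are hypothetical. The counts quoted in this thread are consistent: 3727095830 + 268681203 = 3995777033.

```python
import numpy as np

# Edge counts quoted in this thread.
DOWNLOADED_EDGES = 3_727_095_830  # rows in the file currently on S3
MISSING_EDGES = 268_681_203       # rows in the temporary file
EXPECTED_EDGES = 3_995_777_033    # count reported in the paper


def append_missing_edges(edges, tail):
    """Stack the missing tail after the downloaded edges (hypothetical helper)."""
    if edges.ndim != 2 or tail.ndim != 2 or edges.shape[1] != 2 or tail.shape[1] != 2:
        raise ValueError("expected arrays of shape (num_edges, 2)")
    return np.concatenate([edges, tail], axis=0)


# Tiny stand-ins for the real arrays (assumption: rows are (src, dst) pairs).
edges = np.array([[0, 1], [1, 2], [2, 0]], dtype=np.int64)
tail = np.array([[3, 1], [3, 2]], dtype=np.int64)

merged = append_missing_edges(edges, tail)
counts_match = DOWNLOADED_EDGES + MISSING_EDGES == EXPECTED_EDGES
print(merged.shape, counts_match)
```

The real arrays are far too large to concatenate comfortably in memory on most machines, so in practice the merged result would probably have to be written into a preallocated memory-mapped file instead.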
Are the edges you are updating simply the result of removing self edges and then adding a self edge for each node? edges = add_self_edges(remove_self_edge(edges))
No, these should be edges between different nodes. You can run edges = add_self_edges(remove_self_edge(edges)) as a preprocessing step for your use case.
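A minimal NumPy sketch of what that preprocessing step could look like. The thread does not show how add_self_edges and remove_self_edge are implemented, so the bodies below are assumptions. In particular, the extra num_nodes argument and the (num_edges, 2) layout are illustrative choices, not the repo's API.

```python
import numpy as np


def remove_self_edge(edges):
    """Drop every row whose source and destination are the same node."""
    return edges[edges[:, 0] != edges[:, 1]]


def add_self_edges(edges, num_nodes):
    """Append exactly one (i, i) edge for each node id in [0, num_nodes)."""
    ids = np.arange(num_nodes, dtype=edges.dtype)
    loops = np.stack([ids, ids], axis=1)
    return np.concatenate([edges, loops], axis=0)


# Toy graph with 4 nodes; node 3 has no edges at all.
edges = np.array([[0, 1], [1, 1], [2, 0], [2, 2]], dtype=np.int64)
num_nodes = 4

# Removing first guarantees that no node ends up with two self edges.
out = add_self_edges(remove_self_edge(edges), num_nodes)
print(out.shape)
```

Removing before adding is what makes the result independent of how many self edges the input already had.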
Thank you. Also, I assume the het dataset has different paper__cites__paper edges from hom? Will you also update the igb-het paper__cites__paper edges?
Both datasets have the same paper nodes and paper_edges. The het dataset just has more types of nodes and more types of edges. You can reuse the same edges for both datasets.
However, the repo states that there are 3995777033 edges for igb-hom and 3996442004 edges for igb-het. Should they be different sets of edges?
Also, I don't know if this is a coincidence, but if you run edges = add_self_edges(remove_self_edge(edges)) on the paper__cites__paper edges of igb-hom, the result has 3996442004 edges (the number given for igb-het above).
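One way to see why the edge count changes under this step: existing self edges are dropped and one self edge per node is added, so the result has (edges - existing self edges + nodes) rows. The sketch below computes that count without building the new array. It assumes add_self_edges adds exactly one self edge per node and that the array has shape (num_edges, 2). The function name is hypothetical.

```python
import numpy as np


def count_after_self_edge_step(edges, num_nodes):
    """Predict the row count of add_self_edges(remove_self_edge(edges)).

    Assumes exactly one self edge is added per node (an assumption about
    the helper, which is not shown in this thread).
    """
    num_self = int(np.count_nonzero(edges[:, 0] == edges[:, 1]))
    return edges.shape[0] - num_self + num_nodes


# Toy graph: 4 edges, 2 of them self edges, 4 nodes in total.
edges = np.array([[0, 1], [1, 1], [2, 0], [2, 2]], dtype=np.int64)
predicted = count_after_self_edge_step(edges, num_nodes=4)
print(predicted)
```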
I believe the number 3996442004 should be the total number of edges we finally published (including the self edges). There are some inconsistencies between parts of the repo and the paper because of differences between internal versions of the full dataset.
Thanks for pointing it out so I could take a second look. You shouldn't need the extra edges, as they shouldn't be part of the final dataset. Please use edges = add_self_edges(remove_self_edge(edges)), and this will give the expected edges for the full dataset (homogeneous and heterogeneous).
Describe the bug The paper reports 3995777033 edges, but the file I downloaded contains 3727095830 edges.
To Reproduce
Below is the download command: wget https://igb-public-awsopen.s3.amazonaws.com/IGBH/processed/paper__cites__paper/edge_index.npy Then load the downloaded file with arr = np.load("/path/to/dataset", mmap_mode='r+') and check arr.shape.
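The check can be sketched as a small script. It memory-maps the file read-only, so the array is never loaded into RAM, and compares the row count with the figure from the paper. The helper name edge_count and the temporary toy file are hypothetical stand-ins for the real download. The (num_edges, 2) layout is taken from the shape reported in this issue.

```python
import os
import tempfile

import numpy as np

EXPECTED_EDGES = 3_995_777_033  # count reported in the paper


def edge_count(path):
    """Return the number of rows of a (num_edges, 2) .npy file via a memory map."""
    arr = np.load(path, mmap_mode="r")  # read-only; data stays on disk
    if arr.ndim != 2 or arr.shape[1] != 2:
        raise ValueError(f"unexpected shape {arr.shape}")
    return int(arr.shape[0])


# Toy file standing in for the downloaded edge_index.npy.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "edge_index.npy")
    np.save(path, np.array([[0, 1], [1, 2], [2, 0]], dtype=np.int64))
    n = edge_count(path)

print(n, n == EXPECTED_EDGES)
```

Using mmap_mode="r" rather than "r+" avoids any risk of modifying the downloaded file while inspecting it.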
Expected behavior The loaded array has shape (3727095830, 2), which does not match the 3995777033 edges reported in the paper. This is the link to the paper: https://arxiv.org/pdf/2302.13522
Screenshots This is the IGB-HOM info table: