Question on WIKI and Reddit Data

amazon-science / tgl

Apache License 2.0


Question on WIKI and Reddit Data #14

Closed yw6vp closed 2 years ago

yw6vp commented 2 years ago

Hello, my question is closely related to this issue: https://github.com/amazon-research/tgl/issues/5.

Basically, I'd like to confirm that my understanding is correct. The issue above says that both the WIKI and Reddit graphs are undirected. Taking the WIKI data as an example, does that mean that if a user U edited a page P at time T, there will be two edges in edges.csv: (1) U as source node and P as destination node with timestamp T, and (2) P as source node and U as destination node with timestamp T? So if we start with a bipartite graph where source nodes are always of one type and destination nodes are always of the other type, we need to preprocess the graph to add a reverse copy of each edge, is that correct?

tedzhouhk commented 2 years ago

Yes, this is correct. If we do not add the reverse edges, the nodes in one partition would never have any neighbors.
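To make the point concrete, here is a minimal sketch using a toy edge list. The node ids, the `(src, dst, time)` tuple layout, and the helper names `add_reverse_edges` and `out_neighbors` are all assumptions made for illustration and are not taken from the TGL code base.

```python
from collections import defaultdict

# Hypothetical toy interactions (src, dst, time): users are 0-1, pages are 10-11.
edges = [(0, 10, 1.0), (1, 10, 2.0), (0, 11, 3.0)]

def add_reverse_edges(edges):
    """Return the edge list with a reverse copy (dst, src, time) of every edge."""
    return edges + [(dst, src, t) for src, dst, t in edges]

def out_neighbors(edges):
    """Map each source node to the list of (neighbor, time) pairs it points to."""
    nbrs = defaultdict(list)
    for src, dst, t in edges:
        nbrs[src].append((dst, t))
    return dict(nbrs)

directed = out_neighbors(edges)
undirected = out_neighbors(add_reverse_edges(edges))

print(10 in directed)   # False: page 10 never appears as a source
print(undirected[10])   # [(0, 1.0), (1, 2.0)]: both users are now reachable
```

In the directed version the page nodes have no outgoing edges at all, which is the "never have neighbors" situation described above.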

yw6vp commented 2 years ago

Thank you, that makes sense.

yw6vp commented 2 years ago

Hello again, I downloaded edges.csv for WIKI using the provided code in down.sh. As I understood from our previous conversation, edges.csv should already contain reverse links: the edge 1 (src) -> 10 (dst) should have a reverse copy 10 (src) -> 1 (dst). But in the downloaded edges.csv, the set of source nodes has no overlap with the set of destination nodes, which indicates that no reverse links have been added. Can you help me understand how you make sure the WIKI graph is undirected? Thanks!
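A check like the one described above could look roughly as follows. The in-memory CSV and the column names `src`, `dst`, and `time` are assumptions standing in for the real file; `src_dst_overlap` is a hypothetical helper.

```python
import csv
import io

# Hypothetical stand-in for edges.csv; the column names are assumptions.
sample = io.StringIO("src,dst,time\n1,10,0.0\n2,10,5.0\n1,11,7.0\n")

def src_dst_overlap(fileobj):
    """Return the node ids that appear both as a source and as a destination."""
    src_nodes, dst_nodes = set(), set()
    for row in csv.DictReader(fileobj):
        src_nodes.add(int(row["src"]))
        dst_nodes.add(int(row["dst"]))
    return src_nodes & dst_nodes

overlap = src_dst_overlap(sample)
print(overlap)  # an empty set means the file stores no reverse copies
```

To run it against a real file, pass an open file handle instead of the `io.StringIO` object and adjust the column names to match the file's header.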

tedzhouhk commented 2 years ago

Hi, edges.csv does not contain the reverse links. They are added when the T-CSR data structure is generated (the "--add_reverse" flag in gen_graph.py).
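As a rough illustration of what such a flag could do, the sketch below builds a CSR-style temporal adjacency from a directed edge list and optionally inserts the reverse copies. This is not the actual gen_graph.py code: `build_tcsr`, the four-array layout, and the choice to let a reverse copy share the original edge id are assumptions made for this example.

```python
def build_tcsr(edges, num_nodes, add_reverse=False):
    """Build (indptr, indices, times, eids) with each node's neighbors sorted by time."""
    adj = [[] for _ in range(num_nodes)]
    for eid, (src, dst, t) in enumerate(edges):
        adj[src].append((t, dst, eid))
        if add_reverse:
            # Assumption: the reverse copy reuses the original edge id and timestamp.
            adj[dst].append((t, src, eid))

    indptr, indices, times, eids = [0], [], [], []
    for nbrs in adj:
        for t, nbr, eid in sorted(nbrs):
            indices.append(nbr)
            times.append(t)
            eids.append(eid)
        # indptr[v]:indptr[v + 1] is the slice holding node v's neighbors.
        indptr.append(len(indices))
    return indptr, indices, times, eids

# Hypothetical toy graph: nodes 0-1 are users, nodes 2-3 are pages.
edges = [(0, 2, 1.0), (1, 2, 2.0), (0, 3, 3.0)]
indptr, indices, times, eids = build_tcsr(edges, num_nodes=4, add_reverse=True)

print(indptr)   # [0, 2, 3, 5, 6]
print(indices)  # [2, 3, 2, 0, 1, 0]
```

Without `add_reverse=True`, the slices for the page nodes 2 and 3 would be empty, while the input edge list itself stays unchanged in both cases.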

yw6vp commented 2 years ago

Got it, I was actually just checking gen_graph.py and saw that option. Thanks for the really quick response!

So just to confirm: even after data preprocessing, edges.csv doesn't have reverse links, and the T-CSR data structures (all the *.npz files) are the only files containing reverse links, right?

Then in train.py, only the samplers are aware of the reverse links, so they can collect neighbors for all node types, while the rest of the code that iterates through edges just follows edges.csv chronologically, correct?

tedzhouhk commented 2 years ago

Right.
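The split confirmed above can be sketched as two separate structures: a neighbor index that includes reverse links, and a loop that only walks the original directed edges in time order. Everything here is hypothetical (the toy edges, the `sample_before` helper, and the choice to return only neighbors strictly earlier than the query time); it does not reproduce the samplers or train.py from the repository.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical directed edges (src, dst, time), already sorted by time.
edges = [(0, 10, 1.0), (1, 10, 2.0), (0, 11, 3.0)]

# Sampler-side structure: per-node neighbor lists that include reverse links.
nbr_ids = defaultdict(list)
nbr_ts = defaultdict(list)
for src, dst, t in edges:
    nbr_ids[src].append(dst)
    nbr_ts[src].append(t)
    nbr_ids[dst].append(src)
    nbr_ts[dst].append(t)

def sample_before(node, t):
    """Return the neighbors of `node` whose interaction happened strictly before t."""
    cut = bisect_left(nbr_ts[node], t)
    return nbr_ids[node][:cut]

# Training-side loop: follows the original directed edges chronologically.
history = []
for src, dst, t in edges:
    history.append((src, dst, t, sample_before(src, t), sample_before(dst, t)))

for record in history:
    print(record)
```

Here the loop visits three edges, not six, yet the destination node 10 can still retrieve user 0 as a past neighbor at time 2.0 because the neighbor index holds the reverse link.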