CongWeilin / GraphMixer


possible data leakage #4

Open · towardsagi opened this issue 1 year ago

towardsagi commented 1 year ago

Hi authors, thanks for your excellent work. However, I have run into some trouble reproducing the results reported in the paper. I found two points that may cause data leakage:

1. Data leakage from the adjacency matrix

https://github.com/CongWeilin/GraphMixer/blob/bbdd9923706a02d0619dfd00ef7880911f003a65/link_pred_train_utils.py#L200-L246 In line 239, the one-hot node features are multiplied by the normalized adjacency matrix, which makes adj_norm part of the input node features. However, adj_norm contains link information that we are going to predict! Therefore, I tried changing line 239 to:

sign_feats.append(sign_feats[-1])
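
For context, here is a rough sketch of the SIGN-style propagation pattern I am referring to (variable names and shapes are illustrative, not the exact code in link_pred_train_utils.py):

import torch

num_nodes, hops = 5, 2
feats = torch.eye(num_nodes)                 # one-hot node features
adj_norm = torch.rand(num_nodes, num_nodes)  # stand-in for the normalized adjacency

sign_feats = [feats]
for _ in range(hops):
    # Original behavior around line 239: propagate features through adj_norm.
    # If adj_norm were built from edges that include the links to be predicted,
    # those links would leak into the node features here.
    sign_feats.append(adj_norm @ sign_feats[-1])
    # The fix above replaces this line with: sign_feats.append(sign_feats[-1])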

2. Data leakage from binary search

In sampler_core.cpp, a binary search is used to accelerate the selection of recent historical neighbors. However, positive and negative sampling should be handled differently: for a positive pair, the neighbor boundary falls exactly on the target timestamp, whereas for a negative pair the boundary is strictly before the target timestamp. The sampler logic does not account for this:

https://github.com/CongWeilin/GraphMixer/blob/bbdd9923706a02d0619dfd00ef7880911f003a65/sampler_core.cpp#L295-L306

This issue changes the maximum number of historical neighbors between positive and negative samples (max_num_positive + 1 = max_num_negative). It may not matter much for other algorithms, as long as the sampled neighbors are correct. However, for GraphMixer with zero-padded inputs, this difference can easily be picked up by the MLP-Mixer block.
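To make the off-by-one concrete, here is a minimal Python illustration using bisect as a stand-in for the C++ binary search (the timestamps are made up): when an event sits at exactly the query time, as it may for the positive dst node whose link is being predicted, the two boundary conventions differ by exactly one neighbor.

import bisect

# Hypothetical sorted neighbor timestamps for one node; query time ts = 5.
# A positive dst node may have an edge at exactly ts = 5 (the link being
# predicted); a negative dst node generally does not.
neighbor_ts = [1, 2, 3, 5]

# bisect_left excludes events at exactly ts; bisect_right includes them.
print(bisect.bisect_left(neighbor_ts, 5))   # 3 -> three strictly earlier neighbors
print(bisect.bisect_right(neighbor_ts, 5))  # 4 -> the boundary edge is counted too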

Therefore, I tried inserting the following code to change the for-loop:

int _offset = 0;
if (ts[e_search] == nts + offset) _offset = 1;  // positive is on the boundary, so add one neighbor
// no sampling, pick recent neighbors
for (EdgeIDType k = e_search; k > std::max(s_search, e_search - neighs - _offset); k--)

After applying the above two fixes, I reran the code on the WIKI and LASTFM datasets. For WIKI, the average precision (AP) drops from 99.85 to 99.04; for LASTFM, it drops from 96.31 to 86.60. I spent a lot of time reproducing this project, so I am submitting this issue to share my thoughts with the authors and other readers, and to discuss whether the above data leakage is real and whether there are more leakage points yet to be discovered.

towardsagi commented 1 year ago

@CongWeilin Hi, is there any progress?

CongWeilin commented 1 year ago

Thank you for carefully checking the code.

For your first question: the edge_index is constructed from edges that happen before the current timestep, so I am not sure why it would carry information about the current timestep when using adj_norm.

For your second question, I think this is related to https://github.com/amazon-science/tgl/issues/7, and we are using their sampler. I don't fully understand why the same sampler would behave differently for positive vs. negative nodes; I am not an expert in C++. Besides, your inserted code seems to take the positive vs. negative label into consideration when sampling. You might want to check with TGL's authors offline. Thanks.

towardsagi commented 1 year ago

Thanks for your reply!

I see. For the first question, adj_norm does not actually introduce future information. In fact, removing it does not impact performance; you could give it a try!

However, there is still a lot to discuss for the second question. The model tries to distinguish positive pairs from negative pairs; what if there is a significant distribution difference between the two? Let's inspect the intermediate file directly. For example:

import numpy as np
import pickle
import seaborn as sns
import matplotlib.pyplot as plt

# chunks = 1 src + 1 dst + 5 neg
def plot_distribution(fn, chunks=7):
    with open(fn, 'rb') as f:
        data = pickle.load(f)
    # one list of neighbor counts per chunk (src, dst, neg_1..neg_5)
    nums = [[] for _ in range(chunks)]
    for batch in data:
        assert len(batch) % chunks == 0
        bs = len(batch) // chunks
        for i in range(chunks):
            s, t = i * bs, (i + 1) * bs
            for d in batch[s:t]:
                nums[i].append(d['num_edges'])
    x = np.array(nums)
    print(x.max(axis=1))  # maximum neighbor count per chunk
    sns.boxplot(x.T)
    plt.show()

fn = 'DATA/WIKI/train_neg_sample_neg5_bs600_hops1_neighbors30.pickle'
plot_distribution(fn)

I put the result here: https://gist.github.com/towardsagi/09913ab1b809a553f059c57c0fae87b6

It's clear that the negative samples have a different history distribution from the positive ones, which introduces bias from the data construction. The model can therefore score well even if the MixerBlock only models the count of zero-padded rows in the input. Also, the paper says that zero-padding is essential for performance, right? @CongWeilin
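
As a toy illustration of how trivially such a count difference can be exploited (shapes and values are made up; this is not the GraphMixer code):

import torch

K, d = 30, 4  # neighbor slots and feature dim (hypothetical values)
x_pos = torch.zeros(K, d); x_pos[:10] = 1.0  # positive pair: 10 real neighbors
x_neg = torch.zeros(K, d); x_neg[:11] = 1.0  # negative pair: 11 real neighbors

# A single fixed token-mixing row of ones sums over the K slots and
# recovers the neighbor count exactly, with no learning required, so a
# systematic count gap between classes is a shortcut feature.
mix = torch.ones(1, K)
print(mix @ x_pos)  # tensor([[10., 10., 10., 10.]])
print(mix @ x_neg)  # tensor([[11., 11., 11., 11.]])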

towardsagi commented 1 year ago

Dear authors, is there any progress?

Recently I found that negative sampling may be the root cause of the distribution shift between positive and negative samples. The official PyG implementation should be the correct one: negative dst nodes should not be sampled starting from index 0.

here are the code snippets: https://github.com/pyg-team/pytorch_geometric/blob/b825dc637a39dbf463bbd0fa486cd559f8d665ee/torch_geometric/loader/temporal_dataloader.py#L55-L63

whereas GraphMixer does negative sampling starting from index 0: https://github.com/CongWeilin/GraphMixer/blob/c84f1e0bee4eed848872a966b8166d741e240713/construct_subgraph.py#L9-L17
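
Here is a rough sketch of the difference I mean (node ids are made up; on bipartite datasets like WIKI, dst ids typically occupy a sub-range of the id space):

import numpy as np

rng = np.random.default_rng(0)
num_nodes = 10_000
dst = rng.integers(8_000, 10_000, size=600)  # suppose dst ids live in a sub-range

# GraphMixer-style (construct_subgraph.py above): negatives drawn from the
# full id range starting at 0, so they often land on src-only or low-activity
# nodes whose histories look systematically different from real dst nodes.
neg_all = rng.integers(0, num_nodes, size=len(dst))

# PyG TemporalDataLoader-style (linked above): negatives drawn only from the
# observed dst id range, keeping the candidate pools matched.
neg_pyg = rng.integers(dst.min(), dst.max() + 1, size=len(dst))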

towardsagi commented 1 year ago

@CongWeilin