Hi,
For Q1, could you please run the following command to sanity-check the data:

```
python -m core.data_utils.load_cora
```
The expected output:
```
~/TAPE$ python -m core.data_utils.load_cora
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.x
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.tx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.allx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.y
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ty
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ally
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.graph
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.test.index
Processing...
Done!
Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708], num_nodes=2708, train_id=[1624], val_id=[542], test_id=[542])
Title: The megaprior heuristic for discovering protein sequence patterns
Abstract: Several computer algorithms for discovering patterns in groups of protein sequences are in use that are based on fitting the parameters of a statistical model to a group of related sequences. These include hidden Markov model (HMM) algorithms for multiple sequence alignment, and the MEME and Gibbs sampler algorithms for discovering motifs. These algorithms are sometimes prone to producing models that are incorrect because two or more patterns have been combined. The statistical model produced in this situation is a convex combination (weighted average) of two or more different models. This paper presents a solution to the problem of convex combinations in the form of a heuristic based on using extremely low variance Dirichlet mixture priors as part of the statistical model. This heuristic, which we call the megaprior heuristic, increases the strength (i.e., decreases the variance) of the prior in proportion to the size of the sequence dataset. This causes each column in the final model to strongly resemble the mean of a single component of the prior, regardless of the size of the dataset. We describe the cause of the convex combination problem, analyze it mathematically, motivate and describe the implementation of the megaprior heuristic, and show how it can effectively eliminate the problem of convex combinations in protein sequence pattern discovery.
```
If you still encounter the same data-missing error, could you please re-download the cora_orig files? I recently updated the cora_orig dataset, and your version might be outdated, which would lead to this issue.
For Q2, my OGB version is 1.3.6, too. It appears to me that the error is not due to a version conflict. During preprocessing, I use `transform=T.ToSparseTensor()` to convert the dense adjacency matrix into a sparse one (check the code here). Only the sparse adjacency matrix `data.adj_t` has the `to_symmetric` attribute. If you did not apply `transform=T.ToSparseTensor()`, `adj_t` is a standard `Tensor` and thus doesn't have the `to_symmetric` attribute.
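To illustrate the difference, here is a minimal sketch (the `root` path is illustrative and not necessarily the repository's exact call; it assumes `ogb` and `torch_geometric` are installed):

```python
import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset

# With the transform: the graph is stored as a torch_sparse.SparseTensor
# in data.adj_t, which provides to_symmetric().
dataset = PygNodePropPredDataset(name='ogbn-arxiv', root='dataset',
                                 transform=T.ToSparseTensor())
data = dataset[0]
data.edge_index = data.adj_t.to_symmetric()  # works

# Without the transform: the graph stays as a plain edge_index Tensor,
# so there is no SparseTensor and no to_symmetric() to call.
dataset = PygNodePropPredDataset(name='ogbn-arxiv', root='dataset')
data = dataset[0]
print(type(data.edge_index))  # <class 'torch.Tensor'>
```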
So, I suggest you remove `dataset/ogbn_arxiv` and rerun the command:

```
WANDB_DISABLED=True TOKENIZERS_PARALLELISM=False CUDA_VISIBLE_DEVICES=0,1 python -m core.trainLM dataset ogbn-arxiv seed 0 >> ogbn-arxiv_lm.out
```
Please let me know if you have further questions. Thanks!
Hi,
Thanks for your prompt answers.
Q2 disappeared after re-setting up the environment.
However, Q1 is still the same as before. I downloaded everything on 07/12, so it should be the latest uploaded version. I manually checked the file and found that:
Hope it helps.
Hi,
Sorry for the late reply. Could you please check if the discussions in #5 are helpful?
Hello, I also have this issue on Linux. I think it is essentially caused by how the extraction file paths are resolved, which is not obvious from the code. I'll paste my code here; hopefully it helps! It already works for me with the current version of the dataset. @zhiqiangzhongddu
```python
import os

import numpy as np

# `root_path` and `get_cora_casestudy` come from the repository's
# core.data_utils module.


def parse_cora():
    path = root_path + 'dataset/cora_orig/cora'
    idx_features_labels = np.genfromtxt(
        "{}.content".format(path), dtype=np.dtype(str))
    data_X = idx_features_labels[:, 1:-1].astype(np.float32)
    labels = idx_features_labels[:, -1]
    class_map = {x: i for i, x in enumerate([
        'Case_Based', 'Genetic_Algorithms', 'Neural_Networks',
        'Probabilistic_Methods', 'Reinforcement_Learning',
        'Rule_Learning', 'Theory'])}
    data_Y = np.array([class_map[l] for l in labels])
    data_citeid = idx_features_labels[:, 0]
    idx = np.array(data_citeid, dtype=np.dtype(str))
    idx_map = {j: i for i, j in enumerate(idx)}
    edges_unordered = np.genfromtxt(
        "{}.cites".format(path), dtype=np.dtype(str))
    edges = np.array(list(map(idx_map.get, edges_unordered.flatten()))
                     ).reshape(edges_unordered.shape)
    # Drop edges whose endpoints are not in the content file, then
    # symmetrize and deduplicate.
    data_edges = np.array(edges[~(edges == None).max(1)], dtype='int')
    data_edges = np.vstack((data_edges, np.fliplr(data_edges)))
    return data_X, data_Y, data_citeid, np.unique(
        data_edges, axis=0).transpose()


def get_raw_text_cora(use_text=False, seed=0):
    data, data_citeid = get_cora_casestudy(seed)
    if not use_text:
        return data, None

    # Map each paper id to the filename of its extraction file.
    with open('dataset/cora_orig/mccallum/cora/papers') as f:
        lines = f.readlines()
    pid_filename = {}
    for line in lines:
        pid = line.split('\t')[0]
        fn = line.split('\t')[1]
        pid_filename[pid] = fn

    path = 'dataset/cora_orig/mccallum/cora/extractions/'
    # path = 'dataset/cora/extractions/'
    # Dump the directory listing, which is handy for debugging.
    values = os.listdir(path)
    with open("extraction.txt", 'w') as output:
        for row in values:
            output.write(str(row) + '\n')

    text = []
    not_loaded = []
    i = 0
    for pid in data_citeid:
        fn = pid_filename[pid]
        try:
            # Depending on how the archive was unpacked, ':' in the
            # stored filenames may appear as '_' on disk (or vice
            # versa), so try both spellings.
            if os.path.exists(path + fn):
                pathfn = path + fn
            elif os.path.exists(path + fn.replace(":", "_")):
                pathfn = path + fn.replace(":", "_")
            elif os.path.exists(path + fn.replace("_", ":")):
                pathfn = path + fn.replace("_", ":")
            with open(pathfn) as f:
                lines = f.read().splitlines()
            for line in lines:
                if 'Title:' in line:
                    ti = line
                if 'Abstract:' in line:
                    ab = line
            text.append(ti + '\n' + ab)
        except Exception:
            not_loaded.append(pathfn)
            i += 1
    # print(f"not loaded {i} papers.")
    # print(f"not loaded papers: {not_loaded}")
    return data, text
```
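The crucial part is the filename fallback: as the `FileNotFoundError` in the original question shows, the extraction filenames listed in `papers` contain `:` characters (e.g. `http:##www.cs.ucl.ac.uk#...`), and depending on how the archive was unpacked these may appear as `_` on disk, so the loader tries both spellings before giving up on a file.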
Hi,
Thanks for the nice work and for sharing the code. I'm trying to run it but ran into some problems.
Q1: Running the command

```
WANDB_DISABLED=True TOKENIZERS_PARALLELISM=False CUDA_VISIBLE_DEVICES=0,1 python -m core.trainLM dataset cora seed 0 >> cora_lm.out
```

I got a data-missing error:

```
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/cora_orig/mccallum/cora/extractions/http:##www.cs.ucl.ac.uk#staff#t.yu#ep97.ps'
```
Q2: Running the command

```
WANDB_DISABLED=True TOKENIZERS_PARALLELISM=False CUDA_VISIBLE_DEVICES=0,1 python -m core.trainLM dataset ogbn-arxiv seed 0 >> ogbn-arxiv_lm.out
```

I got:

```
File "/home/xxx/TAPE/core/data_utils/load_arxiv.py", line 24, in get_raw_text_arxiv
    data.edge_index = data.adj_t.to_symmetric()
AttributeError: 'Tensor' object has no attribute 'to_symmetric'
```

I guess it's because of an OGB version conflict; could you tell me your OGB version or update your code? My OGB version is 1.3.6.
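For reference, a quick way to check the installed version (this assumes the `ogb` package exposes `__version__`, which recent releases do):

```python
import ogb
print(ogb.__version__)  # e.g. 1.3.6
```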