Hi,
For Q1, could you please run the following command to sanity-check the data:

```
python -m core.data_utils.load_cora
```
The expected output:
```
~/TAPE$ python -m core.data_utils.load_cora
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.x
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.tx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.allx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.y
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ty
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ally
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.graph
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.test.index
Processing...
Done!
Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708], num_nodes=2708, train_id=[1624], val_id=[542], test_id=[542])
Title: The megaprior heuristic for discovering protein sequence patterns
Abstract: Several computer algorithms for discovering patterns in groups of protein sequences are in use that are based on fitting the parameters of a statistical model to a group of related sequences. These include hidden Markov model (HMM) algorithms for multiple sequence alignment, and the MEME and Gibbs sampler algorithms for discovering motifs. These algorithms are sometimes prone to producing models that are incorrect because two or more patterns have been combined. The statistical model produced in this situation is a convex combination (weighted average) of two or more different models. This paper presents a solution to the problem of convex combinations in the form of a heuristic based on using extremely low variance Dirichlet mixture priors as part of the statistical model. This heuristic, which we call the megaprior heuristic, increases the strength (i.e., decreases the variance) of the prior in proportion to the size of the sequence dataset. This causes each column in the final model to strongly resemble the mean of a single component of the prior, regardless of the size of the dataset. We describe the cause of the convex combination problem, analyze it mathematically, motivate and describe the implementation of the megaprior heuristic, and show how it can effectively eliminate the problem of convex combinations in protein sequence pattern discovery.
```
If you still encounter the same data-missing error, could you please re-download the cora_orig files? I recently updated the cora_orig dataset, and your version might be outdated, which would lead to this issue.
For Q2, my OGB version is 1.3.6, too. It appears to me that the error is not due to a version conflict. During preprocessing, I use `transform=T.ToSparseTensor()` to convert the dense adjacency matrix into a sparse one (check the code here). Only the sparse adjacency matrix `data.adj_t` has the `to_symmetric` attribute. If you did not apply `transform=T.ToSparseTensor()`, `adj_t` is a standard `Tensor` and thus doesn't have the `to_symmetric` attribute.
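To illustrate the difference, here is a minimal sketch (the `root` path is illustrative and not necessarily the repository's exact call; it assumes `ogb` and `torch_geometric` are installed):

```python
import torch_geometric.transforms as T
from ogb.nodeproppred import PygNodePropPredDataset

# With the transform: the graph is stored as a torch_sparse.SparseTensor
# in data.adj_t, which provides to_symmetric().
dataset = PygNodePropPredDataset(name='ogbn-arxiv', root='dataset',
                                 transform=T.ToSparseTensor())
data = dataset[0]
data.edge_index = data.adj_t.to_symmetric()  # works

# Without the transform: the graph stays as a plain edge_index Tensor,
# so there is no SparseTensor and no to_symmetric() to call.
dataset = PygNodePropPredDataset(name='ogbn-arxiv', root='dataset')
data = dataset[0]
print(type(data.edge_index))  # <class 'torch.Tensor'>
```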
So, I suggest you remove `dataset/ogbn_arxiv` and rerun the command:

```
WANDB_DISABLED=True TOKENIZERS_PARALLELISM=False CUDA_VISIBLE_DEVICES=0,1 python -m core.trainLM dataset ogbn-arxiv seed 0 >> ogbn-arxiv_lm.out
```
Please let me know if you have further questions. Thanks!
Hi,
Thanks for your prompt answers.
Q2 disappeared after re-setting up the environment.
However, Q1 is still the same as before. I downloaded everything on 07/12, so it should be the latest uploaded version. I manually checked the file and found that:
Hope it helps.
Hi,
Sorry for the late reply. Could you please check if the discussions in #5 are helpful?
Hello, I also have this issue on Linux. I think it is essentially caused by how the extraction file paths are resolved, which is not obvious from the code. I'll paste my code here; hopefully it helps! It already works for me with the current version of the dataset. @zhiqiangzhongddu
```python
import os

import numpy as np

# `root_path` and `get_cora_casestudy` come from the repository's
# core.data_utils module.


def parse_cora():
    path = root_path + 'dataset/cora_orig/cora'
    idx_features_labels = np.genfromtxt(
        "{}.content".format(path), dtype=np.dtype(str))
    data_X = idx_features_labels[:, 1:-1].astype(np.float32)
    labels = idx_features_labels[:, -1]
    class_map = {x: i for i, x in enumerate([
        'Case_Based', 'Genetic_Algorithms', 'Neural_Networks',
        'Probabilistic_Methods', 'Reinforcement_Learning',
        'Rule_Learning', 'Theory'])}
    data_Y = np.array([class_map[l] for l in labels])
    data_citeid = idx_features_labels[:, 0]
    idx = np.array(data_citeid, dtype=np.dtype(str))
    idx_map = {j: i for i, j in enumerate(idx)}
    edges_unordered = np.genfromtxt(
        "{}.cites".format(path), dtype=np.dtype(str))
    edges = np.array(list(map(idx_map.get, edges_unordered.flatten()))
                     ).reshape(edges_unordered.shape)
    # Drop edges whose endpoints are not in the content file, then
    # symmetrize and deduplicate.
    data_edges = np.array(edges[~(edges == None).max(1)], dtype='int')
    data_edges = np.vstack((data_edges, np.fliplr(data_edges)))
    return data_X, data_Y, data_citeid, np.unique(
        data_edges, axis=0).transpose()


def get_raw_text_cora(use_text=False, seed=0):
    data, data_citeid = get_cora_casestudy(seed)
    if not use_text:
        return data, None

    # Map each paper id to the filename of its extraction file.
    with open('dataset/cora_orig/mccallum/cora/papers') as f:
        lines = f.readlines()
    pid_filename = {}
    for line in lines:
        pid = line.split('\t')[0]
        fn = line.split('\t')[1]
        pid_filename[pid] = fn

    path = 'dataset/cora_orig/mccallum/cora/extractions/'
    # path = 'dataset/cora/extractions/'
    # Dump the directory listing, which is handy for debugging.
    values = os.listdir(path)
    with open("extraction.txt", 'w') as output:
        for row in values:
            output.write(str(row) + '\n')

    text = []
    not_loaded = []
    i = 0
    for pid in data_citeid:
        fn = pid_filename[pid]
        try:
            # Depending on how the archive was unpacked, ':' in the
            # stored filenames may appear as '_' on disk (or vice
            # versa), so try both spellings.
            if os.path.exists(path + fn):
                pathfn = path + fn
            elif os.path.exists(path + fn.replace(":", "_")):
                pathfn = path + fn.replace(":", "_")
            elif os.path.exists(path + fn.replace("_", ":")):
                pathfn = path + fn.replace("_", ":")
            with open(pathfn) as f:
                lines = f.read().splitlines()
            for line in lines:
                if 'Title:' in line:
                    ti = line
                if 'Abstract:' in line:
                    ab = line
            text.append(ti + '\n' + ab)
        except Exception:
            not_loaded.append(pathfn)
            i += 1
    # print(f"not loaded {i} papers.")
    # print(f"not loaded papers: {not_loaded}")
    return data, text
```
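The crucial part is the filename fallback: as the `FileNotFoundError` in the original question shows, the extraction filenames listed in `papers` contain `:` characters (e.g. `http:##www.cs.ucl.ac.uk#...`), and depending on how the archive was unpacked these may appear as `_` on disk, so the loader tries both spellings before giving up on a file.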
Hi,
Thanks for the nice work and for sharing the code. I'm trying to run it but ran into some problems.
Q1: Running the command

```
WANDB_DISABLED=True TOKENIZERS_PARALLELISM=False CUDA_VISIBLE_DEVICES=0,1 python -m core.trainLM dataset cora seed 0 >> cora_lm.out
```

I got a data-missing error:

```
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/cora_orig/mccallum/cora/extractions/http:##www.cs.ucl.ac.uk#staff#t.yu#ep97.ps'
```
Q2: Running the command

```
WANDB_DISABLED=True TOKENIZERS_PARALLELISM=False CUDA_VISIBLE_DEVICES=0,1 python -m core.trainLM dataset ogbn-arxiv seed 0 >> ogbn-arxiv_lm.out
```

I got:

```
File "/home/xxx/TAPE/core/data_utils/load_arxiv.py", line 24, in get_raw_text_arxiv
    data.edge_index = data.adj_t.to_symmetric()
AttributeError: 'Tensor' object has no attribute 'to_symmetric'
```

I guess it's because of an OGB version conflict; could you tell me your OGB version or update your code? My OGB version is 1.3.6.
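For reference, a quick way to check the installed version (this assumes the `ogb` package exposes `__version__`, which recent releases do):

```python
import ogb
print(ogb.__version__)  # e.g. 1.3.6
```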