Open angeload opened 2 months ago
Hello author,
It is impossible to create the protein embedding files because of an error while saving the file.
It seems the embeddings have different shapes and can not be saved together in a single file. Please, can you verify and indicate a solution?
  File "/home/angeloduarte/AttentionMGT-DTA/protein_process.py", line 349, in <module>
    Protein_embedding_process(dataset=dataset, fold=fold, id_train=protein_id_train, id_test=protein_id_test, dir_output=dir_output)
  File "/home/angeloduarte/AttentionMGT-DTA/protein_process.py", line 301, in Protein_embedding_process
    np.save(dir_output + '/train/fold/' + str(fold) + '/protein_embedding.npy', proteins_embedding_train, allow_pickle=True)
  File "<__array_function__ internals>", line 200, in save
  File "/home/angeloduarte/.pyenv/versions/mgtdta/lib/python3.9/site-packages/numpy/lib/npyio.py", line 521, in save
    arr = np.asanyarray(arr)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.
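For context, this ValueError is what NumPy raises since version 1.24 when an array is built from sequences of different shapes without dtype=object. The sketch below is one possible way to save such ragged data on a recent NumPy. It is not taken from protein_process.py: the helper name to_object_array and the stand-in embeddings are hypothetical, and it assumes each protein embedding is a 2-D array whose first dimension is the sequence length.

```python
import os
import tempfile

import numpy as np


def to_object_array(arrays):
    """Pack arrays of different shapes into a 1-D object array (hypothetical helper)."""
    out = np.empty(len(arrays), dtype=object)
    for i, arr in enumerate(arrays):
        out[i] = arr  # each slot holds a reference to one full array
    return out


# Three stand-in embeddings with different sequence lengths (illustrative data only).
embeddings = [np.zeros((n, 1280), dtype=np.float32) for n in (5, 7, 3)]

ragged = to_object_array(embeddings)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "protein_embedding.npy")
    np.save(path, ragged, allow_pickle=True)   # object arrays are pickled on save
    loaded = np.load(path, allow_pickle=True)  # and need allow_pickle=True on load

shapes = [a.shape for a in loaded]
print(ragged.shape, shapes)
```

Filling the object array slot by slot avoids relying on how np.array guesses the shape of nested input, which can differ when some arrays happen to share dimensions.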
I think this is a NumPy version problem. Downgrading with "pip install numpy==1.23.0" solved it for me. For the proteins, I used the structures found in the AlphaFold database by UniProt ID. First I converted each structure from PDB to FASTA by mapping its three-letter residue names to one-letter codes:
import os
from argparse import ArgumentParser

from Bio import SeqIO
from Bio.PDB import PDBParser
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from tqdm import tqdm

# Map three-letter residue names to one-letter codes.
three_to_one = {'ALA': 'A',
'ARG': 'R',
'ASN': 'N',
'ASP': 'D',
'CYS': 'C',
'GLN': 'Q',
'GLU': 'E',
'GLY': 'G',
'HIS': 'H',
'ILE': 'I',
'LEU': 'L',
'LYS': 'K',
'MET': 'M',
'MSE': 'M', # MSE (selenomethionine) is almost the same amino acid as MET; the sulfur is replaced by selenium
'PHE': 'F',
'PRO': 'P',
'PYL': 'O',
'SER': 'S',
'SEC': 'U',
'THR': 'T',
'TRP': 'W',
'TYR': 'Y',
'VAL': 'V',
'ASX': 'B',
'GLX': 'Z',
'XAA': 'X',
'XLE': 'J'}
parser = ArgumentParser()
parser.add_argument('--out_file', type=str, default="./KIBA_sequences.fasta")
parser.add_argument('--dataset', type=str, default="KIBA")
parser.add_argument('--data_dir', type=str, default='/root/sjb/chem/esm-main/pdb/KIBA/PDB_AF2', help='')
args = parser.parse_args()
biopython_parser = PDBParser()
def get_structure_from_file(file_path):
    """Return one one-letter sequence string per chain of the first model."""
    structure = biopython_parser.get_structure('random_id', file_path)
    structure = structure[0]  # use the first model only
    l = []
    for i, chain in enumerate(structure):
        seq = ''
        for res_idx, residue in enumerate(chain):
            if residue.get_resname() == 'HOH':
                continue  # skip water molecules
            c_alpha, n, c = None, None, None
            for atom in residue:
                if atom.name == 'CA':
                    c_alpha = list(atom.get_vector())
                if atom.name == 'N':
                    n = list(atom.get_vector())
                if atom.name == 'C':
                    c = list(atom.get_vector())
            # Only append the residue if it has backbone atoms, i.e. it is an amino acid.
            if c_alpha is not None and n is not None and c is not None:
                try:
                    seq += three_to_one[residue.get_resname()]
                except KeyError:
                    seq += '-'
                    print("encountered unknown AA: ", residue.get_resname(), ' in the complex ', file_path, '. Replacing it with a dash - .')
        l.append(seq)
    return l
data_dir = args.data_dir
names = os.listdir(data_dir)
if args.dataset == 'KIBA':
    sequences = []
    ids = []
    for name in tqdm(names):
        if name == '.DS_Store':
            continue
        rec_path = os.path.join(data_dir, name)
        l = get_structure_from_file(rec_path)
        # One FASTA record per chain, named <file name>_chain_<index>.
        for i, seq in enumerate(l):
            sequences.append(seq)
            ids.append(f'{name}_chain_{i}')
    records = []
    for (index, seq) in zip(ids, sequences):
        record = SeqRecord(Seq(seq), str(index))
        record.description = ''
        records.append(record)
    SeqIO.write(records, args.out_file, "fasta")
Then I ran the extract.py script from the ESM repository, "python scripts/extract.py esm2_t33_650M_UR50D kiba_sequences.fasta embeddings_output --repr_layers 33 --include per_tok --truncation_seq_length 4096", converted the resulting .pt files to .npy format, and saved them in a folder named ESM_embedding:
import glob
import os

import numpy as np
import torch


def convert_pt_to_npy(pt_file, npy_file):
    # Load the .pt file
    data = torch.load(pt_file)
    # Check for and extract the data under the 'representations' key
    if isinstance(data, dict) and 'representations' in data:
        representations = data['representations']
        # Make sure 'representations' is a dict holding the tensor of the requested layer
        if isinstance(representations, dict):
            for key, value in representations.items():
                if isinstance(value, torch.Tensor):
                    # Save the first tensor found as a .npy file
                    np.save(npy_file, value.numpy())
                    print(f"Saved data from {pt_file} (key: {key}) to {npy_file}.")
                    return
        else:
            print(f"'representations' key does not contain a dictionary in {pt_file}.")
    else:
        print(f"'representations' key not found in {pt_file}.")


# Example usage
pt_folder = '/root/sjb/chem/esm-main/pdb/KIBA/embeddings_output'
npy_folder = '/root/sjb/chem/esm-main/pdb/KIBA/ESM_embedding'
# Make sure the .npy folder exists
os.makedirs(npy_folder, exist_ok=True)
# Process every .pt file
pt_files = glob.glob(os.path.join(pt_folder, '*.pt'))
for pt_file in pt_files:
    pt_filename = os.path.basename(pt_file)
    # Strip '.pdb' and the chain suffix, keeping only the base file name.
    # Note: chains of the same structure map to the same name, so a later chain overwrites an earlier one.
    base_name = pt_filename.split('_')[0]
    base_name = base_name.split('.')[0]
    npy_file = os.path.join(npy_folder, f"{base_name}.npy")
    convert_pt_to_npy(pt_file, npy_file)
After that, protein_process.py runs without error. However, the train and test files generated for the five-fold cross-validation are huge, taking up 150 GB of disk space.
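A rough size check may help explain the disk usage. Per-residue representations from esm2_t33_650M_UR50D have 1280 values per residue, so in float32 each residue costs 5120 bytes. The sketch below only does that arithmetic. The function name estimate_bytes and the counts in the example are hypothetical, not measured from KIBA, and the idea that the fold files hold one embedding copy per sample rather than per unique protein is an assumption that has not been checked against protein_process.py.

```python
def estimate_bytes(seq_lengths, embed_dim=1280, bytes_per_value=4):
    """Bytes needed to store one per-residue embedding for each listed sequence."""
    return sum(seq_lengths) * embed_dim * bytes_per_value


# Hypothetical numbers for illustration only.
avg_length = 700

# Case 1: each of 200 unique proteins is stored once.
unique_bytes = estimate_bytes([avg_length] * 200)

# Case 2: the embedding is stored again for each of 100,000 samples.
per_sample_bytes = estimate_bytes([avg_length] * 100_000)

print(f"stored once per protein: {unique_bytes / 1e9:.2f} GB")
print(f"stored once per sample:  {per_sample_bytes / 1e9:.2f} GB")
```

If the second case matches what the script does, storing embeddings once per protein and looking them up by ID at load time would be one way to shrink the files.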
Thanks for the debugging tips! I'll implement them and share the outcome here.
I ran "train_DTA.py" on an NVIDIA A10 with the batch size set to 32. It uses about 16 GB of GPU memory, and one epoch takes about 300 s, which is quite slow.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10          On   | 00000000:A1:00.0 Off |                    0 |
|  0%   68C    P0    97W / 150W |  16008MiB / 23028MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2395526      C   python                          16006MiB |
run.log
Training on Davis, fold:1
Epoch Time MSE RMSE CI r2
1 338.2 0.71851 0.84765 0.52184 0.00186
MSE improved at epoch 1 ; best_mse: 0.71850604
2 672.69 0.6213 0.78823 0.62406 0.10358
MSE improved at epoch 2 ; best_mse: 0.6213007
3 1009.48 0.63453 0.79657 0.69185 0.19261
4 1343.18 0.78804 0.88772 0.70158 0.18175
5 1676.81 0.5552 0.74512 0.71946 0.21468
model has been saved
MSE improved at epoch 5 ; best_mse: 0.5551988
6 2010.52 0.54099 0.73552 0.72459 0.24102
model has been saved
MSE improved at epoch 6 ; best_mse: 0.54099154
7 2344.79 0.56821 0.7538 0.73395 0.26291
8 2679.8 0.51267 0.71601 0.74235 0.26899
model has been saved
MSE improved at epoch 8 ; best_mse: 0.51267076
9 3014.65 0.53715 0.73291 0.74662 0.28444
10 3350.2 0.49913 0.70649 0.74512 0.27123
model has been saved
MSE improved at epoch 10 ; best_mse: 0.49912947
11 3684.97 0.59459 0.7711 0.75886 0.29591
12 4020.58 0.46626 0.68283 0.77131 0.32595
model has been saved
MSE improved at epoch 12 ; best_mse: 0.46626085
13 4356.86 0.55538 0.74524 0.77141 0.32872
14 4689.92 0.50944 0.71375 0.77065 0.32065
15 5023.55 0.46043 0.67855 0.77954 0.33824
model has been saved
MSE improved at epoch 15 ; best_mse: 0.46043083
16 5357.15 0.4547 0.67432 0.77686 0.32661
model has been saved
MSE improved at epoch 16 ; best_mse: 0.4547045
17 5690.87 0.44925 0.67026 0.77922 0.33406
model has been saved
MSE improved at epoch 17 ; best_mse: 0.4492497
18 6023.85 0.58687 0.76608 0.77993 0.3159
19 6357.44 0.44637 0.66811 0.78223 0.33424
model has been saved
MSE improved at epoch 19 ; best_mse: 0.4463707
20 6691.65 0.45362 0.67352 0.7877 0.35712
21 7025.6 0.44647 0.66818 0.78955 0.32525
22 7361.13 0.43501 0.65956 0.79135 0.36941
model has been saved
MSE improved at epoch 22 ; best_mse: 0.43501356
23 7698.18 0.43182 0.65713 0.7896 0.34477
model has been saved
MSE improved at epoch 23 ; best_mse: 0.4318245
24 8036.83 0.46952 0.68522 0.79155 0.35701
25 8373.4 0.42616 0.65281 0.79412 0.36326
model has been saved
MSE improved at epoch 25 ; best_mse: 0.4261638
26 8709.37 0.40033 0.63272 0.79835 0.39689
model has been saved
MSE improved at epoch 26 ; best_mse: 0.40032905
27 9046.42 0.44054 0.66373 0.7991 0.34304
28 9382.96 0.41973 0.64787 0.79203 0.36607
29 9719.2 0.47403 0.6885 0.79983 0.37105
30 10056.18 0.42036 0.64835 0.80163 0.39142
31 10390.61 0.46906 0.68488 0.80842 0.36977
32 10724.19 0.41877 0.64712 0.80195 0.39623
33 11058.63 0.3937 0.62746 0.80544 0.41278
I have been running this program for almost two days and have completed fewer than 300 epochs.
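The Time column in run.log appears to be cumulative (338.2, 672.69, 1009.48, ...), so the cost of one epoch is the difference between consecutive rows. The sketch below computes those differences and projects a total run time. It assumes the whitespace-separated six-column layout shown above, and the names epoch_durations and projected_hours are hypothetical.

```python
def epoch_durations(log_text):
    """Return per-epoch durations in seconds from the cumulative Time column."""
    times = []
    for line in log_text.splitlines():
        parts = line.split()
        # Metric rows have six fields: epoch, time, MSE, RMSE, CI, r2.
        if len(parts) == 6 and parts[0].isdigit():
            times.append(float(parts[1]))
    return [round(later - earlier, 2) for earlier, later in zip(times, times[1:])]


def projected_hours(durations, num_epochs):
    """Estimate total hours for num_epochs from the mean epoch duration."""
    mean_seconds = sum(durations) / len(durations)
    return mean_seconds * num_epochs / 3600


# The first rows of the log above, copied as a sample.
sample = """Epoch Time MSE RMSE CI r2
1 338.2 0.71851 0.84765 0.52184 0.00186
MSE improved at epoch 1 ; best_mse: 0.71850604
2 672.69 0.6213 0.78823 0.62406 0.10358
MSE improved at epoch 2 ; best_mse: 0.6213007
3 1009.48 0.63453 0.79657 0.69185 0.19261
"""

durations = epoch_durations(sample)
print(durations)
print(f"{projected_hours(durations, 300):.1f} hours for 300 epochs")
```

The nvidia-smi snapshot above shows 3% GPU utilization, which may mean the time goes to data loading rather than GPU compute, although a single snapshot is not conclusive.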