Closed — wujialu closed this issue 11 months ago
Dear wujialu,
I downloaded the file "data/splits/df_train_with_ESM1b_ts.pkl" manually from the GitHub repository and could open it without any error. The file size (~100 MB) also seems fine. Maybe there was an error while you downloaded the file? Can you try to download the file manually from the GitHub repository and replace the file that you downloaded before?
Dear luohuiqian,
I assume you are referring to the datasets_PubChem variable in "ESP/notebooks_and_code/additional_code/data_preprocessing.py"? If yes, there was indeed a missing folder. But I adjusted the data_preprocessing.py code some weeks ago and uploaded an additional folder to Zenodo (https://doi.org/10.5281/zenodo.8016269) with the missing files (see also the README page of this repo and issue #3: https://github.com/AlexanderKroll/ESP/issues/3). I hope this helps.
Dear Alexander Kroll,
Thank you for your help. But I still found something missing in your dataset. When I run the line df_GO_UID = pd.read_csv(join(CURRENT_DIR, "alex_data", "df_GO_UID.csv")), I cannot find the file "df_GO_UID.csv" in your dataset. If convenient, I hope you can upload the data soon.
Dear luohuiqian,
I searched in this GitHub repository, but I could not find the line df_GO_UID = pd.read_csv(join(CURRENT_DIR, "alex_data", "df_GO_UID.csv")). Can you tell me in which of the files this line of code is?
Dear Alexander Kroll,
I'm very sorry, I seem to have made a mistake. I didn't find the error in the code mentioned above, but when I ran the line rep_dict = torch.load(join(CURRENT_DIR, "..", "data", "enzyme_data", "all_sequences_enz_sub.pt")), I found some data missing: I could not find this file in your data folder. (The attached screenshot shows the code you provided.)
Dear luohuiqian,
I did not upload this file because of its large size. But as I write in the text above the cell, you have all the data to create this file on your own:
" To calculate the ESM-1b vectors, we used the model and code provided by the Facebook Research team: https://github.com/facebookresearch/esm. The following command line was used to calculate the representations:
python extract.py esm1b_t33_650M_UR50S \path_to_fasta_file\all_sequences.fasta \path_to_store_representations\all_sequences_esm1b_reps--repr_layers 33 --include mean
" If you should not manage to create the file on your own, please let me know. In this case, I can either help you or upload the file to some repository.
I'm sorry, but because the data and batches contained in the 'all_sequences.fasta' file are too large, it seems difficult for my computer to run the command you provided. If possible, I hope you can provide me with this file directly. Thank you very much.
Dear luohuiqian,
I will create and upload the file for you in the next few days, but can I first ask for which purpose you need those files? The code in this repository is meant for re-training the model from our paper, but if you have trouble creating the fasta or ESM-1b file, you might also not be able to reproduce all the other steps necessary to repeat the analyses from the paper.
If you are simply interested in using/applying the trained model, I created a separate repository that allows easy use of the trained model in Python and automatically creates the required representations for all enzymes: https://github.com/AlexanderKroll/ESP_prediction_function. Alternatively, you can use our web server, which you can run directly in any browser without any requirements: https://esp.cs.hhu.de/
Dear Alexander Kroll,
I ran the following command line to calculate the ESM-1b vectors:
python extract.py esm1b_t33_650M_UR50S \path_to_fasta_file\all_sequences.fasta \path_to_store_representations\all_sequences_esm1b_reps --repr_layers 33 --include mean
Now I have a folder all_sequences_esm1b_reps containing all the .pt files. I am unsure how to merge them into the single file named all_sequences_enz_sub.pt. Would you please share the code with which you did this step?
Kind regards
Vahid Atabaigielmi
To run:
wget -c https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt
python extract.py esm1b_t33_650M_UR50S ./all_sequences.fasta ./output_dir --include mean
Dear @xukaili, I hope this message finds you well. In reference to the command line you provided for extracting ESM-1b vectors, I have successfully obtained individual representations stored as .pt files in the directory all_sequences_esm1b_reps. Now I need assistance in merging these individual files into a single file named all_sequences_enz_sub.pt.
Dear Alexander Kroll,
Thank you very much for your work. I encountered the following problem while reproducing your article. When I fed the dataset from your program into the ESM-1b model, I found that many protein amino acid sequences in the dataset were too long, exceeding the maximum input length of ESM-1b (ValueError: sequence length 1657 above maximum sequence length of 1024). So I'd like to ask how you handled this. Thank you very much for your answer. I will follow your research.
Dear ziqi-d,
for those sequences that were too long (> 1022 amino acids), we followed the procedure of the ESM-1b paper and simply used only the first 1022 amino acids of the sequence.
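The truncation step described above can be sketched as follows (the function name and the dict-of-sequences layout are my own illustration, not code from the repository):

```python
ESM1B_MAX_LEN = 1022  # ESM-1b's positional embeddings cover 1022 residues

def truncate_for_esm1b(sequences, max_len=ESM1B_MAX_LEN):
    """Keep only the first `max_len` amino acids of each sequence,
    mirroring the truncation procedure of the ESM-1b paper."""
    return {seq_id: seq[:max_len] for seq_id, seq in sequences.items()}
```

Applying this to the fasta records before running extract.py avoids the "sequence length ... above maximum sequence length of 1024" error.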
Dear atabaigi,
here is code that you could use to merge these files into a single pt file:
import torch
import os
from os.path import join

PT_DIR = "/your/directory/with/all/pt/files"
SAVE_DIR = "/dir/to/save/new/file"

new_dict = {}
all_files = os.listdir(PT_DIR)

for file in all_files:
    try:
        # Each .pt file holds the representations for one sequence;
        # keep the mean representation of the final layer (33).
        rep = torch.load(join(PT_DIR, file))
        new_dict[file] = rep["mean_representations"][33].numpy()
    except Exception:
        # Print the name of any file that could not be loaded.
        print(file)

torch.save(new_dict, join(SAVE_DIR, "all_sequences_enz_sub.pt"))
I hope this helps
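One caveat worth noting about the merge snippet above: the dictionary keys are the raw file names produced by extract.py (e.g. "SOMEID.pt"), not the bare sequence IDs from the fasta headers. If downstream code looks entries up by sequence ID, a small re-keying helper may be needed (this helper is my own sketch, not part of the repository):

```python
def strip_pt_suffix(rep_dict):
    """Re-key a representation dict so entries are indexed by sequence ID
    instead of the raw '<id>.pt' file name."""
    return {
        (key[:-3] if key.endswith(".pt") else key): value
        for key, value in rep_dict.items()
    }
```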
Dear Alexander Kroll,
Thank you very much for your reply.
When I tried to load data from "data/splits/df_train_with_ESM1b_ts.pkl", I got "UnpicklingError: invalid load key, 'v'." I checked the size of the file, and it seems that the .pkl file is not complete. Can you provide the complete data file? Thank you very much.