AlexanderKroll / ESP


Missing training data #4

Closed: wujialu closed this issue 11 months ago

wujialu commented 1 year ago

When I tried to load data from "data/splits/df_train_with_ESM1b_ts.pkl", I got "UnpicklingError: invalid load key, 'v'." I checked the size of the file, and it seems the pkl file is not complete. Can you provide the complete data file? Thanks very much.

AlexanderKroll commented 1 year ago

Dear wujialu,

I downloaded the file "data/splits/df_train_with_ESM1b_ts.pkl" manually from the GitHub repository and I could open it without any error. The file size (~100 MB) also seems fine. Maybe there was an error while you downloaded the file? Can you try to manually download the file from the GitHub repository and replace the file that you downloaded before?
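As a side note for anyone hitting the same error: "invalid load key, 'v'" typically means that the downloaded file is not the pickle itself but a small text placeholder, for example a Git LFS pointer, which begins with the word "version" (hence the load key 'v'). A minimal check, using the path from the issue:

# Inspect the first bytes of the downloaded file. A pickle written by
# pandas starts with the binary byte 0x80; readable ASCII text such as
# b"version https://git-lfs..." means a pointer or an HTML error page
# was downloaded instead of the actual data.
with open("data/splits/df_train_with_ESM1b_ts.pkl", "rb") as f:
    print(f.read(32))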

AlexanderKroll commented 1 year ago

Dear luohuiqian,

I assume you are referring to the datasets_PubChem variable in "ESP/notebooks_and_code/additional_code/data_preprocessing.py"? If yes, there was indeed a missing folder. But I adjusted the data_preprocessing.py code some weeks ago and uploaded an additional folder to Zenodo (https://doi.org/10.5281/zenodo.8016269) with the missing files (see also the Readme page of this repo and issue #3: https://github.com/AlexanderKroll/ESP/issues/3). I hope this helps.

luohuiqian commented 1 year ago

Dear Alexander Kroll,

Thank you for your help, but I still found something missing in your dataset. When I run the line df_GO_UID = pd.read_csv(join(CURRENT_DIR, "alex_data", "df_GO_UID.csv")), I cannot find the file "df_GO_UID.csv" in your dataset. If convenient, I hope you can upload the data soon.


AlexanderKroll commented 1 year ago

Dear luohuiqian,

I searched in this GitHub repository, but I could not find the line "df_GO_UID = pd.read_csv(join(CURRENT_DIR, "alex_data", "df_GO_UID.csv"))". Can you tell me in which of the files this line of code is?

luohuiqian commented 1 year ago

Dear Alexander Kroll, I'm very sorry, I seem to have made a mistake. I didn't find an error in the code mentioned above, but when I ran the line rep_dict = torch.load(join(CURRENT_DIR, "..", "data", "enzyme_data", "all_sequences_enz_sub.pt")), I found some data missing: this file is not in your data folder. (The original comment included a screenshot of the code you provided.)


AlexanderKroll commented 1 year ago

Dear luohuiqian,

I did not upload this file because of its large size. But as I wrote in the text above the cell, you have all the data needed to create this file on your own:

" To calculate the ESM-1b vectors, we used the model and code provided by the Facebook Research team: https://github.com/facebookresearch/esm. The following command line was used to calculate the representations:

python extract.py esm1b_t33_650M_UR50S \path_to_fasta_file\all_sequences.fasta \path_to_store_representations\all_sequences_esm1b_reps--repr_layers 33 --include mean 

" If you should not manage to create the file on your own, please let me know. In this case, I can either help you or upload the file to some repository.

luohuiqian commented 1 year ago

I'm sorry, but the data contained in the 'all_sequences.fasta' file is too large, and it seems difficult for my computer to run the command you provided. If possible, I hope you can provide me with this file directly. Thank you very much.


AlexanderKroll commented 1 year ago

Dear luohuiqian,

I will create and upload the file for you in the next few days, but may I first ask for which purpose you need those files? The code in this repository is meant for re-training the model from our paper, and if you have trouble creating the fasta or ESM-1b file, you might also not be able to reproduce all the other steps necessary to repeat the analyses from the paper.

If you are simply interested in using/applying the trained model, I created a separate repository that allows easy use of the trained model in Python and automatically creates the required representations for all enzymes: https://github.com/AlexanderKroll/ESP_prediction_function. Alternatively, you can use our web server, which you can run directly in any browser without any requirements: https://esp.cs.hhu.de/

EasternCaveMan commented 1 year ago

Dear Alexander Kroll,

I ran the following command line to calculate the ESM-1b vectors:

python extract.py esm1b_t33_650M_UR50S \path_to_fasta_file\all_sequences.fasta \path_to_store_representations\all_sequences_esm1b_reps --repr_layers 33 --include mean

Now I have a folder all_sequences_esm1b_reps containing all the .pt files, and I am unsure how to merge them into the single file named all_sequences_enz_sub.pt. Would you please share the code you used for this step? Kind regards, Vahid Atabaigielmi

xukaili commented 1 year ago


To run it, first download the model weights and then call the extraction script:

wget -c https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt
python extract.py esm1b_t33_650M_UR50S ./all_sequences.fasta ./output_dir --include mean

EasternCaveMan commented 1 year ago


Dear @xukaili, I hope this message finds you well. In reference to the command line you provided for extracting ESM-1b vectors, I have successfully obtained individual representations stored as .pt files in the directory all_sequences_esm1b_reps. Now I need assistance in merging these individual files into a single file named all_sequences_enz_sub.pt.

Lake-D commented 1 year ago

Dear Alexander Kroll, thank you very much for your work. I encountered the following problem while reproducing your article: when I fed the dataset from your program into the ESM-1b model, I found that many protein amino acid sequences in the dataset were too long and exceeded the maximum input size of ESM-1b (ValueError: sequence length 1657 above maximum sequence length of 1024). I'd like to ask how you handled this. Thank you very much for your answer; I will follow your research.

AlexanderKroll commented 12 months ago


Dear ziqi-d,

for those sequences that were too long (> 1022 amino acids), we followed the procedure of the ESM-1b paper and simply used only the first 1022 amino acids of each sequence.
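For illustration, a minimal sketch of this truncation step on a plain-text FASTA file (file names are placeholders, not taken from the repository):

# Truncate every sequence in a FASTA file to its first 1022 amino acids,
# so it fits into ESM-1b's 1024-token window (two tokens are reserved
# for the begin/end markers).
MAX_LEN = 1022

with open("all_sequences.fasta") as fin, open("all_sequences_1022.fasta", "w") as fout:
    header, seq = None, []
    for line in fin:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                fout.write(header + "\n" + "".join(seq)[:MAX_LEN] + "\n")
            header, seq = line, []
        else:
            seq.append(line)
    if header is not None:  # write the last record
        fout.write(header + "\n" + "".join(seq)[:MAX_LEN] + "\n")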

AlexanderKroll commented 12 months ago


Dear atabaigi,

Here is code that you could use to merge these files into a single .pt file:

import torch
import os
from os.path import join

# Folder containing the per-sequence .pt files written by extract.py,
# and the folder where the merged file should be saved.
PT_DIR = "/your/directory/with/all/pt/files"
SAVE_DIR = "/dir/to/save/new/file"

new_dict = {}
all_files = os.listdir(PT_DIR)

for file in all_files:
    try:
        # Each file holds the mean-pooled layer-33 representation of one sequence.
        rep = torch.load(join(PT_DIR, file))
        new_dict[file] = rep["mean_representations"][33].numpy()
    except Exception:
        # Report files that could not be read.
        print(file)

torch.save(new_dict, join(SAVE_DIR, "all_sequences_enz_sub.pt"))
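For completeness, the merged file can afterwards be loaded back in one call (continuing with the variables defined above); each key is the name of one of the original .pt files and each value is a 1280-dimensional numpy array:

# Load the merged dictionary of mean ESM-1b representations.
rep_dict = torch.load(join(SAVE_DIR, "all_sequences_enz_sub.pt"))
print(len(rep_dict))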

I hope this helps.

Lake-D commented 12 months ago


Dear Alexander Kroll, Thank you very much for your reply.