HubertTang / PLASMe


How can I train the model on a new dataset so that it adapts to different datasets? #5

Open a-piece-of-teemo opened 7 months ago

HubertTang commented 7 months ago

Hi,

Sorry for the late reply.

I just released the train_pc_model.py script, which demonstrates how to train a PC-based Transformer model using a customized database. This script provides instructions on generating the PC database using a customized protein database, converting sequences into numerical vectors, and training and predicting with the model. If you have any further questions regarding this, please feel free to let me know.

Best, Xubo

a-piece-of-teemo commented 7 months ago

Dear Xubo,

Firstly, I would like to express my sincere gratitude for your response and for providing the train_pc_model.py script. It has been incredibly helpful in understanding how to train a PC-based Transformer model using a custom database. Building on this, I have a few questions that I hope you could help me with:

In addition to the PC-based code, do you also have code related to the nt-bpe-based model training scripts? If so, could you possibly share the relevant scripts?

Furthermore, in your train_pc_model.py, you mentioned train_pos_data and train_neg_data. Could you kindly explain what these data sets refer to? How are they distinguished as 'pos' or 'neg'?

I eagerly await your response and once again, thank you for your assistance!

HubertTang commented 7 months ago

Hi,

In addition to the PC-based code, do you also have code related to the nt-bpe-based model training scripts? If so, could you possibly share the relevant scripts?

I wrote two scripts to demonstrate how to train and tokenize using the BPE model (BPE.txt) and train a BPE-based Transformer (BPE_Transformer.txt). Please note the following instructions:

  1. To train the BPE model, you will need to install sentencepiece.
  2. The input files for BPE_Transformer should be in numpy array format. Therefore, after obtaining the tokenized sentence file using BPE (in text format), you need to save it as a numpy array file. This numpy array file can then be loaded by BPE_Transformer. Each row in the numpy array file represents a 'sentence' and should be padded with zeros.
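For concreteness, here is a minimal sketch of step 1 and the tokenization that precedes step 2, assuming sentencepiece is installed. The file names and vocabulary size are illustrative and are not necessarily the ones used in BPE.txt.

```python
import sentencepiece as spm

# Train an nt-BPE model on the complete plasmid sequences
# (one sequence per line; long plasmids may need max_sentence_length raised).
spm.SentencePieceTrainer.train(
    input="plasmid_complete.txt",
    model_prefix="nt_bpe",
    vocab_size=5000,
    model_type="bpe",
)

# Tokenize the training subsequences into token ids (text format).
sp = spm.SentencePieceProcessor(model_file="nt_bpe.model")
with open("sampled_subseqs.txt") as fin, open("tokenized.txt", "w") as fout:
    for line in fin:
        ids = sp.encode(line.strip(), out_type=int)
        fout.write(" ".join(map(str, ids)) + "\n")
```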

Could you kindly explain what these data sets refer to? How are they distinguished as 'pos' or 'neg'?

In this example training script, I demonstrated a binary classification task where "pos" represents positive samples and "neg" represents negative samples. You can modify the input data and loss function according to your task.
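As a rough sketch of such a binary pos/neg setup in PyTorch (this is not the actual train_pc_model.py; the stand-in model, file names, and hyperparameters are purely illustrative):

```python
import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class TinyClassifier(nn.Module):
    """Stand-in for the Transformer in trans_model.py: embed, average, classify."""
    def __init__(self, vocab_size=5000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.fc = nn.Linear(dim, 1)

    def forward(self, x):
        return self.fc(self.emb(x).mean(dim=1)).squeeze(-1)

# Padded token arrays, one row per sequence (paths are illustrative).
pos = torch.from_numpy(np.load("train_pos_data.npy")).long()
neg = torch.from_numpy(np.load("train_neg_data.npy")).long()
x = torch.cat([pos, neg])
y = torch.cat([torch.ones(len(pos)), torch.zeros(len(neg))])  # 1 = pos, 0 = neg

loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

model = TinyClassifier()
criterion = nn.BCEWithLogitsLoss()          # binary loss; swap for your task
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(10):
    for batch_x, batch_y in loader:
        loss = criterion(model(batch_x), batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```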

Best, Xubo

a-piece-of-teemo commented 6 months ago

Dear Xubo, I am delighted to have your assistance. However, I am encountering an issue with the line 'from model import Transformer' in the file named BPE_Transformer.txt: I am unable to import this module. Could you please clarify what 'model' refers to in this file? I am eagerly looking forward to your help.

HubertTang commented 6 months ago

The 'model' module is trans_model.py.

a-piece-of-teemo commented 6 months ago

Dear Xubo, Sorry to bother you again, but I have a question about the BPE file. For the 'train_fasta', should I be using plasmid data for training? Is it necessary to use chromosome sequences for training here? From your paper, it appears that you used sampled sequences to train the nt-bpe model. Does this imply that I should use both the complete sequences and the sampled sequences as 'train_fasta'? I am grateful for all the assistance you have provided me so far, and I look forward to your guidance on this matter.

HubertTang commented 6 months ago

Hi,

For the 'train_fasta', should I be using plasmid data for training? Is it necessary to use chromosome sequences for training here?

I only used plasmid data for training.

Does this imply that I should use both the complete sequences and the sampled sequences as 'train_fasta'?

No, only use complete sequences to train the BPE model. The sampled subsequences are used to train the Transformer model.

Best, Xubo

a-piece-of-teemo commented 6 months ago
  1. The input files for BPE_Transformer should be in numpy array format. Therefore, after obtaining the tokenized sentence file using BPE (in text format), you need to save it as a numpy array file. This numpy array file can then be loaded by BPE_Transformer. Each row in the numpy array file represents a 'sentence' and should be padded with zeros.

Hello, regarding the conversion to numpy and padding with zeros, I would like to ask whether each sequence is padded to a fixed length or to the length of the longest row in the array.

HubertTang commented 6 months ago

Hi,

The padding length is based on the length of the sentence input to the Transformer. For example, if you set the Transformer input length to 300 and your tokenized sentence has 200 tokens, you need to pad 100 zeros after the sentence.
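In code, the conversion from the tokenized text file to a zero-padded numpy array could look like the sketch below; MAX_LEN and the file names are illustrative, and MAX_LEN must equal the Transformer input length.

```python
import numpy as np

MAX_LEN = 300  # the Transformer input length from the example above

def pad_to(ids, max_len=MAX_LEN, pad_id=0):
    """Truncate or zero-pad one tokenized sentence to the model input length."""
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

# e.g. a tokenized sentence of 200 ids gets 100 zeros appended
with open("tokenized.txt") as f:
    rows = [pad_to([int(t) for t in line.split()]) for line in f]
np.save("train_pos_data.npy", np.array(rows, dtype=np.int64))
```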

Best, Xubo

a-piece-of-teemo commented 6 months ago

Hello,

Sorry to bother you, I have some questions. This is my understanding of the steps for using the nt-bpe model for classification; please point out any mistakes I might have made:

1. Generate an nt-bpe vocabulary using the complete plasmid sequences.
2. Cut the input training sequences into 1500bp segments, generate tokens using the nt-bpe model, and input them into the transformer model.

I have a few questions. First, in the second step, is it necessary to use a sliding window to sample the training sequence for training, or should I just use a fixed 1500bp? Second, after the model is trained, when testing the effect with new sequences, should my input sequence be complete or should it be divided into 1500bp segments? Third, I noticed that when you use the nt-bpe model, you also use aa-bpe, aa, and nt-bpe majority voting for classification. I would like to know if using only the nt-bpe model would significantly affect the results?

I am eagerly looking forward to your help.

HubertTang commented 6 months ago

Hi,

1. Generate an nt-bpe vocabulary using the complete plasmid sequences. 2. Cut the input training sequences into 1500bp segments, generate tokens using the nt-bpe model, and input them into the transformer model.

The training steps are correct.

First, in the second step, is it necessary to use a sliding window to sample the training sequence for training, or should I just use a fixed 1500bp?

That is up to you; you can try different sampling methods. I usually use a sliding window to ensure the sampling is unbiased.

Second, after the model is trained, when testing the effect with new sequences, should my input sequence be complete or should it be divided into 1500bp segments?

The sequences should be divided into fixed-length segments because the input length of the Transformer is fixed. Only if your model can handle long sequences do you not need to divide them.

Third, I noticed that when you use the nt-bpe model, you also use aa-bpe, aa, and nt-bpe majority voting for classification. I would like to know if using only the nt-bpe model would significantly affect the results?

The performance of these three models is very similar. Furthermore, I applied majority voting separately for each model, rather than using majority voting on the results of all three models collectively.

Best, Xubo

a-piece-of-teemo commented 6 months ago

Hello, Sorry for troubling you again. In your paper, you mentioned that the training of the PC model involved using both sampled sequences and original sequences. I would like to know if the use of the nt-BPE model still requires the training with original sequences, or if only the sampled sequences are needed? Additionally, regarding the weight setting issue you mentioned in your paper to avoid the influence of data imbalance, I did not find it in the code. If I have overlooked it, please let me know where it is. Thank you for your assistance and I am looking forward to your reply.

HubertTang commented 6 months ago

Hi,

Sorry for the late reply.

I would like to know if the use of the nt-BPE model still requires the training with original sequences, or if only the sampled sequences are needed?

If the input sequences are longer than the Transformer input length, you can only use the sampled sequences. Otherwise, you can use both the original and the sampled sequences.

Additionally, regarding the weight setting issue you mentioned in your paper to avoid the influence of data imbalance, I did not find it in the code. If I have overlooked it, please let me know where it is.

I didn't set the weights in the example training script because some users may not want to use weights during training. It's easy to set them; you can refer to the documentation here (https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html).
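For example, following that page, a weighted loss might look like the sketch below; the sample counts and the choice of ratio are illustrative, and the exact weighting is a modelling decision.

```python
import torch
from torch import nn

n_pos, n_neg = 2_000, 10_000                 # illustrative class sizes
pos_weight = torch.tensor([n_neg / n_pos])   # up-weight the minority (positive) class

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8)                      # model outputs for a batch
labels = torch.randint(0, 2, (8,)).float()
loss = criterion(logits, labels)
```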

a-piece-of-teemo commented 6 months ago

Hello,

I apologize for disturbing you, but there is a question that has troubled me for a long time, and I have to consult you about it.

Regarding the nt-bpe model you provided, I would like to know how the input sequences are sampled during model training. I cannot understand how to split the input sequences into multiple segments, encode them with the nt-bpe model, and then use them to train the Transformer model while still keeping the labels consistent with the sequences. Moreover, after training the Transformer model, how do you split new test data into segments for classification, and how do you ensure that the per-segment results are consistent with the combined result? This issue has troubled me for a long time. I would like to ask how you did it, and if possible, could you provide the code you used to handle it?

Additionally, do you have any code related to training scripts for models based on AA and aa-bpe? If so, could you share the relevant scripts?

I deeply apologize for the interruption and hope to receive your assistance.

HubertTang commented 6 months ago

Hi,

Here, I provide a simple example to demonstrate the process of training and predicting.

Here are the tokenized datasets, where each character represents a word:

Training: DAHFLSJLDJLFKJ
Testing: QWEJRHWHER

Assuming the Transformer input length is set to 5, we first need to divide the training sentence into shorter sub-sentences. We can use a 5-word window with a stride of 3 to sample the training sentence and obtain the training data: DAHFL, FLSJL, JLDJL, JLFKJ. Then you use these sampled sub-sentences to train the model.

After we obtain the trained model, we find the testing sentence is too long to predict directly. We therefore use the same window to sample the testing sentence and obtain the sub-sentences: QWEJR, JRHWH, HWHER. Next, we make a prediction for each sub-sentence and get P(QWEJR) = 0, P(JRHWH) = 0, and P(HWHER) = 1. Since the majority of sub-sentences are predicted as 0, by the majority-voting principle we consider the label of the testing data to be 0.
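A minimal sketch of that windowing-plus-voting logic (the window size, stride, and the `predict_fn` placeholder mirror the toy example above; this is not the repository's actual code):

```python
from collections import Counter

def sample_windows(tokens, window=5, stride=3):
    """Slide a fixed-size window over a token list; anchor the last window
    at the end so the tail of the sequence is still covered."""
    if len(tokens) <= window:
        return [tokens]
    subs = [tokens[i:i + window] for i in range(0, len(tokens) - window + 1, stride)]
    if (len(tokens) - window) % stride != 0:
        subs.append(tokens[-window:])
    return subs

def predict_by_voting(tokens, predict_fn, window=5, stride=3):
    """Predict each sub-sentence with predict_fn (standing in for the trained
    Transformer) and return the majority label."""
    votes = [predict_fn(sub) for sub in sample_windows(tokens, window, stride)]
    return Counter(votes).most_common(1)[0][0]

# Toy run on the testing sentence above: two sub-sentences vote 0, one votes 1,
# so the whole sequence is labelled 0.
toy = {"QWEJR": 0, "JRHWH": 0, "HWHER": 1}
print(predict_by_voting(list("QWEJRHWHER"), lambda s: toy["".join(s)]))
```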

Here I drew a plot to help you understand the tokenization and prediction using nt-BPE, aa-BPE, and aa: [figure: Prediction]

I don't know if the above explanation helps you understand the training process; feel free to ask if you still have questions. If you want to train the aa-BPE model, you can input the proteins into BPE.txt, and the training process is very similar to nt-BPE. The difference is that the nt-BPE model makes predictions for contigs using sub-sequences, while the aa-BPE model makes predictions using the encoded proteins. The aa-based model simply uses a single amino acid as the token; you can use an amino acid vocabulary in BPE_Transformer.txt to train the aa-based model.
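For the aa-based model, "single amino acid as the token" just means a fixed per-residue vocabulary; a minimal sketch (the padding id and maximum length are illustrative):

```python
# One token id per amino acid; 0 is reserved for padding.
AA_VOCAB = {aa: i + 1 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

def encode_protein(seq, max_len=300):
    """Map a protein to token ids amino acid by amino acid, then zero-pad.
    Unknown characters fall back to the padding id in this sketch."""
    ids = [AA_VOCAB.get(aa, 0) for aa in seq.upper()][:max_len]
    return ids + [0] * (max_len - len(ids))

print(encode_protein("MKTAYIAK")[:10])  # -> first 10 token ids
```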

Best, Xubo

a-piece-of-teemo commented 6 months ago

Hello, Through your explanation, I have clearly understood the principles involved. However, when I searched online for related code, I couldn't find anything relevant and suitable. Would it be convenient for you to share the code you used regarding sliding window sampling and voting? Thank you for your assistance and I am looking forward to your reply.

HubertTang commented 5 months ago

Hi,

Because the final version of PLASMe only uses the PC-based models, I didn't upload the code related to the other types of models. It has been too long and I forgot where I put the relevant code, so I rewrote the code related to sampling and voting; you can refer to https://portland-my.sharepoint.com/:u:/g/personal/xubotang2-c_my_cityu_edu_hk/EWM-XxsjChFJr0SdzPrksvQBQGvdxFiuTpwKUFKbgYbHGQ?e=HOjZw7.

a-piece-of-teemo commented 5 months ago

Hello, I am reaching out because I intend to modify the model based on the PC tokenizer, using your model as a foundation, to create a model based on the nt-bpe tokenizer that fulfills my requirements. Throughout our communication, I have received a great deal of assistance from you. However, my abilities are limited, and I feel that I am unable to complete my objective on time, so I am seeking your help. I would like to ask whether there is a Transformer model for classification that uses the nt-bpe tokenizer, and whether any more complete scripts are available. Given my current progress, I feel that completing my task will be very difficult, and therefore I hope to receive your assistance. Thank you for your help, and I am looking forward to your reply.