The datasets regarding how to train models

ZoeLct commented 9 months ago

Hello author, I would like to train models using a custom protein database. Apart from the plasmid dataset, what datasets need to be prepared? Could you tell me which datasets you used in your work?

HubertTang commented 9 months ago

Hi ZoeLct,

Since my tool is used to identify plasmids, my dataset contains plasmids and chromosomes, the former being positive samples and the latter being negative samples.

Best, Xubo

ZoeLct commented 9 months ago

Hello Xubo, I want to quickly train a model on my own dataset, but a few things are unclear to me and I hope you can help. In the file train_pc_model.py, there are some paths that I don't quite understand:

ref_dir = f"path/to/ref_dir" # the directory for storing references
ref_proteins = f"path/to/ref_protein.faa" # the path of reference proteins

What kind of files does ref_dir store? How is this ref_protein.faa generated? Is it generated by using Prodigal to generate .faa files from all plasmids in PLSDB?

val_pos_path = f"path/to/val_pos.fna" # the path of positive validation set
val_pos_data_dir = f"path/to/val_pos" # the directory for storing positive validation data
val_neg_path = f"path/to/val_neg.fna" # the path of negative validation set
val_neg_data_dir = f"path/to/val_neg" # the directory for storing negative validation data

In addition, a validation set is used here. Since you did not mention the validation set in your paper, only the training set and test set, I don't quite understand how this validation set should be divided?

model_path = f"path/to/model.pt" # the path of trained model

Is the model here referring to the file trans_model.py? Why is it .pt here, can you tell me?

test_path = f"path/to/test.fna" # the path of testing set
test_data_dir = f"path/to/test" # the directory for storing testing data

Why doesn't test.fna divide positive and negative samples?

I realize I've posed quite a few questions, but I would greatly appreciate any insights you could provide.

HubertTang commented 9 months ago

Hi ZoeLct,

I have just revised the part of the script regarding setting parameters and paths to make it clearer and more precise, please take a look.

What kind of files does ref_dir store?

This folder is designated for storing intermediate files generated by the script related to the reference database. You no longer need to set it manually.

How is this ref_protein.faa generated? Is it generated by using Prodigal to generate .faa files from all plasmids in PLSDB?

Yes.
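As a rough illustration of that step (a sketch only, not the exact commands used to build PLASMe's database), Prodigal can be called from Python to translate a plasmid FASTA file into a protein .faa file. The file names below are placeholders, and the choice of `-p meta` is an assumption made here because the input is a collection of short, unrelated sequences; the helper names are hypothetical.

```python
import subprocess


def prodigal_command(plasmid_fasta, protein_faa):
    """Build the Prodigal command line as a list of arguments.

    -i : input nucleotide FASTA (e.g. plasmids downloaded from PLSDB)
    -a : output file for the translated proteins (.faa)
    -p : procedure; "meta" is an assumption for this sketch
    """
    return ["prodigal", "-i", str(plasmid_fasta), "-a", str(protein_faa), "-p", "meta"]


def predict_proteins(plasmid_fasta, protein_faa):
    """Run Prodigal; requires the `prodigal` executable on PATH."""
    # Gene coordinates go to stdout when -o is not given, so discard them here.
    subprocess.run(
        prodigal_command(plasmid_fasta, protein_faa),
        check=True,
        stdout=subprocess.DEVNULL,
    )


# Example call (placeholder paths):
# predict_proteins("path/to/plsdb_plasmids.fna", "path/to/ref_protein.faa")
```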

Since you did not mention the validation set in your paper, only the training set and test set, I don't quite understand how this validation set should be divided?

The validation set option is provided as a convenience for users who wish to use a validation set to guard against overfitting. Initially, during my model training, I allocated 20% of the training data as the validation set. Later, however, I found that because the training set contains a substantial amount of oversampled data, training for 2 or 3 epochs was enough to obtain a well-performing model without overfitting. Consequently, for the final model I trained for 2 or 3 epochs and did not use a validation set.
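For anyone who does want a validation set, a 20% hold-out could be sketched roughly as follows. This is an illustrative helper, not part of train_pc_model.py: the function names, the fixed seed, and the assumption that the split happens before any oversampling (so that duplicated sequences do not end up on both sides) are all choices made for this example.

```python
import random


def split_train_val(records, val_fraction=0.2, seed=0):
    """Shuffle (header, sequence) records and hold out a fraction for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]  # (train, validation)


def write_fasta(records, path):
    """Write (header, sequence) tuples to a FASTA file."""
    with open(path, "w") as handle:
        for header, sequence in records:
            handle.write(f">{header}\n{sequence}\n")


# Example with toy records; real input would be the positive or negative training set.
toy_records = [(f"seq{i}", "ACGT" * (i + 1)) for i in range(10)]
train_records, val_records = split_train_val(toy_records)
```

The same split would be applied separately to the positive and negative sets, and the two validation parts written to the val_pos.fna and val_neg.fna paths.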

Is the model here referring to the file trans_model.py? Why is it .pt here, can you tell me?

It doesn't refer to trans_model.py. model_path is the path where the trained model is saved.

Why doesn't test.fna divide positive and negative samples?

Here I just want to show how to make a prediction using your own testing data. You can add a function here to evaluate the performance if you know the ground truth.
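Such an evaluation function might look like the sketch below. It assumes the predictions and the ground truth have each been reduced to a set of sequence IDs labelled as plasmid; how those IDs are read from the prediction output is left out, and the function name is hypothetical.

```python
def evaluate(predicted_plasmids, true_plasmids):
    """Compute precision, recall and F1 for the plasmid (positive) class.

    Both arguments are sets of sequence IDs: those predicted as plasmid
    and those known to be plasmid.
    """
    tp = len(predicted_plasmids & true_plasmids)  # plasmids correctly found
    fp = len(predicted_plasmids - true_plasmids)  # chromosomes called plasmid
    fn = len(true_plasmids - predicted_plasmids)  # plasmids that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: two of three predictions are correct, one true plasmid is missed.
metrics = evaluate({"a", "b", "c"}, {"a", "b", "d"})
```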

Best, Xubo

ZoeLct commented 8 months ago

Hello Xubo, After I have run the file 'train_pc_model.py', can I directly use the command 'python PLASMe.py' for the classification task, or is there any other operation needed? Additionally, if I wish to make modifications to the transformer model, is it correct that after making the changes, I should run 'train_pc_model.py' again, and then use the command 'python PLASMe.py' to classify using the modified model?

HubertTang commented 8 months ago

Hi ZoeLct,

You cannot run PLASMe.py directly after retraining your custom model. PLASMe does not currently support custom databases. This is because custom databases require redefining many parameters, such as the number of protein clusters (PCs), the possible regions shared between the reference plasmids and chromosomes, different thresholds for the alignment of different PCs, plasmid taxonomy classification, and so on. train_pc_model.py is just a demonstration for interested readers of how to train a PC-based Transformer model from scratch; it is not meant for retraining PLASMe's database. I will organize and upload the relevant code when I have time in the future. If you need the database updated, please let me know, and I will update it as soon as possible.

Best, Xubo

ZoeLct commented 8 months ago

Hello Xubo, I am wondering whether it is possible to modify only the transformer model and retrain it using your existing database (DB), without changing the database itself. If so, how should I train the model and then use PLASMe.py? Looking forward to your reply.

HubertTang commented 7 months ago

Hi ZoeLct,

Sorry for the late reply. You can use the training script to train models for each order separately. After that, replace the old models in the database with the newly trained ones. Then, you can use PLASMe.py to make predictions with your newly trained models.
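The replacement step could be scripted along these lines. This is only a sketch: the layout of the database directory and the model file names are not described here, so every path in the example is a placeholder, and the helper name is hypothetical. Keeping a backup of each old model makes it easy to switch back.

```python
import shutil
from pathlib import Path


def replace_model(new_model, old_model, backup_suffix=".bak"):
    """Copy a newly trained model over an existing one, keeping a backup.

    new_model : path of the model produced by the training script
    old_model : path of the model file inside the database to be replaced
    """
    new_model, old_model = Path(new_model), Path(old_model)
    if old_model.exists():
        # Keep the original so the change can be reverted.
        backup = old_model.with_name(old_model.name + backup_suffix)
        shutil.copy2(old_model, backup)
    shutil.copy2(new_model, old_model)
    return old_model


# Example call (placeholder paths, one call per order-specific model):
# replace_model("path/to/model.pt", "path/to/DB/old_model.pt")
```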

Best, Xubo

ZoeLct commented 7 months ago

Hello Xubo,

I'm delighted to have received your response.

I would like to ask whether I could use the .p2a files from your database and skip the code segment in the train_pc_model script that generates p2a files, so that I can use your train_pc_model script directly to train a new transformer model. Would such an approach be feasible? Are there any modifications that need to be made? If possible, I would like to train a new transformer model as soon as possible.

Additionally, I'm somewhat unclear about what you mean by "for each order separately." I've noticed that you have used different models, but I'm uncertain about the distinctions in the training processes for these models. If a unified model were to be used for classification, would there be a significant difference in performance?

Looking forward to your reply.

HubertTang / PLASMe
