Some questions about your tools and paper

basehc commented 5 months ago

Dear Tang,

Thank you very much for your work. I am particularly interested in the training and testing details of your study, as presented in your paper, which I have read carefully. I have some questions, and I hope you can provide answers. Thank you!

I would like to know how many classes your model has. For instance, in the sections 'Performance on the real metagenomic and plasmidome datasets,' 'Performance on the simulated CAMI2 metagenomic datasets,' 'Performance on the contig test set,' and 'Performance on the PLSDB test set,' it appears you implemented a binary classification model (differentiating plasmids from chromosomes). It seems there is no order-specific model here?
How do you view the relative contributions of the BLASTn step and the Transformer to performance on the real metagenomic and plasmidome datasets, the simulated CAMI2 metagenomic datasets, the contig test set, and the PLSDB test set in your main paper?
In your supplementary material, Figure S3 displays the results of the order-specific Transformer model. If the positive samples represent a single order, what constitutes the negative samples? Do they consist of chromosomes and plasmids from other orders? How did you handle the class imbalance in this situation?
In the supplementary section 'Comparing the performance and computational costs between order-specific and unified models,' could you elaborate on the unified models? They are not explained in the main paper. How many classes does the unified model differentiate, and what are they?

basehc commented 5 months ago

I am also confused about the use of BLASTn. (1) I see that your tool incorporates BLASTn. However, the article at https://academic.oup.com/nar/article/51/15/e83/7222081?login=true#414925307 states, "First, all training sequences that are identical to the test sequences are removed. To get the labels of the assembled contigs, we aligned the contigs to the references provided by CAMI2 using BLASTN." So BLASTn was used to label the sequences before your tool, which itself includes BLASTn, was used for prediction. This may seem somewhat unfair. (2) The same confusion applies to Figure 6, where you again compare against BLASTn even though your tool includes BLASTn.

HubertTang commented 4 months ago

Hi basehc,

Thank you for your interest in PLASMe. Below are the answers to your questions.

Each Transformer model is a binary classifier that distinguishes plasmids from chromosomes. Except where the unified model is explicitly mentioned, all sections use the order-specific models.
Our tool consists of two steps. First, we use BLASTn to align sequences against the plasmid database and label the sequences that meet the threshold criteria as plasmids. We typically set a high threshold for BLASTn (default: identity 90%, coverage 90%). The remaining sequences that do not meet the threshold are then passed to the Transformer for prediction. Therefore, if most plasmids in the test data are very similar to sequences in the reference dataset, most of them will be predicted by BLASTn, and BLASTn contributes more. Conversely, if the plasmids in the test data differ from the reference dataset, the prediction relies more on the Transformer.
The negative samples are chromosomes and do not include plasmids from other orders. To address the class imbalance, we assigned a greater weight to the smaller class in the loss function during training.
The unified model refers to training a Transformer using plasmids and chromosomes from all orders. Hence, the key distinction from the order-specific model is that it does not require differentiating the sequences based on their orders. It's still a binary classification model, with plasmids as the positive class and chromosomes as the negative class.
Regarding your concerns about the use of BLASTn: in the CAMI dataset, BLASTn was used to assign labels to contigs with unknown labels. However, we only assigned a label when both the identity and the coverage of the contig's alignment to the reference exceeded 80%. Our tool, PLASMe, also includes BLASTn, but as described in the paper, the identity and coverage thresholds for PLASMe's BLASTn step are set at 90%. This means that only the test sequences that are very similar to plasmids in the reference will be predicted by PLASMe's BLASTn step, while the remaining sequences will still be predicted by the Transformer model. In the experiment presented in Figure 6, we aimed to demonstrate the limited performance of BLASTn in PLASMe on this dataset, thereby highlighting the superior performance of the Transformer model.
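
As a rough illustration of the two mechanisms described above (BLASTn-first routing with 90% identity and coverage thresholds, and a class-weighted binary loss), here is a minimal Python sketch. It is not PLASMe's actual code. The names `BlastHit`, `route_contig`, and `weighted_bce` are hypothetical, the use of `>=` at the threshold boundary is an assumption, and the class weights are placeholder values, since the answers above do not state which class is smaller or what weights were used.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BlastHit:
    """Best BLASTn hit for one query sequence (hypothetical container)."""
    identity: float  # percent identity, 0-100
    coverage: float  # percent of the query covered, 0-100


def route_contig(hit: Optional[BlastHit],
                 min_identity: float = 90.0,
                 min_coverage: float = 90.0) -> str:
    """Decide which step predicts a sequence.

    Returns "blastn" when the best hit passes both thresholds, so the
    sequence is called a plasmid by alignment alone; otherwise returns
    "transformer". Whether the boundary is inclusive is an assumption.
    """
    if (hit is not None
            and hit.identity >= min_identity
            and hit.coverage >= min_coverage):
        return "blastn"
    return "transformer"


def weighted_bce(p: float, label: int, class_weights: Dict[int, float]) -> float:
    """Binary cross-entropy for one sample, scaled by its class weight.

    p is the predicted probability of the positive class (plasmid).
    A larger weight on the smaller class makes its errors cost more.
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    if label == 1:
        loss = -math.log(p)
    else:
        loss = -math.log(1.0 - p)
    return class_weights[label] * loss


# Placeholder weights: here class 1 is treated as the smaller class.
example_weights = {0: 1.0, 1: 3.0}
```

With these definitions, a contig whose best hit has 95% identity but only 85% coverage is routed to the Transformer, which matches the description that only sequences very similar to reference plasmids are settled by BLASTn.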

Feel free to ask if you still have questions.

Best, Xubo

HubertTang / PLASMe
