Open Zick007 opened 1 year ago
Hi Zick,
Sorry for the late reply. Thank you for providing the test files. Here, I will answer your questions one by one.
Are the sequences in the output file plasmid sequences? Yes, all the output sequences are predicted plasmids.
If so, they aligned to which plasmid from the DB? Some sequences are predicted based on the alignment result, and some are predicted by the Transformer. Later, I will add a function that generates a detailed description of the output, including the ID of the aligned plasmid and the prediction score of each contig predicted as a plasmid.
When I Blast these sequences, I obtain "genome" sequences and not "plasmid" of any kind... Although I haven't set a specific length threshold for the input, I highly recommend inputting contigs longer than 1 kb, as they yield more reliable prediction results. One reason is that plasmids and chromosomes share high-similarity regions. If a test contig originates from such a region and is also very short, distinguishing whether it belongs to a plasmid or a chromosome becomes exceedingly challenging. Additionally, a short contig may encode only one protein or none at all, which makes prediction difficult for the protein-token-based Transformer model. In the provided file, there are three contigs longer than 1 kb: contig39, contig40, and contig41. I aligned these contigs to the plasmid databases, and the following are the BLASTN alignment results (blastn.csv is the complete alignment result):
contig39 NZ_LR135270.1 95.403 1936 75 11 1 1928 447557 445628 0.0 3070
contig40 NZ_CP091218.1 99.075 1514 14 0 1 1514 124177 125690 0.0 2719
contig41 NZ_CP116511.1 95.992 1472 58 1 1 1471 305881 307352 0.0 2390
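For anyone reading the three rows above, here is a minimal Python sketch that labels each value. It assumes the columns follow BLAST's default tabular order (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); that is inferred from the values and not stated in the reply. The `parse_hits` helper and the `strand` field are illustrative names, not part of the tool.

```python
# Assumed column order: BLAST's default tabular layout (-outfmt 6).
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")

# The three rows quoted in this thread, copied verbatim.
ROWS = """\
contig39 NZ_LR135270.1 95.403 1936 75 11 1 1928 447557 445628 0.0 3070
contig40 NZ_CP091218.1 99.075 1514 14 0 1 1514 124177 125690 0.0 2719
contig41 NZ_CP116511.1 95.992 1472 58 1 1 1471 305881 307352 0.0 2390
"""


def parse_hits(text):
    """Turn whitespace-separated BLAST rows into a list of dicts."""
    hits = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip blank or malformed lines
        hit = dict(zip(FIELDS, parts))
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        # sstart > send means the hit lies on the subject's reverse strand.
        hit["strand"] = "plus" if hit["sstart"] <= hit["send"] else "minus"
        hits.append(hit)
    return hits


hits = parse_hits(ROWS)
for h in hits:
    print(h["qseqid"], h["sseqid"], h["pident"], h["length"], h["strand"], h["bitscore"])
```

Under that assumption, each contig aligns over nearly its full reported query range, and contig39 matches the reverse strand of its subject (its subject start is larger than its subject end).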
I also aligned these contigs to the 'nt' database, and interestingly, some of them showed higher bit-scores when aligned to chromosomes. This observation suggests that these contigs might originate from transposons shared by both plasmids and chromosomes. Thank you for bringing this issue to my attention; I realize the need to provide more detailed output to explain such results to users.
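On the length recommendation above, one way to pre-filter an assembly before running the tool is sketched below. It is plain Python with no dependencies, it takes "1K" to mean 1,000 bp, and the function names (`read_fasta`, `filter_by_length`) and the demo contigs are hypothetical; none of this is part of the tool itself.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_length(records, min_len=1000):
    """Keep only records whose sequence is at least min_len bases long."""
    return [(name, seq) for name, seq in records if len(seq) >= min_len]


# Made-up demo input: one 200 bp contig and one 1,200 bp contig.
demo = [
    ">short_contig",
    "ACGT" * 50,
    ">long_contig",
    "ACGT" * 150,
    "ACGT" * 150,
]
kept = filter_by_length(read_fasta(demo))
for name, seq in kept:
    print(name, len(seq))
```

For a real file, `read_fasta(open(path))` would work the same way, since a file handle is also an iterable of lines.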
I also have different overlap region values for the same contigs exactly like the previous issue listed here. Is that normal? No, it's a bug that should be fixed. Thanks for pointing it out; I will check the scripts.
Are those proteins from plasmid contigs and where are those proteins listed? Yes, the predicted proteins are from the plasmid contigs. In the next version, I will provide the predicted proteins for users.
Please let me know if you have any other questions. I will update the scripts and fix the bugs ASAP and let you know after I finish.
Best
Hi!
Thank you for developing such an interesting tool to identify plasmids. I used the example settings to perform a first analysis of my assembled contigs FASTA file, and I took a look at the generated output file, which looks pretty similar to the ones in the examples. However, I would like to fully understand the output.
Are the sequences in the output file plasmid sequences? If so, they aligned to which plasmid from the DB? When I Blast these sequences, I obtain "genome" sequences and not "plasmid" of any kind...
I also have different overlap region values for the same contigs exactly like the previous issue listed here. Is that normal? Finally, one of the steps mentions predicting proteins from contigs. Are those proteins from plasmid contigs and where are those proteins listed?
I have copied my output file to this post. Thank you a lot in advance for your reply and for this new tool (:
test_plasme2023.txt