HubertTang / PLASMe


Understanding the output #2

Open Zick007 opened 1 year ago

Zick007 commented 1 year ago

Hi!

Thank you for developing such an interesting tool to identify plasmids. I used the example's settings to run a first analysis on my assembled contigs FASTA file, and the generated output file looks very similar to the ones in the examples. However, I would like to fully understand the output.

Are the sequences in the output file plasmid sequences? If so, which plasmid from the DB did they align to? When I BLAST these sequences, I obtain "genome" sequences and not "plasmid" of any kind...

I also have different overlap region values for the same contigs, exactly like the previous issue listed here. Is that normal? Finally, one of the steps mentions predicting proteins from contigs. Are those proteins from plasmid contigs, and where are those proteins listed?

I have copied my output file to this post. Thank you a lot in advance for your reply and for this new tool (:

test_plasme2023.txt

HubertTang commented 11 months ago

Hi Zick,

Sorry for the late reply. Thank you for providing the test files. Here, I will answer your questions one by one.

  1. Are the sequences in the output file plasmid sequences? Yes, all the output sequences are predicted plasmids.

  2. If so, they aligned to which plasmid from the DB? Some sequences are predicted from the alignment results, and others are predicted by the Transformer model. I will later add a function that generates a detailed description of the output, including the ID of the aligned plasmid, the prediction score of each contig predicted as a plasmid, etc.

  3. When I Blast these sequences, I obtain "genome" sequences and not "plasmid" of any kind... Although I haven't set a specific length threshold for the input, I highly recommend inputting contigs longer than 1 kb, as they yield more reliable predictions. One reason is that plasmids and chromosomes share high-similarity regions. If a test contig originates from such a region and is also very short, it becomes extremely difficult to tell whether it belongs to a plasmid or a chromosome. In addition, short contigs may encode only one protein or none at all, which makes prediction difficult for the protein-token-based Transformer model. In the provided file, three contigs are longer than 1 kb: contig39, contig40, and contig41. I aligned these contigs to the plasmid databases, and the BLASTN alignment results are as follows (blastn.csv is the complete alignment result):

contig39    NZ_LR135270.1   95.403  1936    75  11  1   1928    447557  445628  0.0 3070
contig40    NZ_CP091218.1   99.075  1514    14  0   1   1514    124177  125690  0.0 2719
contig41    NZ_CP116511.1   95.992  1472    58  1   1   1471    305881  307352  0.0 2390
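For readers who want to inspect rows like these programmatically, here is a minimal Python sketch. It assumes the 12 columns follow BLAST's default tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); that matches the number of fields shown above but is not confirmed in this thread. The names `Hit`, `parse_hits`, and `best_hit_per_query` are hypothetical helpers and are not part of PLASMe.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumption: columns follow BLAST's default 12-field tabular layout.
FIELDS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)

# The three rows quoted in the reply above.
ROWS = """\
contig39    NZ_LR135270.1   95.403  1936    75  11  1   1928    447557  445628  0.0 3070
contig40    NZ_CP091218.1   99.075  1514    14  0   1   1514    124177  125690  0.0 2719
contig41    NZ_CP116511.1   95.992  1472    58  1   1   1471    305881  307352  0.0 2390
"""


@dataclass
class Hit:
    qseqid: str
    sseqid: str
    pident: float
    length: int
    qstart: int
    qend: int
    sstart: int
    send: int
    bitscore: float

    @property
    def strand(self) -> str:
        # In tabular BLASTN output, sstart > send indicates a minus-strand hit.
        return "minus" if self.sstart > self.send else "plus"

    @property
    def query_span(self) -> int:
        # Number of query bases covered by the alignment.
        return abs(self.qend - self.qstart) + 1


def parse_hits(text: str) -> List[Hit]:
    """Parse whitespace- or tab-separated tabular BLAST rows."""
    hits = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip blank or malformed lines
        row = dict(zip(FIELDS, parts))
        hits.append(Hit(
            qseqid=row["qseqid"], sseqid=row["sseqid"],
            pident=float(row["pident"]), length=int(row["length"]),
            qstart=int(row["qstart"]), qend=int(row["qend"]),
            sstart=int(row["sstart"]), send=int(row["send"]),
            bitscore=float(row["bitscore"]),
        ))
    return hits


def best_hit_per_query(hits: List[Hit]) -> Dict[str, Hit]:
    """Keep the highest bit-score hit for each query contig."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.qseqid not in best or hit.bitscore > best[hit.qseqid].bitscore:
            best[hit.qseqid] = hit
    return best


for contig, hit in best_hit_per_query(parse_hits(ROWS)).items():
    print(contig, hit.sseqid, hit.pident, hit.query_span, hit.strand)
```

Running the same parser on the results from a second database and comparing the per-contig bit-scores is one way to see whether a contig matches a chromosome better than a plasmid. If the complete blastn.csv is comma-separated, the `line.split()` call would need to become `line.split(",")`.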

I also aligned these contigs to the 'nt' database, and interestingly, some of them showed higher bit-scores when aligned to chromosomes. This suggests that these sequences might originate from transposons shared by both plasmids and chromosomes. Thank you for bringing this issue to my attention; I realize I need to provide more detailed output to explain such results to users.
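The recommendation to use contigs longer than 1 kb can be applied as a pre-filtering step before prediction. The following is a minimal sketch in plain Python, not part of PLASMe; the function names, the 1000 bp cutoff (kept inclusive here), and the file names in the usage note are all assumptions for illustration.

```python
from typing import Iterable, Iterator, TextIO, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Record], min_len: int = 1000) -> Iterator[Record]:
    """Keep records whose sequence has at least min_len bases (assumed cutoff)."""
    for header, seq in records:
        if len(seq) >= min_len:
            yield header, seq


def write_fasta(records: Iterable[Record], handle: TextIO, width: int = 60) -> None:
    """Write records as FASTA, wrapping sequences at `width` characters."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Synthetic demo input (not real data): one short and one long record.
demo = [">short_contig", "ACGT", ">long_contig", "A" * 600, "C" * 600]
kept = list(filter_by_length(read_fasta(demo)))
print([header for header, _ in kept])
```

For real data, the same functions can be chained over file handles, for example by opening a hypothetical `contigs.fasta` for reading and a hypothetical `contigs.min1kb.fasta` for writing, then passing the filtered file to the prediction step.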

  4. I also have different overlap region values for the same contigs exactly like the previous issue listed here. Is that normal? This is a bug and should be fixed. Thanks for pointing it out; I will check the scripts.

  5. Are those proteins from plasmid contigs and where are those proteins listed? Yes, the predicted proteins are from the plasmid contigs. In the next version, I will provide the predicted proteins to users.

Please let me know if you have any other questions. I will update the scripts and fix the bugs ASAP and let you know after I finish.

Best