clb21565 / mobileOG-db

code repo for mobileOG-db
GNU General Public License v3.0

Output of mobileOG #9

Closed asierFernandezP closed 1 year ago

asierFernandezP commented 1 year ago

Hi,

Thanks for the great database and the scripts provided.

I have a few general questions about the output files and recommendations on how to interpret them:

  1. First, would you recommend using the full database or only the version containing manually curated + homologue sequences? Is the classification of the remaining proteins (keyword data) reliable? I have tried your tool on a few contigs using the curated + homology DB, and some proteins are given NA as their mobileOG Category (even some manually curated sequences). Why does this happen?

  2. Also, I would like a more detailed explanation of the output files, e.g. contig_file_summary.csv: I do not completely understand this file. I would expect each row to correspond to one contig (although contig names are not displayed), but in my case there are fewer rows than contigs.

  3. From a list of potential phage/viral contigs, I am interested in determining which of these contigs could be potential plasmids and mobile elements to discard them, as I want to keep only phage sequences. Which annotations (or how many) should be present in a contig to confidently classify it as a mobile element or as a plasmid?

Thank you, Asier

clb21565 commented 1 year ago

Thanks for the awesome questions!

  1. I would recommend using the full database, especially if you are targeting phages, many of whose proteins are unlikely to be well characterized in the literature. Regarding mobileOG categories, each sequence has a major category and a minor category (hierarchical categorization). The major category should never ever be NA, so if you see that, that's an error, and please let us know. The minor category might be NA if there is no secondary functional classification ascribed to it. More info on the headers can be found here: https://fralinlifesci.vt.edu/content/dam/fralinlifesci_vt_edu/ciwars/mobileOG-db_UserGuidance_v1.6.pdf
  2. The summary file is a legacy output we've meant to remove or modify. It kind of slipped through the cracks, so thanks for reminding us. The most useful output to date should be the alignment summaries.
  3. Great question! I've found that contigs with multiple mobileOGs corresponding to the same element type are pretty good indicators of an MGE. So, finding a contig with more than one plasmid, IGE, or IS hit would help there.
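The rule of thumb in point 3 can be sketched in a few lines of Python. This is only an illustration, not part of the mobileOG-db pipeline: the function name `flag_mge_contigs`, the `(contig, element_type)` input pairs, and the element-type labels are assumptions, and you would need to parse those pairs out of your own alignment summaries.

```python
from collections import Counter, defaultdict


def flag_mge_contigs(hits, element_types=("plasmid", "IGE", "IS"), min_hits=2):
    """Return {contig: {element_type: count}} for contigs that carry at
    least `min_hits` hits of the same element type.

    `hits` is an iterable of (contig, element_type) pairs. This layout is
    hypothetical; adapt it to the columns in your own output files.
    """
    counts = defaultdict(Counter)
    for contig, element_type in hits:
        counts[contig][element_type] += 1

    flagged = {}
    for contig, per_type in counts.items():
        # Keep only the element types of interest that reach the threshold.
        supported = {
            etype: n
            for etype, n in per_type.items()
            if etype in element_types and n >= min_hits
        }
        if supported:
            flagged[contig] = supported
    return flagged


# Toy data, made up for illustration only.
hits = [
    ("contig_1", "plasmid"), ("contig_1", "plasmid"), ("contig_1", "phage"),
    ("contig_2", "phage"), ("contig_2", "phage"),
    ("contig_3", "IS"),
]
print(flag_mge_contigs(hits))  # {'contig_1': {'plasmid': 2}}
```

Here `min_hits=2` encodes "more than one hit"; raising it makes the call more conservative.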

For annotating phages, my recommendation would be to use a low identity threshold (e.g., --id 30), as phage proteins often show poor conservation at the amino acid level. Doing this, you should be able to find contigs where more than one phage hit has occurred. You could additionally filter contigs to remove those with hits to other element types as described above. The metadata linked here might be of help in this. See the pipeline homepage for more details.
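A rough sketch of that phage filter, under stated assumptions: the hits have already been reduced to `(contig, element_type)` pairs, the phage label is the string `"phage"`, and a contig is dropped when any other element type reaches two or more hits. None of these names or thresholds come from the pipeline itself; they are placeholders to adjust.

```python
from collections import Counter, defaultdict


def keep_phage_contigs(hits, min_phage_hits=2, other_hit_threshold=2,
                       phage_label="phage"):
    """Return a sorted list of contigs with at least `min_phage_hits` phage
    hits and no other element type reaching `other_hit_threshold` hits.

    `hits` is an iterable of (contig, element_type) pairs (hypothetical
    layout, to be derived from your own alignment summaries).
    """
    counts = defaultdict(Counter)
    for contig, element_type in hits:
        counts[contig][element_type] += 1

    kept = []
    for contig, per_type in counts.items():
        enough_phage = per_type[phage_label] >= min_phage_hits
        # True if a non-phage element type is supported by repeated hits.
        other_supported = any(
            n >= other_hit_threshold
            for etype, n in per_type.items()
            if etype != phage_label
        )
        if enough_phage and not other_supported:
            kept.append(contig)
    return sorted(kept)


# Toy data, made up for illustration only.
example_hits = [
    ("contig_1", "phage"), ("contig_1", "phage"),
    ("contig_1", "plasmid"), ("contig_1", "plasmid"),
    ("contig_2", "phage"), ("contig_2", "phage"),
    ("contig_3", "phage"),
    ("contig_4", "phage"), ("contig_4", "phage"), ("contig_4", "phage"),
    ("contig_4", "IS"),
]
print(keep_phage_contigs(example_hits))  # ['contig_2', 'contig_4']
```

Setting `other_hit_threshold=1` gives a stricter filter that discards any contig with even a single non-phage hit.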

Please do let me know if you have follow-up questions, or if we can provide any additional scripting. This will help us make these kinds of analyses more accessible.

Connor

asierFernandezP commented 1 year ago

Thanks for your help!

TomasaSbaffi commented 1 year ago

Hello, thanks for this amazing and useful tool! I just wanted to say that I wish these links were on the main page, the UsageGuidance.md.

More info on the headers can be found here: https://fralinlifesci.vt.edu/content/dam/fralinlifesci_vt_edu/ciwars/mobileOG-db_UserGuidance_v1.6.pdf

The metadata linked here might be of help in this. See the pipeline homepage for more details.

Thanks again for your help!