Open rajwanir opened 3 years ago
Hello, Rahim! Thanks for your feedback! Formally, yes: in your case the final score is this high because of only two serine monomers, since matches this good are not expected to occur by chance. The 1st and 4th rows of the PSSM do not contain any significant outliers, so the model treats all substrates at these positions as roughly equally likely, and they do not influence the final score dramatically. In general, the BioCAT final score was designed to be more sensitive than specific, so false-positive hits can occur. Intuitively, the final score is the probability that a given BGC is responsible for synthesizing a given NRP rather than not. In your case, with two strong matches and two weak mismatches, this probability is definitely above zero. We should mention that the A-domain specificity values predicted for ‘orn’ can be controversial in some cases, because this monomer was found almost exclusively in NRPs produced by Streptomyces; as a result, the profile HMM and the negative background for this monomer carry some bias from taxon-specific overfitting. Thus, if your organism of interest does not belong to Streptomyces, ‘orn’ specificity predictions might not be consistent.
If you want, you can additionally inspect the results in the following ways:
Thank you for using our program,
Danil

P.S. Thanks for your helpful suggestion; we will consider adding genome-predicted monomeric sequences to the result file in addition to the linearized rBAN predictions.
Hi,
Thanks for this amazing tool. I think it's pretty useful for searching for hypothesized non-ribosomal peptides.
I am a bit confused about how to interpret the results. The output has a column, 'Putative linearized NRP sequence', which corresponds to the linearized monomeric version of the query peptide derived from rBAN. Since the genome-predicted NRP may not exactly match this linearized query, it might be useful to include a column showing which genome-predicted NRP was used for the alignment. Below is an example output for your reference:
| Chromosome ID | Coordinates of cluster | Strand | Substance | BGC ID | Putative linearized NRP sequence | Biosynthesis profile | Sln score | Mln score | Sdn score | Mdn score | Sdt score | Mdt score | Slt score | Mlt score | Relative score | Binary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tig00000001 | [1456645:1572935] | - | NPA022110 | BGC_cand_3_1 | orn--ser--ser--orn--nan | Type A | 1 | 0.94 | 1 | 0.98 | 1 | 0.9 | 1 | 0.97 | 0.967 | 1 |
In the above example, the monomers predicted by antiSMASH/NRPSPredictor2 for this BGC (BGC_cand_3_1) were (ctg1_1495: X, ctg1_1499: X, ctg1_1500: ser, ctg1_1504: ser, ctg1_1507: X). The weak predictions for ctg1_1495, ctg1_1499 and ctg1_1507 were (asp, asn, glu, gln, aad), (hydrophobic-aliphatic) and (dhpg, hpg), respectively. I see that, based on two consecutive Ser in a peptide of 5 monomers, it is a decent match.
BioCAT PSSM output for the BGC_cand_3_1 is below:
| Module name | tyr | ala | dhpg | val | glu | orn | gly | hpg | asp | phe | gln | ser | ile | thr | bht | cys | leu | asn | pro |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| nrpspksdomains_ctg1_1507_AMP-binding.1 | 0.537259615 | 0.58808933 | 0.485008818 | 0.434500649 | 0.394242068 | 0.415716857 | 0.548672566 | 0.442642643 | 0.311602871 | 0.619939577 | 0.397076023 | 0.37653127 | 0.368231047 | 0.386792453 | 0.6039953 | 0.337223587 | 0.513598988 | 0.296454768 | 0.543891958 |
| nrpspksdomains_ctg1_1504_AMP-binding.1 | 0.037860577 | 0.035359801 | 0.032333921 | 0.031128405 | 0.055816686 | 0.047990402 | 0.041087231 | 0.03963964 | 0.042464115 | 0.04592145 | 0.161988304 | 0.754996776 | 0.029482551 | 0.041778976 | 0.034077556 | 0.028255528 | 0.044908286 | 0.034841076 | 0.042971148 |
| nrpspksdomains_ctg1_1500_AMP-binding.1 | 0.358173077 | 0.360421836 | 0.248677249 | 0.18612192 | 0.66039953 | 0.49430114 | 0.478508217 | 0.282282282 | 0.324760766 | 0.324471299 | 0.838011696 | 0.994197292 | 0.163056558 | 0.276280323 | 0.337250294 | 0.274570025 | 0.322580645 | 0.240831296 | 0.310006139 |
| nrpspksdomains_ctg1_1499_AMP-binding.1 | 0.022235577 | 0.022952854 | 0.037624927 | 0.022697795 | 0.022914219 | 0.024595081 | 0.024020228 | 0.025825826 | 0.020933014 | 0.023564955 | 0.022807018 | 0.026434558 | 0.022864019 | 0.024932615 | 0.02173913 | 0.022727273 | 0.024035421 | 0.023227384 | 0.023327195 |
| nrpspksdomains_ctg1_1495_AMP-binding.1 | 0.010817308 | 0.00248139 | 0.016460905 | 0.004539559 | 0.017626322 | 0.00239952 | 0.022123894 | 0.009009009 | 0.005382775 | 0.012084592 | 0.016374269 | 0.012250161 | 0.019253911 | 0.005390836 | 0.019388954 | 0.018427518 | 0.006957622 | 0.01405868 | 0.005524862 |
In the PSSM too, ctg1_1504 and ctg1_1500 have high scores for Ser, consistent with the linearized peptide and antiSMASH. Both the antiSMASH and BioCAT scores suggest Dhpg for ctg1_1499; however, one might expect Orn based on the query. All A-domains show very low scores for Orn.
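For anyone who wants to repeat this reading of the PSSM, here is a minimal Python sketch. It is not part of BioCAT: the helper names `parse_pssm` and `summarize` are made up for illustration, and it assumes the PSSM has been pasted as whitespace-separated text with the header written as `Module` plus the monomer names. For each module it reports the top-scoring monomer alongside the score of the monomer expected from the query (here `orn`).

```python
# PSSM values copied from the BioCAT output above (whitespace-separated).
PSSM_TEXT = """\
Module tyr ala dhpg val glu orn gly hpg asp phe gln ser ile thr bht cys leu asn pro
nrpspksdomains_ctg1_1507_AMP-binding.1 0.537259615 0.58808933 0.485008818 0.434500649 0.394242068 0.415716857 0.548672566 0.442642643 0.311602871 0.619939577 0.397076023 0.37653127 0.368231047 0.386792453 0.6039953 0.337223587 0.513598988 0.296454768 0.543891958
nrpspksdomains_ctg1_1504_AMP-binding.1 0.037860577 0.035359801 0.032333921 0.031128405 0.055816686 0.047990402 0.041087231 0.03963964 0.042464115 0.04592145 0.161988304 0.754996776 0.029482551 0.041778976 0.034077556 0.028255528 0.044908286 0.034841076 0.042971148
nrpspksdomains_ctg1_1500_AMP-binding.1 0.358173077 0.360421836 0.248677249 0.18612192 0.66039953 0.49430114 0.478508217 0.282282282 0.324760766 0.324471299 0.838011696 0.994197292 0.163056558 0.276280323 0.337250294 0.274570025 0.322580645 0.240831296 0.310006139
nrpspksdomains_ctg1_1499_AMP-binding.1 0.022235577 0.022952854 0.037624927 0.022697795 0.022914219 0.024595081 0.024020228 0.025825826 0.020933014 0.023564955 0.022807018 0.026434558 0.022864019 0.024932615 0.02173913 0.022727273 0.024035421 0.023227384 0.023327195
nrpspksdomains_ctg1_1495_AMP-binding.1 0.010817308 0.00248139 0.016460905 0.004539559 0.017626322 0.00239952 0.022123894 0.009009009 0.005382775 0.012084592 0.016374269 0.012250161 0.019253911 0.005390836 0.019388954 0.018427518 0.006957622 0.01405868 0.005524862
"""


def parse_pssm(text):
    """Return {module name: {monomer: score}} from whitespace-separated text."""
    rows = [line.split() for line in text.strip().splitlines()]
    monomers = rows[0][1:]  # first header cell is the module-name column
    return {row[0]: dict(zip(monomers, map(float, row[1:]))) for row in rows[1:]}


def summarize(pssm, query_monomer):
    """Return {module: (top monomer, top score, score of query_monomer)}."""
    summary = {}
    for module, scores in pssm.items():
        top = max(scores, key=scores.get)
        summary[module] = (top, scores[top], scores[query_monomer])
    return summary


if __name__ == "__main__":
    for module, (top, top_score, orn_score) in summarize(parse_pssm(PSSM_TEXT), "orn").items():
        print(f"{module}: top={top} ({top_score:.3f}), orn={orn_score:.3f}")
```

On the table above, this picks ser for ctg1_1504 and ctg1_1500 and dhpg for ctg1_1499, matching the reading given here; the top score by itself says nothing about how the final BioCAT score is computed.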
In this example, is the high relative score (0.967) likely driven by the positions of the two serines and the total number of monomers in the peptide, under-appreciating that only 2 out of 5 monomers match? Is there any other way you would suggest to inspect the results?
Thanks, Rahim.