Open shouldsee opened 7 years ago
I don't know the full answer but...
The crosshits protocol has been to do an all-vs-all PRC comparison of s-reps and then a SSAP of any pairs that get good-enough PRC result. Sure enough, though both 1vj0A02
and 1uwkA02
appear to have been included in the v4.1.0 lists, neither of these PRC results files makes mention of the other domain:
/cath/data/current/xhitsresults/v4_1_0/1vj0A02.PRC.results.xml
/cath/data/current/xhitsresults/v4_1_0/1uwkA02.PRC.results.xml
So the questions are:
We still have a few versions of the relevant models:
/cath/data/v4_0_0/hmm/1uwkA02.prc
/cath/data/current/hmm/1uwkA02.prc
/cath/data/v4_0_0/hmm/1vj0A02.prc
/cath/data/current/hmm/1vj0A02.prc
I think our PRC scans have often used commands like:
/opt/local/apps/linux/bin/prc/prc -algo viterbi some_prc_model.prc some_prc_library.lib
This all looks interesting @shouldsee - good work (and thanks for the info @tonyelewis)
For info, running PRC on the v4.0 models gives:
$ /opt/local/apps/linux/bin/prc/prc -algo viterbi /cath/data/v4_0_0/hmm/1uwkA02.prc /cath/data/v4_0_0/hmm/1vj0A02.prc
PRC 1.5.3 (PLAN9, SPACE5, ALL_TRANS), compiled on Mar 28 2007
Copyright (C) 2002-5 Martin Madera and MRC LMB, Cambridge, UK
Freely distributed under the GNU General Public License
Command : /opt/local/apps/linux/bin/prc/prc -algo viterbi /cath/data/v4_0_0/hmm/1uwkA02.prc /cath/data/v4_0_0/hmm/1vj0A02.prc
Algorithm : Viterbi
Match-match : dot2
Align mode : local-local
Alignments : none
Simple stop : 1.5
Max hits : 100
Started : Wed Sep 13 10:01:54 2017
# hmm1 start1 end1 length1 hit_no hmm2 start2 end2 length2 simple reverse
1uwkA02 25 90 208 1 1vj0A02 17 86 147 11.2 5.3
1uwkA02 177 201 208 2 1vj0A02 45 69 147 4.5 -1.5
1uwkA02 183 194 208 3 1vj0A02 30 41 147 3.7 -2.3
1uwkA02 32 71 208 4 1vj0A02 27 66 147 3.5 -2.4
1uwkA02 4 7 208 5 1vj0A02 22 25 147 3.5 -2.5
...
I'm not sure how these scores get processed by the xhits (i.e. cutoffs).
If we assume that this relationship wasn't in the xhits because it didn't meet some part of the criteria (rather than a bug), then it looks like this would be a candidate for a potential remote homology (which is precisely what ISS is trying to achieve).
ISS is meant to provide an alternative approach to homology detection - the objective is to identify homologous relationships that we haven't been able to spot with existing methods (i.e. strong links between domains currently in different superfamilies).
The question then is: how do we know that the putative relationship that you've pulled out is a "true" remote homologous relationship or a false positive? Ultimately - detective work (manual curation):
Generally speaking, we're looking for 2 of 3 of those criteria. @nataliedawson is our curator - she's busy this week but we can arrange a meeting next week to go through this if you're around.
Hi guys,
Thanks for your prompt responses!
I dug up a trac ticket 768 by @tonyelewis from 5 years ago. It seems SSAP=74.02 < 80, PRC_raw_score=11.2 probably translate to > 0.001, which might be causing the discarding. What do you think of this hypothesis @tonyelewis ?
........ We have come up with the following criteria for automatically assigning domain X to superfamily Y:
X has SSAP score >= 80% over an overlap >= 80% and a PRC evalue <= 0.001 with some s-rep in Y. X does not achieve this with an s-rep in any superfamily other than Y (perhaps to be safer, these cutoffs should be loosened a bit). Y is not in the list of xhits trouble-causers. To sanity-check this, we plotted SSAP scores vs PRC evalues for the data we've got for s-rep pairs in v3.5.0 (and where SSAP overlap >= 80%). I will attach the resulting graph (one version is zoomed in; the other is zoomed out). They look reassuring.
We eye-balled the false positives to check that they are xhits (and hence should be true positives). They also looked reassuring.
We're hoping to get moving on v3.6.0 ASAP so I'm hoping to implement this in time to run over the weekend.
@sillitoe Out of the three criteria you mentioned, I am particularly incapable in terms of identifying the "functional linkage" so probably would focus on the other two. ISS is ultimately exploiting the sequence information, whereas structural comparison would probably come in the validation process.
The result from my ISS searches still need a lot of sanitisation (most of them are nonsenses), thus I would probably go:
Lots to do! XD
I envision it taking the form of a binary to accept a hmm_profile, a hmm_lib and a sequence_lib, and output a hit list. The painful bit is to search hmm_lib against sequence_lib (using hmmsearch) and the result should probably be cached to hmmhits ("hmmserach hmm_lib sequence_lib > hmmhits") so that we can do " iss hmmprofile hmmhits > result " or "iss hmmprofile_hit hmmhits > result".
The implementation I current have is " iss hmmhits > result ", which is kind of weird because it lists every possible pair of hits within the hmm_lib, which I think is less intuitive to visualise, but runs faster than concatenating indivdiual query results.
BTW, I would really like to hear from @sillitoe your ideas on current documentation of master branch 0eb5b50d16a3b4131f0be289b31e5f41c6425c5a , which consists of old functionality mentioned in my last presentation.#
I am probably free next week. What time do you prefer? @sillitoe
Kind regards Feng
I dug up a trac ticket 768 by @tonyelewis from 5 years ago. It seems SSAP=74.02 < 80, PRC_raw_score=11.2 probably translate to > 0.001, which might be causing the discarding. What do you think of this hypothesis @tonyelewis ?
I think those are the criteria for post-scan-auto-assign. So if you're asking about post-scan-auto-assign, then: yes, given that the SSAP score fails the criteria then that would mean that the second domain that got assigned wouldn't have been post-scan-auto-assigned based on a match with the first (if that had been implemented at the time the second domain was being assessed).
If you're asking about crosshits, then: no, these criteria are irrelevant. These two were included in the v4.1 crosshits scans but had no PRC result, which means either that there was an error or that they result didn't meet the relevant threshold.
I think you may need to run PRC against a library to get an evalue.
@tonyelewis Can you plz show me the path to the PRC library? What do you think is the PRC evalue/pvalue threshold exactly?
@sillitoe I am using "http://update.cathdb.info/api/data/" to access "/cath/data/current/", but how can I access "/cath/data/v4_0_0/" to grab the old *.prc profiles?
There's no such thing as the PRC library, just libraries made of PRC models. A PRC library "lists model files (using absolute paths), one filename per line" (second graf in this PRC doc)
It might be pretty sensible to use something like /cath/data/v4_0_0/hmm/prc.lib
(which contains the two domains).
It looks like the evalue threshold is 0.01 :
cathdb_v4_1_0=> SELECT MAX( evalue ) FROM crosshits;
max
------
0.01
@tonyelewis It seems I cannot do "wget http://update.cathdb.info/api/data/hmm/prc.lib" or "wget http://update.cathdb.info/api/data/hmm/*.prc".
In this case, shall I grab PRC models for each S100 or each S35 or each existing domain? (To replicate the crosshit result)
Try : http://www.biochem.ucl.ac.uk/~ucbctnl/prcs.tar.gz
(please take a local copy because I'll delete that file at some point)
Copy taken! Many thanks! @tonyelewis
I tried the "comment" feature of crosshit and flagged this Xhit as chopping error
@shouldsee Thanks for your feedback on this chopping error.
The comment feature on the crosshits pages is typically only used during that particular round of crosshits so I doubt that a comment put on there will ever get read in the future.
Chopping errors are better documented on the ToBeRechopped page on the Trac Wiki (or emailed to one of us if you don't have access to that). To be useful, the comment needs to include which chain(s) you think need(s) rechopping, and any more information (eg what the current chopping problem is; why you think it's a problem; how it should be fixed etc).
@tonyelewis Thanks for the advice. I was quite amused by how it actually worked.
@sillitoe @tonyelewis I am pleased to say that the ISS-Xhit browser is on the edge of being finished. aka. The browsing functionality is nearly finished, whereas the data needs some improvement. (I need to rerun the ISS search with the hmm/seqeunces from v4_0_0 so as to make a meaningful comparison with the v4_0_0->v4_1_0 crosshit data, courtesy of @sillitoe . The current comparison is made between ISS-v4_2_0 against Xhit-v4_1_0 because I don't have Xhit-v4_2_0)
The important thing, is that ISS-v4_2_0 is flagging up many "novel" crosshits. ("novel" in that the corresponding crosshit v4_2_0 page like this is showing "There were 0 entries...." ) In other words, this novelty is only relative to the data from "xhits.cathdb.info/crosshits.php", so if this reference data is incomplete (which I am not sure of), then the novelty may not be significant.
One of the example is:
node1 | node2 | val1 | val2 | seqDB1 | seqDB2 | xhit_url |
---|---|---|---|---|---|---|
3.40.50.720.0 | 3.40.50.1100.0 | 1970 | 0 | CATH-S40-nr_Ver:4_2_0 | crosshit_Ver:4_1_0 | Crosshits_v4_1_0 |
-- | -- | -- | -- | -- | -- | -- |
For which ISS is detecting 1970 S35-S35 hits with acceptable quality, whereas nothing has been reported in the file handed over by Ian, or on the crosshit website. (Actually the v4_0_0 Xhit is reporting 2 entries which disappeared in newer versions). And the superpositions (3ip1A02_4ofxA01.pse | 4idcA02_4ht3B01.pse) are looking quite good (65-70 SSAP)
Another quick example is
node1 | node2 | val1 | val2 | seqDB1 | seqDB2 | xhit_url |
---|---|---|---|---|---|---|
3.20.20.60.0 | 3.20.20.120.0 | 287 | 0 | CATH-S40-nr_Ver:4_2_0 | crosshit_Ver:4_1_0 | Crosshits_v4_1_0 Nothing for v4_0_0 onwards |
-- | -- | -- | -- | -- | -- | -- |
Superpositions are quite good as well 2vwsA00_3dgbA02.pse 2qiwA01_3cawA02.pse
Guess: These crosshits seems to be manually hidden or suppressed, or the S35-superfamily relations gets changed across the versions.
To-Do: I will later make a comparison between ISS-v4_1_0 with Xhit-v4_1_0 to make sure these results are meaningful.
PS: I will be going to London Monday afternoon or Tuesday morning. Any chance I can meet briefly with you guys? @sillitoe @tonyelewis
It seems I misunderstood the meaning of crosshit v4_1_0. Most of the domains from the file are indeed v4_1_0 domains, so it seems justified to compare it against ISS v4_1_0 (results generated from v4_1_0 sequence data, nr-S40)
Hi guys,
I have been improving the database routine for ISS (and producing useful plots). During my browsing I happen to come across this Xhit:
The ISS_raw indicates the number of hmmsearch-domtbl-hits shared by the two superfamily. The hmmsearch is carried out using S35-reps, and the results are aggregated for each superfamily.
I have also checked the v4_1_0 and v4_2_0 crosshit databases, none was listed about there 2 SF Xhiting (saying "There were 0 entries in the Crosshit database between superfamily 3.40.50.10730 and superfamily 3.50.50.60 (ordered by HMM evalue, then SSAP score).").
More specifically, 1vj0A02 from 3.50.50.60 is hitting 1uwkA02 (the only s-rep in 3.40.50.10730) with a SSAP-score of 74.22 and RMSD of 4.07. See superpositoin pymol session.
My major question is, why this hit was not displayed in v4_1_0? The hmm(1uwkA02) - protseq(1vj0A02) hit is hightly significant (bitscore 84.500, translating to a P-value around 1E-8 ).
Please advise. @sillitoe @tonyelewis @nataliedawson
Kind regards Feng