Evaluate a novel cross-hit

shouldsee commented 7 years ago

Hi guys,

I have been improving the database routine for ISS (and producing useful plots). During my browsing I happen to come across this Xhit:

node1	node2	node1__hit_count	node2__hit_count	average hit count	ISS_raw	ISS_norm
3.40.50.10730.0	3.50.50.60.0	200.000	734.000	383.145	46.000	0.647
--	--	--	--	--	--	--

The ISS_raw indicates the number of hmmsearch-domtbl-hits shared by the two superfamily. The hmmsearch is carried out using S35-reps, and the results are aggregated for each superfamily.

I have also checked the v4_1_0 and v4_2_0 crosshit databases, none was listed about there 2 SF Xhiting (saying "There were 0 entries in the Crosshit database between superfamily 3.40.50.10730 and superfamily 3.50.50.60 (ordered by HMM evalue, then SSAP score).").

More specifically, 1vj0A02 from 3.50.50.60 is hitting 1uwkA02 (the only s-rep in 3.40.50.10730) with a SSAP-score of 74.22 and RMSD of 4.07. See superpositoin pymol session.

My major question is, why this hit was not displayed in v4_1_0? The hmm(1uwkA02) - protseq(1vj0A02) hit is hightly significant (bitscore 84.500, translating to a P-value around 1E-8 ).

Please advise. @sillitoe @tonyelewis @nataliedawson

Kind regards Feng

tonyelewis commented 7 years ago

I don't know the full answer but...

The crosshits protocol has been to do an all-vs-all PRC comparison of s-reps and then a SSAP of any pairs that get good-enough PRC result. Sure enough, though both 1vj0A02 and 1uwkA02 appear to have been included in the v4.1.0 lists, neither of these PRC results files makes mention of the other domain:

/cath/data/current/xhitsresults/v4_1_0/1vj0A02.PRC.results.xml
/cath/data/current/xhitsresults/v4_1_0/1uwkA02.PRC.results.xml

So the questions are:

why didn't we get a strong PRC hit between these two domains when you're finding that your different method does find a strong hit?
should that worry us?

We still have a few versions of the relevant models:

/cath/data/v4_0_0/hmm/1uwkA02.prc
/cath/data/current/hmm/1uwkA02.prc
/cath/data/v4_0_0/hmm/1vj0A02.prc
/cath/data/current/hmm/1vj0A02.prc

I think our PRC scans have often used commands like:

/opt/local/apps/linux/bin/prc/prc -algo viterbi some_prc_model.prc some_prc_library.lib

sillitoe commented 7 years ago

This all looks interesting @shouldsee - good work (and thanks for the info @tonyelewis)

For info, running PRC on the v4.0 models gives:

$ /opt/local/apps/linux/bin/prc/prc -algo viterbi /cath/data/v4_0_0/hmm/1uwkA02.prc /cath/data/v4_0_0/hmm/1vj0A02.prc
PRC 1.5.3 (PLAN9, SPACE5, ALL_TRANS), compiled on Mar 28 2007
Copyright (C) 2002-5 Martin Madera and MRC LMB, Cambridge, UK
Freely distributed under the GNU General Public License

Command     : /opt/local/apps/linux/bin/prc/prc -algo viterbi /cath/data/v4_0_0/hmm/1uwkA02.prc /cath/data/v4_0_0/hmm/1vj0A02.prc
Algorithm   : Viterbi
Match-match : dot2
Align mode  : local-local
Alignments  : none
Simple stop : 1.5
Max hits    : 100
Started     : Wed Sep 13 10:01:54 2017

# hmm1  start1  end1    length1 hit_no  hmm2    start2  end2    length2 simple  reverse
1uwkA02 25  90  208 1   1vj0A02 17  86  147   11.2     5.3
1uwkA02 177 201 208 2   1vj0A02 45  69  147    4.5    -1.5
1uwkA02 183 194 208 3   1vj0A02 30  41  147    3.7    -2.3
1uwkA02 32  71  208 4   1vj0A02 27  66  147    3.5    -2.4
1uwkA02 4   7   208 5   1vj0A02 22  25  147    3.5    -2.5
...

I'm not sure how these scores get processed by the xhits (i.e. cutoffs).

If we assume that this relationship wasn't in the xhits because it didn't meet some part of the criteria (rather than a bug), then it looks like this would be a candidate for a potential remote homology (which is precisely what ISS is trying to achieve).

ISS is meant to provide an alternative approach to homology detection - the objective is to identify homologous relationships that we haven't been able to spot with existing methods (i.e. strong links between domains currently in different superfamilies).

The question then is: how do we know that the putative relationship that you've pulled out is a "true" remote homologous relationship or a false positive? Ultimately - detective work (manual curation):

is there a strong signal in the sequence comparison?
is there a strong signal in the structure comparison? (global structural similarity or any particularly unusual structural features)
is there a clear link in the function (or functional mechanism)?

Generally speaking, we're looking for 2 of 3 of those criteria. @nataliedawson is our curator - she's busy this week but we can arrange a meeting next week to go through this if you're around.

shouldsee commented 7 years ago

Hi guys,

Thanks for your prompt responses!

I dug up a trac ticket 768 by @tonyelewis from 5 years ago. It seems SSAP=74.02 < 80, PRC_raw_score=11.2 probably translate to > 0.001, which might be causing the discarding. What do you think of this hypothesis @tonyelewis ?

........ We have come up with the following criteria for automatically assigning domain X to superfamily Y:

X has SSAP score >= 80% over an overlap >= 80% and a PRC evalue <= 0.001 with some s-rep in Y. X does not achieve this with an s-rep in any superfamily other than Y (perhaps to be safer, these cutoffs should be loosened a bit). Y is not in the list of xhits trouble-causers. To sanity-check this, we plotted SSAP scores vs PRC evalues for the data we've got for s-rep pairs in v3.5.0 (and where SSAP overlap >= 80%). I will attach the resulting graph (one version is zoomed in; the other is zoomed out). They look reassuring.

We eye-balled the false positives to check that they are xhits (and hence should be true positives). They also looked reassuring.

We're hoping to get moving on v3.6.0 ASAP so I'm hoping to implement this in time to run over the weekend.

@sillitoe Out of the three criteria you mentioned, I am particularly incapable in terms of identifying the "functional linkage" so probably would focus on the other two. ISS is ultimately exploiting the sequence information, whereas structural comparison would probably come in the validation process.

The result from my ISS searches still need a lot of sanitisation (most of them are nonsenses), thus I would probably go:

[x] fixing the obvious pitfalls ( handling non-overlapping/disjoint hits to the same sequence),
[ ] as well as writing a SSAP wrapper to facilitate testing.
[x] Then integrating existing Xhit results....

Lots to do! XD

As for the final implementation,

I envision it taking the form of a binary to accept a hmm_profile, a hmm_lib and a sequence_lib, and output a hit list. The painful bit is to search hmm_lib against sequence_lib (using hmmsearch) and the result should probably be cached to hmmhits ("hmmserach hmm_lib sequence_lib > hmmhits") so that we can do " iss hmmprofile hmmhits > result " or "iss hmmprofile_hit hmmhits > result".

The implementation I current have is " iss hmmhits > result ", which is kind of weird because it lists every possible pair of hits within the hmm_lib, which I think is less intuitive to visualise, but runs faster than concatenating indivdiual query results.

BTW, I would really like to hear from @sillitoe your ideas on current documentation of master branch 0eb5b50d16a3b4131f0be289b31e5f41c6425c5a , which consists of old functionality mentioned in my last presentation.#

I am probably free next week. What time do you prefer? @sillitoe

Kind regards Feng

tonyelewis commented 7 years ago

I dug up a trac ticket 768 by @tonyelewis from 5 years ago. It seems SSAP=74.02 < 80, PRC_raw_score=11.2 probably translate to > 0.001, which might be causing the discarding. What do you think of this hypothesis @tonyelewis ?

I think those are the criteria for post-scan-auto-assign. So if you're asking about post-scan-auto-assign, then: yes, given that the SSAP score fails the criteria then that would mean that the second domain that got assigned wouldn't have been post-scan-auto-assigned based on a match with the first (if that had been implemented at the time the second domain was being assessed).

If you're asking about crosshits, then: no, these criteria are irrelevant. These two were included in the v4.1 crosshits scans but had no PRC result, which means either that there was an error or that they result didn't meet the relevant threshold.

I think you may need to run PRC against a library to get an evalue.

shouldsee commented 7 years ago

@tonyelewis Can you plz show me the path to the PRC library? What do you think is the PRC evalue/pvalue threshold exactly?

@sillitoe I am using "http://update.cathdb.info/api/data/" to access "/cath/data/current/", but how can I access "/cath/data/v4_0_0/" to grab the old *.prc profiles?

tonyelewis commented 7 years ago

There's no such thing as the PRC library, just libraries made of PRC models. A PRC library "lists model files (using absolute paths), one filename per line" (second graf in this PRC doc)

It might be pretty sensible to use something like /cath/data/v4_0_0/hmm/prc.lib (which contains the two domains).

It looks like the evalue threshold is 0.01 :

cathdb_v4_1_0=> SELECT MAX( evalue ) FROM crosshits;
 max  
------
 0.01

shouldsee commented 7 years ago

@tonyelewis It seems I cannot do "wget http://update.cathdb.info/api/data/hmm/prc.lib" or "wget http://update.cathdb.info/api/data/hmm/*.prc".

In this case, shall I grab PRC models for each S100 or each S35 or each existing domain? (To replicate the crosshit result)

tonyelewis commented 7 years ago

Try : http://www.biochem.ucl.ac.uk/~ucbctnl/prcs.tar.gz

(please take a local copy because I'll delete that file at some point)

shouldsee commented 7 years ago

Copy taken! Many thanks! @tonyelewis

shouldsee commented 7 years ago

I tried the "comment" feature of crosshit and flagged this Xhit as chopping error

tonyelewis commented 7 years ago

@shouldsee Thanks for your feedback on this chopping error.

The comment feature on the crosshits pages is typically only used during that particular round of crosshits so I doubt that a comment put on there will ever get read in the future.

Chopping errors are better documented on the ToBeRechopped page on the Trac Wiki (or emailed to one of us if you don't have access to that). To be useful, the comment needs to include which chain(s) you think need(s) rechopping, and any more information (eg what the current chopping problem is; why you think it's a problem; how it should be fixed etc).

shouldsee commented 7 years ago

@tonyelewis Thanks for the advice. I was quite amused by how it actually worked.

@sillitoe @tonyelewis I am pleased to say that the ISS-Xhit browser is on the edge of being finished. aka. The browsing functionality is nearly finished, whereas the data needs some improvement. (I need to rerun the ISS search with the hmm/seqeunces from v4_0_0 so as to make a meaningful comparison with the v4_0_0->v4_1_0 crosshit data, courtesy of @sillitoe . The current comparison is made between ISS-v4_2_0 against Xhit-v4_1_0 because I don't have Xhit-v4_2_0)

The important thing, is that ISS-v4_2_0 is flagging up many "novel" crosshits. ("novel" in that the corresponding crosshit v4_2_0 page like this is showing "There were 0 entries...." ) In other words, this novelty is only relative to the data from "xhits.cathdb.info/crosshits.php", so if this reference data is incomplete (which I am not sure of), then the novelty may not be significant.

Comparing ISS-v4_2_0 to Xhit-v4_1_0

One of the example is:

node1	node2	val1	val2	seqDB1	seqDB2	xhit_url
3.40.50.720.0	3.40.50.1100.0	1970	0	CATH-S40-nr_Ver:4_2_0	crosshit_Ver:4_1_0	Crosshits_v4_1_0
--	--	--	--	--	--	--

For which ISS is detecting 1970 S35-S35 hits with acceptable quality, whereas nothing has been reported in the file handed over by Ian, or on the crosshit website. (Actually the v4_0_0 Xhit is reporting 2 entries which disappeared in newer versions). And the superpositions (3ip1A02_4ofxA01.pse | 4idcA02_4ht3B01.pse) are looking quite good (65-70 SSAP)

Another quick example is

node1	node2	val1	val2	seqDB1	seqDB2	xhit_url
3.20.20.60.0	3.20.20.120.0	287	0	CATH-S40-nr_Ver:4_2_0	crosshit_Ver:4_1_0	Crosshits_v4_1_0 Nothing for v4_0_0 onwards
--	--	--	--	--	--	--

Superpositions are quite good as well 2vwsA00_3dgbA02.pse 2qiwA01_3cawA02.pse

Guess: These crosshits seems to be manually hidden or suppressed, or the S35-superfamily relations gets changed across the versions.

To-Do: I will later make a comparison between ISS-v4_1_0 with Xhit-v4_1_0 to make sure these results are meaningful.

PS: I will be going to London Monday afternoon or Tuesday morning. Any chance I can meet briefly with you guys? @sillitoe @tonyelewis

shouldsee commented 7 years ago

Update:

It seems I misunderstood the meaning of crosshit v4_1_0. Most of the domains from the file are indeed v4_1_0 domains, so it seems justified to compare it against ISS v4_1_0 (results generated from v4_1_0 sequence data, nr-S40)

CATH-summer-2017 / django_CATH