Re-do clustering - Githubissues

ilectra commented 6 years ago

From https://github.com/UCL/HHprY-Project/issues/18 :

ACTION 4. Clustering: 4A) Overlapping hits should be clustered together no matter what the pSS, leaving non-overlapping ones separate. [...] 4B) Could the number of clusters in the default view be varied by length? We could have a rule of up to 10 clusters for all proteins less than 1000 residues, then 1 extra cluster per 200 residues on top. So for YLL040W/Vps13 (3144 aa) the page would have up to 21 clusters. It does not matter about having 21 different colours, but we could try anyway - I could supply some more pastel colours. [...]
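A minimal sketch of the 4B rule, to make the arithmetic concrete. The function name and parameters are hypothetical, and rounding the extra clusters up is an assumption chosen because it reproduces the 21 clusters quoted for a 3144 aa protein (rounding down would give 20).

```python
import math

def max_clusters(protein_length, base=10, base_length=1000, step=200):
    """Upper limit on clusters shown in the default view for a protein."""
    if protein_length <= base_length:
        return base
    # One extra cluster per `step` residues beyond `base_length`.
    # Rounding up is an assumption that matches the quoted example.
    extra = math.ceil((protein_length - base_length) / step)
    return base + extra

print(max_clusters(3144))  # prints 21
```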

Actions:

[x] @ilectra Make a new clustering algorithm (starting point, our methodology will most likely evolve over time until we get it right):

1) Remove the hit that contains the same name as the protein in its description, if such a hit exists.
2) Start from the hit with the top probability and go down the list.
3) For every hit, compare with each hit with higher probability (percentages quoted are of the shorter of the two hits being compared):
   A) If fewer than max(10%, 10) residues overlap, the hits are considered separate. [Did I write this down correctly? For example, a hit of length 30 won't be considered overlapping with another if they overlap by 9 residues, even though this is almost 1/3 of the length of the shorter one?]
   B) If >50% overlap, the hits are considered overlapping, and they'll be added to the same cluster. [What happens if there's legitimate overlap with more than one hit/cluster? - this can still be an issue for edge cases, despite point (4) below.]
4) If the hit overlapped with other(s) with a percentage between 10% and 50%, cluster it together with the one with the highest overlap.
5) If the hit didn't overlap with any other, it starts its own cluster.
6) At the end, each "cluster" is a new, longer "hit", containing all the overlapping ones. [Order of traversal matters here - are clusters "creeping" to one side correct? Example:]

Hits:
---------
     --------
         ------------
Are clustered as:
---------------------
even though there's no overlap between the top and bottom one.

7) As a result of the previous procedure, anything contained fully inside something bigger will be clustered together.
8) Keep the name and probability of the top hit, and the number of hits clustered together.
9) If hits with the same name appear, cluster them together. [Do we really need to/can do this?]
10) The procedure should be able to handle both pfam (minimal overlaps) and pdb (tons of overlapping hits).
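The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation in the repository: the `Hit` fields, `cluster_hits`, and the protein name `YFG1` are hypothetical; steps 3B and 4 are folded together (a hit joins the cluster where it has the largest overlap); step 9 is left out.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    name: str    # hit description
    prob: float  # probability, in percent
    start: int   # first residue covered (inclusive)
    end: int     # last residue covered (inclusive)

    @property
    def length(self):
        return self.end - self.start + 1

def overlap(a, b):
    """Number of residues shared by two hits (0 if they are disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)

def cluster_hits(hits, protein_name):
    # Step 1: drop hits whose description contains the protein name.
    kept = [h for h in hits if protein_name.lower() not in h.name.lower()]
    # Step 2: walk the list from the most to the least probable hit.
    kept.sort(key=lambda h: h.prob, reverse=True)

    clusters = []  # each cluster is a list of hits, top hit first
    for hit in kept:
        best, best_fraction = None, 0.0
        # Step 3: compare with every hit of higher probability.
        for cluster in clusters:
            for other in cluster:
                shorter = min(hit.length, other.length)
                shared = overlap(hit, other)
                if shared < max(0.1 * shorter, 10):
                    continue  # 3A: treated as separate
                # 3B and 4 folded together (an assumption): remember
                # the cluster holding the largest overlap.
                if shared / shorter > best_fraction:
                    best, best_fraction = cluster, shared / shorter
        if best is None:
            clusters.append([hit])  # Step 5: start a new cluster
        else:
            best.append(hit)

    # Steps 6 and 8: report each cluster as one longer "hit".
    return [
        {
            "name": c[0].name,
            "prob": c[0].prob,
            "start": min(h.start for h in c),
            "end": max(h.end for h in c),
            "n_hits": len(c),
        }
        for c in clusters
    ]

# The "creeping" example, scaled up so the 10-residue floor does not interfere.
example = [
    Hit("top", 99.0, 1, 90),
    Hit("middle", 90.0, 51, 130),
    Hit("bottom", 80.0, 91, 210),
]
print(cluster_hits(example, "YFG1"))
```

With these made-up coordinates all three hits end up in a single cluster spanning residues 1-210, even though "top" and "bottom" share no residues, which is the creeping behaviour questioned in step 6.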

[ ] @timlevine provide list of proteins with different features, to test the clustering algorithm.

tamuri commented 6 years ago

@timlevine have you had any further thoughts about this?

ilectra commented 6 years ago

@timlevine I've started working on this, and I was wondering what the clustering is supposed to do with hits whose overlap is between 10% and 50%. Anything below that is clearly a separate hit, and anything above is clearly an overlap, but what happens in between?

ilectra commented 6 years ago

A note hidden in a different issue:

For any processing (clustering, etc....) ignore anything with prob>99%, cause it's already known. Just display it on the plot, but don't take it into account.

timlevine commented 6 years ago

I think that these strong hits will cluster with each other and a number of weaker hits, so it's not possible to exclude them from clustering.

From: Ilektra Christidi notifications@github.com Sent: Wednesday, February 7, 2018 5:20:47 PM To: UCL/HHyeast-server Cc: Levine, Tim; Mention Subject: Re: [UCL/HHyeast-server] Re-do clustering (#10)

ilectra commented 6 years ago

@timlevine Just to make clear: The note "For any processing (clustering, etc....) ignore anything with prob>99%, cause it's already known. Just display it on the plot, but don't take it into account." is to be ignored and I add everything to the clustering?

Also, what about my previous question: what happens between 10% and 50% overlap?

timlevine commented 6 years ago

TBH I am no expert on how to cluster these. For Pfam there are no instances on their curated pages at http://pfam.xfam.org where one domain overlaps another, and only ever one Pfam hit for any part of the yeast genome is shown in SGD.

Ideally we should try to emulate that, but I don't know how we can achieve that.

I think that hits to clearly described, interesting domains >99% can be assumed to be gold standard / correct, so clusters formed from all the >99% hits could then be used to grab nearby hits that share their characteristic position. This includes things that are wholly subsumed within them (with rare exceptions) and hits weaker than them that partially overlap (say >50% as we had before). With <50% overlap, we could look for a different cluster.
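One way to read this "seed and grab" idea, as a rough sketch only. The function names and the tuple layout are hypothetical, the coordinates are made up, and the choice that only >99% hits may widen a cluster's span (so weaker hits cannot make it creep) is an assumption, not something agreed in this thread.

```python
def overlap_fraction(a, b):
    """Overlap of two inclusive (start, end) ranges, as a fraction of the shorter."""
    shared = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    shorter = min(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return shared / shorter

def seed_and_grab(hits, seed_prob=99.0, min_fraction=0.5):
    """hits: list of (name, prob, start, end). Returns (clusters, leftovers)."""
    ordered = sorted(hits, key=lambda h: h[1], reverse=True)
    clusters = []   # each: {"span": [start, end], "members": [names]}
    leftovers = []  # weak hits that found no seeded cluster
    for name, prob, start, end in ordered:
        home = None
        for c in clusters:
            if overlap_fraction((start, end), c["span"]) > min_fraction:
                home = c
                break
        if home is not None:
            home["members"].append(name)
            if prob > seed_prob:
                # Assumption: only seed-quality hits extend the cluster span.
                home["span"] = [min(home["span"][0], start),
                                max(home["span"][1], end)]
        elif prob > seed_prob:
            clusters.append({"span": [start, end], "members": [name]})
        else:
            leftovers.append(name)
    return clusters, leftovers

# Made-up coordinates, for illustration only.
hits = [
    ("seedA", 99.8, 100, 300),
    ("seedB", 99.5, 120, 320),
    ("inside", 80.0, 150, 200),
    ("partial", 70.0, 220, 360),
    ("far", 60.0, 500, 600),
]
print(seed_and_grab(hits))
```

Here the two strong hits define one cluster over 100-320, which then absorbs the fully contained hit and the one overlapping by more than half; the distant hit is left over for a later pass.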

The key word here is interesting: how do we make sure that we avoid using "boring" hits (e.g. PF08426 in http://hhyeast.ukwest.cloudapp.azure.com:5000/ICE2) to hide the interesting ones? (here PF03348). The solution may escape us. It is not essential to get this in clustering, as the user should not rely on it! Still you might try clustering hits that include the name of the protein only with hits of their own length, and omitting included domains.

Alternatively, maybe this is telling us something, and we should just find a good enough algorithm to separate out all the different types of hit - no "arbitrary" rules. Is there a machine learning one, where we can mark its work on 1% of the genome before letting it loose on the rest?

ilectra commented 6 years ago

How about...

Completely remove hits with the same name as the protein from the set (the "boring" hits - I assume those will have 100% probability and cover most of the protein length)
Run the clustering as described above on the remaining hits:
- less than 10% -> no overlap
- more than 50% -> overlap
- between 10% and 50%, look for overlap with other cluster(s), and merge with the one that has the maximum overlap.

This method would have eg. http://hhyeast.ukwest.cloudapp.azure.com:5000/AFG3/pdb collapse down to 2 clusters, with the long hits on the right absorbing everything below them. Is that what you'd like to see? Are there a few proteins with different styles of hits that I could use to test my attempts?

I'd be reluctant to deploy a machine learning algorithm like a Neural Network or something before we seriously try some straight-forward solution first...

timlevine commented 6 years ago

Completely remove hits with the same name as the protein from the set (the "boring" hits - I assume those will have 100% probability and cover most of the protein length)

"With" is difficult here - I suppose containing the same string

Run the clustering as described above on the remaining hits:
- less than 10% -> no overlap
- more than 50% -> overlap
- between 10% and 50%, look for overlap with other cluster(s), and merge with the one that has the maximum overlap.

could work. OR do the easy ones (<10%, >50%) and see how many of the other ones we've got to deal with - could (should?) be very few indeed
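To see how many pairs actually land in the awkward middle band, something like the following could be run over a protein's hits. It is a sketch with hypothetical names and made-up ranges, and it assumes the "max(10%, 10) residues" floor from the issue description for the lower threshold.

```python
from itertools import combinations

def classify(a, b):
    """a, b: inclusive (start, end). Returns 'separate', 'overlap' or 'ambiguous'."""
    shared = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    shorter = min(a[1] - a[0] + 1, b[1] - b[0] + 1)
    if shared < max(0.1 * shorter, 10):
        return "separate"
    if shared > 0.5 * shorter:
        return "overlap"
    return "ambiguous"  # the 10%-50% band

def tally(ranges):
    """Count how many pairs of ranges fall into each category."""
    counts = {"separate": 0, "overlap": 0, "ambiguous": 0}
    for a, b in combinations(ranges, 2):
        counts[classify(a, b)] += 1
    return counts

# Made-up ranges, for illustration only.
print(tally([(1, 100), (20, 110), (90, 200), (300, 400)]))
# prints {'separate': 3, 'overlap': 1, 'ambiguous': 2}
```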

This method would have eg. http://hhyeast.ukwest.cloudapp.azure.com:5000/AFG3/pdb collapse down to 2 clusters, with the long hits on the right absorbing everything below them. Is that what you'd like to see? Are there a few proteins with different styles of hits that I could use to test my attempts?

Not sure where else to look. What about the pfam hits for the same protein? There PF03969 is a unique outlier that should not really be allowed to dominate the clustering, so that ideally there should be two clusters on the right-hand side: PF00308 and lots more like it, and PF01434, with a much smaller number.

ilectra commented 6 years ago

I updated the issue on the top of this thread, to put more details and my open questions (in [square brackets]). We should meet in person to discuss those on a whiteboard, are you available tomorrow at all?

As for the hit you're mentioning in AFG3/pfam, I can't see any way that we could pick it up as separate from the others. We need to talk/think more...

timlevine commented 6 years ago

re that hit: I would hope that it being an outlier (to my perception) might mean that it lies outside a cluster according to some algorithm

today was not possible for me - stuck 2 miles away in Old Street

Can you make Wednesday morning at about 10 ish or just after?

ilectra commented 6 years ago

re the hit: just to make sure we're talking about the same thing, you mean hit PF03969 is an outlier, not PF06480 ?

Wednesdays are not good days for me, they're already booked with meetings. Any other day except Tuesday afternoon would work, though.

timlevine commented 6 years ago

see comments

re the hit: just to make sure we're talking about the same thing, you mean hit PF03969 is an outlier,

yes

not PF06480 ?

I don't see 06480 anywhere here

the question is whether 01434 is "covered" by 03969 - I think it should not be. That can be achieved by putting 03969 into a different cluster from 00308, 00493 etc.

06431 is less of a problem as its overlap with 01434 is 10-50%. We'll come back to that sort of thing later.

Really the algorithm ought (in my mind) to define the main cluster here as 280-510, as that is the typical width of the hits that my mind groups together. The outliers (03969/06431) are just that - they lie outside the main group
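One simple way to express "typical width, with outliers lying outside it" is to take the median start and end of a group of hits and flag anything whose boundaries stray too far from them. This is only a sketch: the function name, the tolerance of half the typical width, and the coordinates are all assumptions for illustration, not measured values for the Pfam hits named above.

```python
from statistics import median

def main_span_and_outliers(hits, tolerance=0.5):
    """hits: dict of name -> inclusive (start, end).

    Returns the typical (start, end) and the names of hits whose
    boundaries deviate from it by more than `tolerance` x typical width.
    """
    start = median(s for s, _ in hits.values())
    end = median(e for _, e in hits.values())
    width = end - start + 1
    outliers = [
        name
        for name, (s, e) in hits.items()
        if abs(s - start) + abs(e - end) > tolerance * width
    ]
    return (start, end), outliers

# Made-up coordinates, for illustration only.
hits = {
    "a": (280, 510),
    "b": (285, 505),
    "c": (290, 515),
    "d": (275, 500),
    "wide": (150, 700),
}
print(main_span_and_outliers(hits))  # prints ((280, 510), ['wide'])
```

Using medians rather than means keeps a single very long hit from dragging the typical span towards itself, which is the behaviour asked for here.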

Wednesdays are not good days for me, they're already booked with meetings. Any other day except Tuesday afternoon would work, though.

Could you come here - EC1V 9EL (3 stops on the tube)? Any day except Friday, plus Wednesday 11-1 and Tuesday afternoon.

UCL / HHyeast-server

Re-do clustering #10