chadadavis / sbglib

Automatically exported from code.google.com/p/sbglib
0 stars 0 forks source link

Faster (generic) clustering #139

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Faster distance matrix, for distances that are a 'metric'

See t/SBG/Network.t
Measure the difference.

Rathern than (N) * (N-1) / 2
Do: put everything into a list, one item per box.
start with first box, take first item.
Measure distance to the item in every other box, reclassifying them as you go, 
into bins, say 10% difference per bin.
If the 100% bin is empty, then the current item is a singleton
Any bins with just one item are singletons
recursively process any othe bins that have any thing in them, taking the first 
item from a givne bin, comparing it to all others.

From G:
20 D H C B E A F 
30 I

So, I is a singleton, process any plural clusters
And G didn't find any > 90 hits, so it's a singleton too

From D:
20 H 
30 C I B A (we wouldn't have processed I here after removing singleton)
100 E F 

Since there are members of 100, D+E+F, cross off 100
Also cross off H as a singleton

From C:
100 B A

Done, and since C has a 100 group, link it and remove all of them.
Now, none left. Clusters are:

G
I
D E F
H
C B A

Original issue reported on code.google.com by chad.a.davis@gmail.com on 17 Jun 2010 at 3:55

GoogleCodeExporter commented 9 years ago
This nearly works. 
It fails when two sequences are close to each other, but in separate bins.
This could be resolved by not binning, but by clustering.
Take the head sequence and align to each other. Put these on the number line 
0-100. Starting at 100, take all sequences until one sequence is more than 10 
percentage points from the one above it. If this group started somewhere above 
90, then it's the same group as the head sequence. Otherwise process the rest 
recursively. Find the next 'cluster' on the number line with no more than 10 
percentage points of gap. Take the head of that group and align it to the 
others in the group. Process recursively, returning a list.

Original comment by chad.a.davis@gmail.com on 22 Jun 2010 at 1:11

GoogleCodeExporter commented 9 years ago
Probably should just be doing standard linkage here (to avoid all-vs-all)
See Matt's code.

Original comment by chad.a.davis@gmail.com on 27 May 2011 at 7:28

GoogleCodeExporter commented 9 years ago

Original comment by chad.a.davis@gmail.com on 27 May 2011 at 7:33

GoogleCodeExporter commented 9 years ago
use Algorithm::Cluster::Thresh with SBG::Stamp::superposition()->scores->{Sc} 
as as distance metric (via Algorithm::DistanceMatrix)

Original comment by chad.a.davis@gmail.com on 19 Oct 2011 at 3:40

GoogleCodeExporter commented 9 years ago

Original comment by chad.a.davis@gmail.com on 19 Oct 2011 at 3:42