exhuma / python-cluster

Simple clustering library for python.
GNU Lesser General Public License v2.1
65 stars 27 forks source link

Provide an option to sort unsortable objects #23

Closed exhuma closed 6 years ago

exhuma commented 7 years ago

When running the HierachicalCluster algorythm, items must be sortable. This is not always the case, and it would be useful to have the option to inject a custom sort function.

Original report via @anupamme:

I believe HierarchicalClustering also does not work with custom objects e.g. I have following code:

cl = HierarchicalClustering(salary_head_probables_list, lambda x,y: _find_squared_distance(
        x['poly_center'], y['poly_center']))

where, salary_head_probables_list is a list of custom object _find_squared_distance returns of type float

I am getting this error trace:

File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/cluster/method/hierarchical.py", line 79, in __init__
    BaseClusterMethod.__init__(self, sorted(data), distance_function)
TypeError: unorderable types: dict() < dict()
exhuma commented 7 years ago

Internally, the algorithm sorts the items which are given as data. This means that the items must be sortable. This can be achieved by using a custom object that has a def __lt__(self, other) method.

This may be a workaround once a better solution is available.

I've tried the above example with the following:

from cluster import HierarchicalClustering

def dist_funtion(a, b):
    return a.poly_center - b.poly_center

class Obj:

    def __init__(self, poly_center):
        self.poly_center = poly_center

    def __lt__(self, other):
        return self.poly_center < other.poly_center

data = [
    Obj(1.1),
    Obj(1.2),
    Obj(2.3),
    Obj(2.4),
    Obj(3.5),
    Obj(3.6),
    Obj(4.7),
    Obj(4.8),
]
cl = HierarchicalClustering(data, dist_funtion)
anupamme commented 7 years ago

Hi @exhuma

I tried this. It works as much you have mentioned in the comment. If we go further and form the cluster e.g.

cl.getlevel(100) or cl.getlevel(1) or cl.getlevel(0.01)

we get the same result. I think clustering is not happening correctly. Or I am missing something here.

exhuma commented 7 years ago

Yeah. I realised the same thing after submitting this. I'll have to give it a more detailed look. It would help a lot if I had a fully working example, like the one I posted above in the ticket description. I'm currently a bit swamped with other work so my time I can invest in this is extremely limited. Having a working example (can be fairly small), also with an expected output would help a lot.

I think I should find some time on Saturday or Sunday. I'll create my own test-set based on the one above, but if you have something which is closer to your data set we can avoid other potential problems.

anupamme commented 7 years ago

Why cant we use the same dataset?

data = [ Obj(1.1), Obj(1.2), Obj(2.3), Obj(2.4), Obj(3.5), Obj(3.6), Obj(4.7), Obj(4.8), ]

when I do cl.getlevel(0.1) I should get result as:

[ [ Obj(1.1), Obj(1.2)], [ Obj(2.3), Obj(2.4)], [ Obj(3.5), Obj(3.6)], [ Obj(4.7), Obj(4.8)] ]

Or you wanted something else?

exhuma commented 7 years ago

I've been looking into this over the past two days and I still have a bit of trouble of figuring out exactly what's broken. Unfortunately, I have an important deadline coming up and I need to focus on that one first.

From what I've seen so far in the code, this issue might require some major refactoring, so I can't promise anything. This project dates back to early 2007 and I have to admit that - even though I've written most of it - I have to invest time to dig back into it.

anupamme commented 7 years ago

Ok, no problem.

Do you think you will be able to look at it later this week? My timeline is flexible so I can wait for a few days.

If you have some clarity on how it can be achieved and just describe it to me that will also work.

exhuma commented 7 years ago

I doubt I will be able to look at it this month even. I have two Python and on Git training coming up in June and July which still need to be fleshed out a bit, and an important project to deploy for July as well. It's looking highly unlikely that I will be able to touch this in this summer.

I am aware that this is not the answer you were hoping to hear :(

I'm doing all those things in my free time which gives me about 2-3 hours per day to work on them, if I keep some time for "real life stuff" :)

To get to the problem of the library:

I tried to investigate what was going on but got stuck in the way the "distance matrix" is constructed internally. It works on a list with heterogenuous items which makes the code hard to follow. It also forces me to do some isinstance checks needed to determine of an item in the matrix is an "atomic" data-item as passed in by the user, or a cluster.

I'm not sure if this is the cause of the problem, but it looks to me like the aforementioned checks might cause some false-positives. my plan for this is to "wrap" the items that the user sends into the library in some form of "cluster" instance. But looking at the Cluster class which already exists, I'm not sure if that would be usable. Once the items are wrapped, the checks would become easier. But again, I'm not sure if this would solve the issue, but it might make it easier to identify the problem.

It is very difficult to refrain from rewriting most of this code... I've implemented this more than 10 years ago, and I learned a lot since then. But one thing I also learned is that complete re-writes are rarely the way to go as - while you fix old bugs - you will introduce new ones. As tempting as it is to rewrite, I will refrain from doing so for now...

tim-littlefair commented 6 years ago

I've created a pull request https://github.com/exhuma/python-cluster/pull/29, based on reproduction of issue #28. From exhuma's comment of 15 May 2017, it is possible this pull request might also be relevant to this issue. Note that I do not expect the changes in this pull request to change library behaviour at all, they just add a test (related to issue #28) and some logging which highlights divergence between the behaviour under Python2.7 and Python3.5 at the passage in matrix.py which is attempting to determine whether a data item is 'atomic' (as discussed in exhuma's comment).

exhuma commented 6 years ago

Thanks @tim-littlefair. I am currently drowning in work on other projects and may need some time to review this all. I will definitely need a few more weeks to clean out my schedule to work on this.

tim-littlefair commented 6 years ago

Thanks Michel/@exhuma No hurry at all from my point of view. Thanks again for making the project available. Let me know if there is any way I can help. Tim

exhuma commented 6 years ago

@anupamme I finally got around to look into this and the solution turns out to be dead-simple:

In my example above I gave a distance function which can return negative values. This does not make sense. A distance can never be negative. And this trips up the logic finding the "minimum distance" between items!

Replacing the distance function with the following solves the issue:

def dist_funtion(a, b):
    return abs(a.poly_center - b.poly_center)

Concerning the original topic of this thread I have to admit that in hindsight I do not consider the "unsortable" issue as something that would make sense to address in the library. The error message TypeError: unorderable types: dict() < dict() is the standard error message you get for sorting unsortable stuff in Python. Wrapping something around this would only obscure/obfuscate the root cause which is why I will not address this!

In the end, it is the responsibility of the library-user to provide a correct distance function and orderable types.

I will however add a small note about this in the docs. I will not add an implicit all to abs in case someone finds an edge case where this might not be desirable.