Allegra-Cohen / grid


Clustering speedup #68

Open kwalcock opened 1 year ago

kwalcock commented 1 year ago

The Python code probably shouldn't be translated directly into a different language as is. It should first be optimized for efficiency and then ported if still necessary.

More to follow...

kwalcock commented 1 year ago

In code like

for i in range(2, self.k_max + 1): # Must be inclusive
    self.k = max(i, len(seeded_document_clusters)) # If the user has seeded more clusters than the k you're considering, then you can't reduce that number

it looks like the smallest k that can actually be used is len(seeded_document_clusters). For all values of i below that, k stays pinned at that minimum and the clustering is recalculated for the same k. That could waste some time.

With for i in range(2, 6) and 4 seeded_document_clusters, k would be:

    i  k
    2  4
    3  4
    4  4
    5  5

That 2 can probably be replaced with max(2, len(seeded_document_clusters)).
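
A minimal sketch of that change, reusing the names from the snippet above (self.k_max, self.k, seeded_document_clusters) and assuming the same method context:

start = max(2, len(seeded_document_clusters))  # smallest i that can produce a new k
for i in range(start, self.k_max + 1):  # still inclusive of k_max
    self.k = i  # i already respects the seeded clusters, so max() is no longer needed
    # ... run the clustering for this k exactly once ...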

kwalcock commented 1 year ago

@poleseArthur, checking whether notifications are working...

kwalcock commented 1 year ago

However, when a repeated k value is tried, the randomized seeding by pairs ends up different each time, possibly producing multiple answers for the same k. That might give the repeated k an advantage over the other values. Skipping the repeated values would therefore lead to different results.
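
One way to keep the comparison fair without retrying repeated k values would be to give every unique k the same number of random restarts and keep only its best result. This is just a sketch; run_clustering and its score attribute are hypothetical stand-ins for the real code:

import random

def best_over_restarts(k, n_restarts, run_clustering):
    # Try the same k several times with different random seeds and keep the best,
    # so no k gets extra chances merely because it appeared for several values of i.
    best = None
    for _ in range(n_restarts):
        result = run_clustering(k, seed=random.randrange(2**32))
        if best is None or result.score > best.score:
            best = result
    return best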

kwalcock commented 1 year ago

The soft-kmeans algorithm appears to run in linear time. The graph below is for k = 5 clusters; the fitted exponent is ~1.0. It doesn't seem to be the algorithm that is holding us back. If the implementation can be sped up, we should be good, @Allegra-Cohen, @poleseArthur. There are still problems with convergence, though. The code is in the kwalcock/test branch.

[Figures: scaling plots for k = 5; fitted exponent ~1.0]
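
For reference, the scaling exponent can be estimated by fitting a line to run time versus corpus size on a log-log scale. The numbers below are placeholders, not the measurements behind the plots:

import numpy as np

# Placeholder measurements: corpus sizes and wall-clock seconds.
sizes = np.array([250, 500, 1000, 2000, 4000])
seconds = np.array([0.8, 1.6, 3.3, 6.5, 13.1])

# Fit log(t) = a * log(n) + b; the slope a is the empirical scaling exponent.
slope, intercept = np.polyfit(np.log(sizes), np.log(seconds), 1)
print(f"empirical exponent ~ {slope:.2f}")  # ~1.0 suggests linear scaling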

kwalcock commented 1 year ago

I did some optimizing of the Python code just by reducing the amount of unnecessary recalculation, but it only sped up the run by about 2 seconds out of 30, so not significantly. I'm doing a quick and dirty translation to Scala (on the JVM) to get an idea of what the possibilities are. After that, there is the option of converting the clustering to compiled code via Scala Native, or of replacing the entire Python backend with a different one so that we don't need three different languages. The goal is still to minimize the changes. FYI @poleseArthur.
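
The kind of recalculation reduction meant here is hoisting work that doesn't depend on k out of the loop over candidate k values. The function names below are hypothetical, just to illustrate the shape of the change:

import numpy as np

def sweep_k(documents, vectorize, cluster, k_values):
    # vectorize() stands in for the expensive per-document step; doing it once
    # here instead of once per candidate k avoids the repeated work.
    vectors = np.vstack([vectorize(doc) for doc in documents])
    results = {}
    for k in k_values:
        results[k] = cluster(vectors, k)  # only the clustering itself repeats
    return results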

poleseArthur commented 1 year ago

Great! I just think that replacing all of the Python backend would be a huge change to the project, and we would probably spend a lot of time on it.