clustering methods

That I couldn't give a meaningful answer to. I switched from DBSCAN, which an earlier iteration of this tool used, to hierarchical clustering with an Annoy index as an experiment to reduce memory usage when grouping larger datasets. I understood that it would mean slightly less accuracy, but figured that wouldn't be an issue, since in this case we're working with approximately similar embeddings rather than exactly similar ones anyway. If there is a performance benefit to this change in approach, then it's not one I anticipated.

In any case, I plan to get around to putting together another update that switches this tool to FAISS in place of hierarchical clustering on an Annoy index. In theory, this change should scale better to larger datasets while maintaining the same quality of grouping results as my current approach. So, if you're looking to build a similar tool, I'd recommend checking out FAISS.
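
For anyone after a concrete picture of the grouping step, here is a minimal sketch. It is not the code this tool uses. It assumes embeddings are plain lists of floats and that two items belong together when their cosine similarity reaches a threshold. Linking every such pair and taking the connected components is equivalent to single-linkage hierarchical clustering cut at that threshold. The names `group_embeddings`, `brute_force_pairs`, and the `find_pairs` parameter are hypothetical. The brute-force pair search is a stand-in for the step that an approximate index such as Annoy or FAISS would take over on larger datasets.

```python
import math
from typing import Callable, List, Sequence, Tuple

Pairs = List[Tuple[int, int]]


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def brute_force_pairs(vectors: List[List[float]], threshold: float) -> Pairs:
    """Return index pairs (i, j) whose cosine similarity is >= threshold.

    This O(n^2) scan is the assumed stand-in for an approximate
    nearest-neighbour index; it expects unit-length vectors.
    """
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sim = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            if sim >= threshold:
                pairs.append((i, j))
    return pairs


def group_embeddings(
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.9,
    find_pairs: Callable[[List[List[float]], float], Pairs] = brute_force_pairs,
) -> List[List[int]]:
    """Group item indices by linking every pair that meets the threshold."""
    vectors = [normalize(v) for v in embeddings]
    parent = list(range(len(vectors)))  # union-find forest, one root per group

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in find_pairs(vectors, threshold):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)  # merge the two groups

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Because the pair search is passed in as `find_pairs`, swapping the brute-force scan for an index-backed lookup would not change the grouping logic. An approximate index may miss some pairs, which is the slight accuracy trade-off mentioned above.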

LexCybermac / smlr

clustering methods #1