LumenPallidium / neural-file-sorter

A neural network based file sorter. Trains an autoencoder to sort images or audio based on the similarity of their encodings, or uses the OpenAI CLIP model.
MIT License

no clip labels question #19

Open sneccc opened 1 year ago

sneccc commented 1 year ago

When I use no CLIP labels together with estimate_k = True, I get only 2-3 clusters. Is there a way to increase this number and force more clusters that share similar features, without disabling estimate_k? If I disable estimate_k, I have to guess more or less how many clusters I need, and I end up with too many clusters.

LumenPallidium commented 1 year ago

I think you would need to modify the get_best_kmeans function, which relies on the silhouette score. Some other options for clustering metrics can be found here: https://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation
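
A minimal sketch of what such a modification might look like. This is not the repo's actual get_best_kmeans (its signature is not shown in this thread); `pick_k_with_floor`, `min_k`, and `max_k` are hypothetical names. It assumes the embeddings are a 2D NumPy array and uses scikit-learn's `KMeans` and `silhouette_score`. The only change from a plain silhouette search is that the search starts at `min_k`, so the estimate can never fall below a chosen number of clusters:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def pick_k_with_floor(embeddings, min_k=2, max_k=10, random_state=0):
    """Return (k, score) with the best silhouette score for k in [min_k, max_k].

    Hypothetical helper: min_k acts as a floor on the estimated cluster count.
    """
    n_samples = len(embeddings)
    # The silhouette score is only defined for 2 <= k <= n_samples - 1.
    max_k = min(max_k, n_samples - 1)
    if min_k < 2 or min_k > max_k:
        raise ValueError("need 2 <= min_k <= max_k <= n_samples - 1")

    best_k, best_score = None, -np.inf
    for k in range(min_k, max_k + 1):
        labels = KMeans(
            n_clusters=k, n_init=10, random_state=random_state
        ).fit_predict(embeddings)
        if len(set(labels)) < 2:
            continue  # degenerate clustering, silhouette is undefined
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```

Swapping `silhouette_score` for another metric from the linked page (for example `davies_bouldin_score`, where lower is better, so the comparison would flip) is the other obvious knob.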

sneccc commented 1 year ago

@LumenPallidium For hierarchical clustering, couldn't we use CLIP with tags? We would first compare the top-level tags, e.g. "art, realism, design"; each image picks one label and goes down the tree. If it picks "art", it is then compared against, for example, "watercolor, pointillist, oil painting, graffiti", and so on. That way we can define a structure and tell CLIP to pick the best of the options at each level, and at the end everything is organized nicely. For example, an image of a lion could end up in the [realism -> wild photography -> lion] folder.

I was thinking we could define in a JSON file that each node has children, so instead of CLIP comparing all the tags at the same time, we compare level by level down the tree until it reaches a final leaf.
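
A rough sketch of the level-by-level walk described above, to make the idea concrete. Everything here is hypothetical: the tree contents, `classify_path`, and `score_labels` are illustrative names, and the scorer is a stub with hard-coded numbers. In a real version `score_labels` would presumably wrap a CLIP zero-shot comparison between one image and the candidate labels; that part is not shown.

```python
import json

# Hypothetical label tree: each key is a label, each value holds its children.
TREE_JSON = """
{
  "art": {"watercolor": {}, "oil painting": {}},
  "realism": {
    "wildlife photography": {"lion": {}, "zebra": {}},
    "portrait": {}
  },
  "design": {}
}
"""


def classify_path(tree, score_labels):
    """Walk the tree from the root, picking the top-1 label at each level.

    score_labels(labels) must return one score per label (higher is better).
    Returns the list of chosen labels from root to leaf.
    """
    path = []
    node = tree
    while node:  # an empty dict marks a leaf
        labels = list(node)
        scores = score_labels(labels)
        best = max(range(len(labels)), key=scores.__getitem__)
        path.append(labels[best])
        node = node[labels[best]]
    return path


# Stub scorer standing in for CLIP similarities of a single lion photo.
fake_scores = {
    "realism": 0.8,
    "art": 0.1,
    "wildlife photography": 0.9,
    "lion": 0.95,
}


def stub_scorer(labels):
    return [fake_scores.get(label, 0.0) for label in labels]


print(classify_path(json.loads(TREE_JSON), stub_scorer))
# ['realism', 'wildlife photography', 'lion']
```

The returned path maps directly onto a folder path. One known weakness of a greedy walk like this is that a wrong pick near the root cannot be corrected further down.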

I am not sure how accurate hierarchical clustering is if it doesn't use CLIP. I tried it, and in the 3D plot it looked off; I don't know if it needs more time to train or some adjustments. It renames the files to just numbers, and I don't know what that means. From what I saw, shouldn't it plot a dendrogram? [image attached]

LumenPallidium commented 1 year ago

Hmm, that is an interesting idea, I can look into it. I think it might look something like "given the nth level of the hierarchical cluster, what is the top-1 class". Not sure it could be guaranteed to follow an exact hierarchy in "class-space" though, but I will think about it some more.

As for the second part, I actually did not intend to use hierarchical clustering with plotting (since the end result is always a unique label for each point). It does rename the files to numbers (there are args, like n_symbols, for the HierarchicalClusterer class that can change this if you have many files), in such a way that they are sorted in a gradient based on their categorical similarity (very nice imo 😊). Attached is an example pic:

[Image: Screenshot 2023-06-20 at 10 13 46 PM]
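
To illustrate why numeric names can give a similarity gradient, here is a small sketch. It is an assumption about the general idea, not the HierarchicalClusterer implementation: it uses SciPy's `linkage` and `leaves_list` to get the dendrogram leaf order, then assigns zero-padded indices in that order. `gradient_names` and `min_width` are hypothetical names and are not the same thing as n_symbols.

```python
import os

import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage


def gradient_names(embeddings, filenames, min_width=1):
    """Map each filename to a zero-padded number following dendrogram leaf order.

    Files in the same subtree of the hierarchy get adjacent numbers, so a
    plain alphabetical listing of the renamed files moves gradually between
    similar items.
    """
    tree = linkage(embeddings, method="ward")  # agglomerative clustering
    order = leaves_list(tree)  # leaf indices, left to right
    width = max(min_width, len(str(len(filenames) - 1)))
    mapping = {}
    for rank, idx in enumerate(order):
        ext = os.path.splitext(filenames[idx])[1]
        mapping[filenames[idx]] = f"{rank:0{width}d}{ext}"
    return mapping


# Toy data: two well-separated groups, deliberately interleaved.
emb = np.array(
    [[0.0, 0.0], [10.0, 10.0], [0.1, 0.0], [10.1, 10.0], [0.0, 0.1], [10.0, 10.1]]
)
names = ["a0.png", "b0.png", "a1.png", "b1.png", "a2.png", "b2.png"]
for old, new in sorted(gradient_names(emb, names).items(), key=lambda kv: kv[1]):
    print(new, "<-", old)
```

In the toy run, the "a" files and the "b" files end up in two contiguous blocks of numbers, which is the effect visible in the screenshot.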

I'll document this some more, and perhaps add some warnings at run time in case hierarchical clustering is used with the visualization, or k-means is used when reorganizing with rename (k-means should only be used when rename = False).