biolab / orange3-text

🍊 :page_facing_up: Text Mining add-on for Orange3

Annotated Corpus Map: allow at least 2 decimals for ε with DBSCAN #1000

Closed wvdvegte closed 1 year ago

wvdvegte commented 1 year ago

Is your feature request related to a problem? Please describe.

When applying DBSCAN clustering in Annotated Corpus Map, epsilon can only be specified to 1 decimal place. This seems to give rather coarse control when the x and y input variables fall in certain ranges. I'm currently investigating a corpus where, after applying BoW and t-SNE, the following numbers of clusters are found for different epsilon settings:

ε = 0.5 -> 1 "cluster" (default setting)
ε = 0.4 -> 3 clusters
ε = 0.3 -> 6 clusters
ε = 0.2 -> 11 clusters
ε = 0.1 -> 19 clusters

So, the jumps in numbers of clusters are rather big. It would be nice to have finer control.

Describe the solution you'd like

Allow at least 2 decimals, perhaps even more. Alternative: allow specification of a scaling factor for the x and y axis variables before applying DBSCAN (but I think this is less intuitive).

Describe alternatives you've considered

Use Feature Constructor to multiply the x and y axis variables by a factor. In my case this worked fine using the factor 10 for the t-SNE scales, but it's a kludgy workaround.
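
For anyone wondering why the scaling workaround is equivalent to a finer ε: with Euclidean distance, DBSCAN only asks whether two points lie within ε of each other, so multiplying both coordinates by 10 and using ε = 0.5 should give the same clusters as ε = 0.05 on the original scale. Below is a minimal pure-Python sketch of that idea. The `dbscan` helper, the toy points and `min_samples=2` are made up for illustration; this is not the widget's actual code, and the widget's settings may differ.

```python
from math import dist  # Python 3.8+


def dbscan(points, eps, min_samples=2):
    """Toy DBSCAN for illustration only; returns one label per point (-1 = noise)."""
    n = len(points)
    # Each neighbourhood includes the point itself.
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may turn into a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # only core points expand the cluster
    return labels


# Hypothetical 2-D embedding coordinates (not taken from any real corpus).
points = [(0.00, 0.00), (0.03, 0.00), (0.11, 0.00), (0.50, 0.50), (0.53, 0.50)]
scaled = [(10 * x, 10 * y) for x, y in points]

coarse = dbscan(points, eps=0.1)           # smallest nonzero ε with 1 decimal
fine = dbscan(points, eps=0.05)            # needs 2 decimals
workaround = dbscan(scaled, eps=0.5)       # 1 decimal on the scaled data

print(coarse, fine, workaround)
```

On this toy data the scaled run matches the ε = 0.05 run, while ε = 0.1 on the unscaled data merges the third point into the first cluster. So after scaling by 10, each 0.1 step in ε corresponds to a 0.01 step on the original scale.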

PrimozGodec commented 1 year ago

Thank you for reporting. I agree; I think we should enable setting more decimals. :)