Distance Matrix: allow basic statistics on distances

wvdvegte commented 1 year ago

What's your use case?

I performed clustering on a corpus of documents based on t-SNE coordinates. For further analysis, I would like to extract, for each cluster, which other cluster is furthest away, i.e., the most dissimilar. To that end, I computed the average t-SNE x and y coordinates for each cluster using Group By, and then computed Distances based on the coordinates. Based on this, I can create a Distance Matrix like this:

What I would like to extract in an automated way is, for each column, the row ID with the greatest distance and the value of that distance. For other purposes, it may also be useful to get the row ID with the smallest distance with its value, the average distance in each column, etc.

In 'normal' use of the distance matrix, where each row/column represents a data point, it could also be useful to automatically extract for each data point, which other data point is furthest away, how far away it is, etc.
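To make the request concrete, here is a minimal sketch of the computation in plain Python, outside Orange. The function names (`cluster_centroids`, `farthest_cluster`) and the sample coordinates are made up for illustration, and Euclidean distance between cluster centroids is an assumption about the Distances settings.

```python
from collections import defaultdict
from math import dist


def cluster_centroids(points, clusters):
    """Average the (x, y) coordinates of the points in each cluster."""
    grouped = defaultdict(list)
    for point, cluster in zip(points, clusters):
        grouped[cluster].append(point)
    return {
        cluster: tuple(sum(axis) / len(members) for axis in zip(*members))
        for cluster, members in grouped.items()
    }


def farthest_cluster(centroids):
    """Map each cluster to (farthest other cluster, Euclidean distance).

    Needs at least two clusters; with one cluster there is no "other".
    """
    result = {}
    for name, centre in centroids.items():
        others = [
            (other, dist(centre, other_centre))
            for other, other_centre in centroids.items()
            if other != name
        ]
        result[name] = max(others, key=lambda pair: pair[1])
    return result


# Made-up t-SNE coordinates and cluster labels, for illustration only.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0), (1.0, 5.0)]
clusters = ["C1", "C1", "C2", "C2", "C3"]

centroids = cluster_centroids(points, clusters)
farthest = farthest_cluster(centroids)
for name, (other, distance) in farthest.items():
    print(f"{name}: farthest is {other} at {distance:.2f}")
```

Replacing `max` with `min` would give the nearest cluster instead, and averaging the distances in `others` would give the mean distance per cluster.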

What's your proposed solution?

Several options, from most useful to least useful:

Create an output with distance statistics (as described below the above image) that can be viewed in Data Table, saved as a file, etc.
Create an output with the distances as a 'regular' data table, with labels as row IDs and the same labels as column headers. This would allow some further processing, although I don't see how it would give me the desired results directly. Nevertheless, I could at least sort each column in descending order in turn and read off the max. distances manually.
Make the Distance Matrix manipulable the same way as Data Table: allow sorting by clicking on column headers. This would make it easier to manually get the max. distance with associated cluster.

Are there any alternative solutions?

Use Save Distance Matrix, and analyze further using spreadsheet software. Disadvantages: labels have to be added manually, and the top-right half of the matrix is missing, so it has to be created quasi-manually with formulas in the spreadsheet. After that, the max. distances can again be obtained by repeated sorting.
Use Distance Map, and use a color gradient that makes it relatively easy to pick out the max. values per column (or row). Again, a largely manual approach that isn't without risk of errors.
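For the first alternative, the missing top-right half does not have to be rebuilt with spreadsheet formulas. Below is a minimal sketch in plain Python, assuming the saved values have already been read into a list of rows where row i holds the distances to items 0..i (diagonal included). That layout and the name `mirror_lower_triangle` are assumptions for illustration, not a description of the file format that Save Distance Matrix writes.

```python
def mirror_lower_triangle(lower_rows):
    """Build a full symmetric matrix from lower-triangular rows.

    Assumes row i holds the distances to items 0..i (diagonal included).
    """
    n = len(lower_rows)
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(lower_rows):
        if len(row) != i + 1:
            raise ValueError(f"row {i} should have {i + 1} entries, got {len(row)}")
        for j, value in enumerate(row):
            # Write each value to both (i, j) and its mirror position (j, i).
            full[i][j] = value
            full[j][i] = value
    return full


# Made-up distances between three clusters, for illustration only.
lower = [
    [0.0],
    [10.0, 0.0],
    [5.0, 11.2, 0.0],
]
full = mirror_lower_triangle(lower)
for row in full:
    print(row)
```

With the full matrix in hand, the per-column maximum can be computed directly instead of by repeated sorting.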

janezd commented 1 year ago

We discussed this at today's meeting.

We'd add an option to order the objects based on leaf-ordered clustering. This will help you find the closest instance.

> What I would like to extract in an automated way is, for each column, the row ID with the greatest distance and the value of that distance. For other purposes, it may also be useful to get the row ID with the smallest distance with its value, the average distance in each column, etc.

This makes sense, but it doesn't belong in this widget: it is not related to the (visual representation of the) distance matrix. We could have a separate widget that would be given a matrix and show a table with the names of the objects (as Distance Matrix does now) and, for each, the nearest or the farthest (user's choice) object, together with the distance. The widget would also output this table in case the user wants to save it. (There's not much else that one could do with this table.) Does this sound OK?
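As a rough illustration of the table such a widget could produce, here is a sketch in plain Python. It is not Orange code: the function name `distance_summary`, the `mode` argument and the sample matrix are hypothetical, and the input is assumed to be a full square matrix with one label per row.

```python
def distance_summary(labels, matrix, mode="farthest"):
    """Return one (object, other object, distance) row per object.

    `mode` is "nearest" or "farthest"; the diagonal (an object's
    distance to itself) is skipped.
    """
    if mode not in ("nearest", "farthest"):
        raise ValueError("mode must be 'nearest' or 'farthest'")
    pick = max if mode == "farthest" else min
    rows = []
    for i, label in enumerate(labels):
        candidates = [
            (labels[j], matrix[i][j]) for j in range(len(labels)) if j != i
        ]
        other, distance = pick(candidates, key=lambda pair: pair[1])
        rows.append((label, other, distance))
    return rows


# Made-up labels and distances, for illustration only.
labels = ["C1", "C2", "C3"]
matrix = [
    [0.0, 10.0, 5.0],
    [10.0, 0.0, 11.2],
    [5.0, 11.2, 0.0],
]

farthest_rows = distance_summary(labels, matrix, mode="farthest")
nearest_rows = distance_summary(labels, matrix, mode="nearest")
for row in farthest_rows:
    print(row)
```

When several objects are tied, this sketch reports the first one in label order; a real widget would have to decide how to handle ties.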

wvdvegte commented 1 year ago

Yes, I think it makes sense. It could be a 'Distance Analysis' or 'Matrix Analysis' widget. How the leaf-ordered clustering will work is not completely clear to me, but I'll give it a try once it's there.

biolab / orange3

Distance Matrix: allow basic statistics on distances #6556