JuliaTrustworthyAI / CounterfactualExplanations.jl

A package for Counterfactual Explanations and Algorithmic Recourse in Julia.
https://juliatrustworthyai.github.io/CounterfactualExplanations.jl/

Improve visualization #184

Closed RaunoArike closed 1 year ago

RaunoArike commented 1 year ago

Multivariate input data is currently visualized using PCA. The quality of these visualizations degrades as the dimensionality of the input data increases. Investigate whether dimensionality-reduction techniques such as UMAP yield better visualizations.
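For reference, a minimal sketch of what a UMAP projection could look like next to the current PCA projection, assuming UMAP.jl, MultivariateStats.jl and Plots.jl are available (this is not the package's internal plotting code, and the data and labels are made up):

```julia
# Illustrative sketch only: compare a PCA projection (current approach) with a
# UMAP embedding (proposed alternative) on synthetic multivariate data.
using MultivariateStats, UMAP, Plots

X = randn(10, 500)                 # 10-dimensional data, one observation per column
y = rand(1:2, 500)                 # hypothetical class labels for colouring

# PCA projection to 2-D
pca = fit(PCA, X; maxoutdim=2)
Z_pca = predict(pca, X)            # 2 × 500

# UMAP embedding to 2-D
Z_umap = umap(X, 2; n_neighbors=15, min_dist=0.1)  # 2 × 500

p1 = scatter(Z_pca[1, :], Z_pca[2, :]; group=y, title="PCA")
p2 = scatter(Z_umap[1, :], Z_umap[2, :]; group=y, title="UMAP")
plot(p1, p2; layout=(1, 2), legend=false)
```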

VincentPikand commented 1 year ago

Limitations of UMAP for Visualizing High-Dimensional Data in the Context of Counterfactual Explanations

UMAP, or Uniform Manifold Approximation and Projection, is a popular technique for dimensionality reduction and visualization of high-dimensional data. However, it has several limitations and caveats, particularly when used to visualize data in the context of counterfactual explanations. Two key issues stand out: the interpretation of distances and cluster sizes, and the representation of decision boundaries.

Distances and Cluster Sizes in UMAP

UMAP can capture both local and global structure in the data, which makes it appealing for visualizing complex datasets. However, the representation it produces may not accurately reflect the actual distances and differences between data points in the original high-dimensional space.

In particular, the distances and cluster sizes in a UMAP plot do not have a consistent or meaningful interpretation. UMAP optimizes a trade-off between preserving local and global structure, so distances between points in the low-dimensional embedding do not directly correspond to distances in the original high-dimensional space. While UMAP may visually group similar points together, neither the size of those groups nor the gaps between them reflect the same quantities in the original data. For counterfactual explanations this matters directly: the apparent distance between a factual instance and its counterfactual in the plot does not tell us how large the underlying change actually is.
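As a quick illustration, one can compare pairwise distances before and after embedding; the correlation is typically far from perfect. A rough sketch, assuming Distances.jl and UMAP.jl and using synthetic data:

```julia
# Illustrative check: how well do pairwise distances in the 2-D UMAP embedding
# track pairwise distances in the original space?
using Distances, Statistics, UMAP

X = randn(20, 300)                          # 20-dimensional synthetic data
E = umap(X, 2)                              # 2 × 300 embedding

D_orig = pairwise(Euclidean(), X; dims=2)   # 300 × 300 distances, original space
D_emb  = pairwise(Euclidean(), E; dims=2)   # 300 × 300 distances, UMAP space

# Correlation over the upper triangle of both distance matrices; values well
# below 1 indicate that embedded distances are not faithful to the originals.
idx = [CartesianIndex(i, j) for i in 1:300 for j in i+1:300]
cor(D_orig[idx], D_emb[idx])
```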

Decision Boundaries and UMAP

Decision boundaries separate regions assigned to different classes and provide insight into how changes in the input variables can lead to different predictions.

However, when the data is transformed with UMAP for visualization, the representation of these boundaries becomes skewed. The complex, high-dimensional decision boundary of the original space is forced into a lower-dimensional representation, which distorts it and makes it difficult to interpret accurately. As a result, a decision boundary drawn in UMAP space may not correspond to the true boundary in the original high-dimensional space. A counterfactual shown against such a boundary can therefore be misleading: it may appear to cross the boundary in the plot while it does not in the original space, or vice versa.
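One concrete manifestation of this: the usual way to draw a decision boundary in a 2-D plot is to evaluate the classifier on a grid of 2-D points, which requires mapping those grid points back into the original feature space. PCA admits such an inverse map, whereas UMAP.jl does not appear to provide one, so any boundary drawn in UMAP coordinates has to be approximated. A rough sketch of the PCA case (not the package's plotting code; `clf` stands for an arbitrary probabilistic classifier and is hypothetical):

```julia
# Illustrative sketch: grid-based decision-boundary plotting relies on an
# inverse map from the 2-D plot coordinates back to feature space.
using MultivariateStats

X = randn(10, 500)                          # hypothetical 10-dimensional data
pca = fit(PCA, X; maxoutdim=2)
Z = predict(pca, X)                         # 2 × 500 projection

xs = range(minimum(Z[1, :]), maximum(Z[1, :]); length=100)
ys = range(minimum(Z[2, :]), maximum(Z[2, :]); length=100)

# With PCA, each grid point can be lifted back into the original space and
# passed to the model, e.g.
#   contour(xs, ys, (x, y) -> clf(reconstruct(pca, [x, y])))
# where `clf` is any function returning a predicted probability (hypothetical).
# UMAP.jl provides no equivalent of `reconstruct`, so a boundary drawn in UMAP
# coordinates can only approximate the true high-dimensional boundary.
```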

Conclusion

In conclusion, while UMAP provides a powerful tool for visualizing high-dimensional data, it comes with limitations that should be recognized when using it for applications related to counterfactual explanations. The inaccurate representations of distances, cluster sizes, and decision boundaries can mislead interpretation and hinder accurate understanding of the data and decision-making process.