curiosity-ai / umap-sharp

C# library for fast embeddings projection using Uniform Manifold Approximation and Projection
MIT License
36 stars 5 forks source link

Support for generic distance calculation delegate #9

Open amaid opened 11 months ago

amaid commented 11 months ago

Closes #8 This pull request appears to be making changes to a C# codebase, likely related to a machine learning or data analysis library. The changes seem to be focused on introducing generic types and refining the code structure. Below is a description of the changes made in this pull request:

  1. DistanceCalculation Delegate Change:

    • A delegate called DistanceCalculation is modified to become a generic delegate DistanceCalculation<T> where T must implement the IUmapDataPoint interface. This change allows for more flexible distance calculations that can work with different types of data points.
  2. IUmapDataPoint Interface Addition:

    • A new interface named IUmapDataPoint is added to the UMAP namespace. This interface represents a single data point that will be processed by the UMAP algorithm. It requires implementing classes to provide a Data property, likely representing the data associated with the data point.
  3. NNDescent Changes:

    • The NNDescent class, which appears to be related to nearest neighbor descent, is updated to be a generic class NNDescent<T> where T must implement the IUmapDataPoint interface. This change ensures that the class can work with different types of data points.
  4. SIMD and SIMDInt Changes:

    • The SIMD and SIMDInt classes are updated to become generic classes SIMD<T> and SIMDInt<T>. This change likely reflects the need to work with different data types (possibly floating-point and integer) depending on the data points used.
  5. Tree Changes:

    • The Tree class is updated to become a generic class Tree<T> where T must implement the IUmapDataPoint interface. This change ensures that the class can work with different types of data points.
  6. Umap Class Changes:

    • The Umap class is updated to become a generic class Umap<T> where T must implement the IUmapDataPoint interface. This change allows the UMAP algorithm to operate on different types of data points.
    • The DistanceCalculation field is changed to DistanceCalculation<T> to reflect the use of a generic distance calculation delegate.
    • Various methods and internal structures within the Umap class are adjusted to accommodate the use of generic data types.
  7. Distance Functions:

    • The DistanceFunctions class defines different distance calculation functions such as cosine and Euclidean distance. These functions are updated to accept generic data types (T) instead of float arrays.
  8. OptimizationState Class:

    • The OptimizationState class appears to store various parameters and state information related to the UMAP algorithm. It is not directly impacted by the generic changes but is part of the UMAP class.

In summary, this pull request introduces generic type support for the UMAP algorithm, allowing it to work with different data point types while maintaining flexibility in distance calculations and optimizations. Additionally, it adds an interface (IUmapDataPoint) for representing data points processed by UMAP.

amaid commented 11 months ago

@theolivenbaum - Pull request created to address issue #8 (Distance calculation using additional data)

Arlodotexe commented 11 months ago

đŸ‘‹ @amaid You've essentially abstracted away each separate vector in the embedding, rather than abstracting away the vector array float[]. I'm going to open a PR to your fork to fix this, which will show here once merged.

Arlodotexe commented 11 months ago

@amaid I've got a PR opened on your fork. Once merged, the changes should show in this PR automatically.

Needed for https://github.com/Arlodotexe/OwlCore.AI.Exocortex

amaid commented 11 months ago

@amaid I've got a PR opened on your fork. Once merged, the changes should show in this PR automatically.

Needed for https://github.com/Arlodotexe/OwlCore.AI.Exocortex

Awesome! merged.

Arlodotexe commented 11 months ago

@theolivenbaum I recommend squashing this PR when merging/closing.

I'm also aware that this is a breaking change to the library, which will need a bit of discussion and possibly a migration guide.

Or, maybe we should just ship an inbox non-generic implementation to polyfill what we had before? Something like:

public class RawVectorArrayUmapDataPoint : IUmapDataPoint
{
    public RawVectorArrayUmapDataPoint(float[] data) => Data = data;

     public float[] Data { get; }

    /// <summary>
    /// Define an implicit conversion operator from <see cref="float[]"/>.
    /// </summary>
    public static implicit operator RawVectorArrayUmapDataPoint(float[] data)  => new(x);   

    /// <summary>
    /// Implicit conversation back to <see cref="float[]"/>.
    /// </summary>
    public static implicit operator float[](RawVectorArrayUmapDataPoint x) => x.Data;   
}

public class Umap : Umap<RawVectorArrayUmapDataPoint>
{
    public Umap(
            DistanceCalculation<RawVectorArrayUmapDataPoint> distance = null,
            IProvideRandomValues random = null,
            int dimensions = 2,
            int numberOfNeighbors = 15,
            int? customNumberOfEpochs = null,
            ProgressReporter progressReporter = null)
                : base(distance, random, dimensions, numberOfNeighbors, customNumberOfEpochs, progressReporter)
        {
             // ...
        }
}

The implicit conversion in RawVectorArrayUmapDataPoint should allow the library consumer to use the library as they were before, passing a float[] directly to a new Umap(...).InitializeFit(someFloatArray); without the generics.

amaid commented 9 months ago

@theolivenbaum I recommend squashing this PR when merging/closing.

I'm also aware that this is a breaking change to the library, which will need a bit of discussion and possibly a migration guide.

Or, maybe we should just ship an inbox non-generic implementation to polyfill what we had before? Something like:

public class RawVectorArrayUmapDataPoint : IUmapDataPoint
{
    public RawVectorArrayUmapDataPoint(float[] data) => Data = data;

     public float[] Data { get; }

    /// <summary>
    /// Define an implicit conversion operator from <see cref="float[]"/>.
    /// </summary>
    public static implicit operator RawVectorArrayUmapDataPoint(float[] data)  => new(x);   

    /// <summary>
    /// Implicit conversation back to <see cref="float[]"/>.
    /// </summary>
    public static implicit operator float[](RawVectorArrayUmapDataPoint x) => x.Data;   
}

public class Umap : Umap<RawVectorArrayUmapDataPoint>
{
    public Umap(
            DistanceCalculation<RawVectorArrayUmapDataPoint> distance = null,
            IProvideRandomValues random = null,
            int dimensions = 2,
            int numberOfNeighbors = 15,
            int? customNumberOfEpochs = null,
            ProgressReporter progressReporter = null)
                : base(distance, random, dimensions, numberOfNeighbors, customNumberOfEpochs, progressReporter)
        {
             // ...
        }
}

The implicit conversion in RawVectorArrayUmapDataPoint should allow the library consumer to use the library as they were before, passing a float[] directly to a new Umap(...).InitializeFit(someFloatArray); without the generics.

This is implemented.

Arlodotexe commented 9 months ago

You changed all the internal T to a hardcoded type (the new one), the code I provided works without making further changes to Umap<T>. We'll need to correct this.

amaid commented 9 months ago

You changed all the internal T to a hardcoded type (the new one), the code I provided works without making further changes to Umap<T>. We'll need to correct this.

This has been fixed, the implicit typecasting has been implemented to support explicit float[][] data to support existing class consumers without changing the existing implementation of Umap

@theolivenbaum The PR is ready to be merged. The changes have been tested for existing class consumers as well. The unit tests are passing.

image

Arlodotexe commented 3 months ago

Great, thanks @amaid!

@theolivenbaum Bumping