
What to do with VBuffer? #608

Closed eerhardt closed 5 years ago

eerhardt commented 6 years ago

VBuffer is a critical, fundamental type in ML.NET, and as such it has important performance considerations and characteristics it needs to adhere to.

However, it is not as intuitive of a type as it could be. We even wrote a Care and Feeding doc that describes some common pitfalls and other things to be aware of when using the type.

At its core, VBuffer has 2 responsibilities:

  1. It is a vector that can be either dense or sparse.
  2. It is a reusable/cached "buffer" to allow minimal allocations and garbage collections.

We should do some investigation into what types of improvements we can make to this type.
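
For context, here is a rough sketch of the type's current shape, reconstructed from the properties discussed in this issue; it is illustrative, not the actual ML.NET source:

    public struct VBuffer<T>
    {
        public readonly int Length;    // logical length of the vector
        public readonly int Count;     // number of explicitly represented values
        public readonly T[] Values;    // may be null; may be longer than Count (spare capacity)
        public readonly int[] Indices; // slot index of each explicit value; ignored when dense

        // A vector is dense when every slot is explicitly represented.
        public bool IsDense => Count == Length;
    }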

/cc @TomFinley @Zruty0

Zruty0 commented 6 years ago

As for the ArrayPool, there are some problems with this approach:

Zruty0 commented 6 years ago

To recap an online discussion: we should separate the two concepts of VBuffer: (1) serving as a polymorphous 'dense or sparse' vector, and (2) serving as a reusable buffer that retains arrays and tracks their 'capacity' and 'logical length'.

For the first one, the plan is to get rid of the polymorphous behavior and formally separate it into SparseVector and DenseVector.

For the second one, the current approach is fine (i.e. DenseVector<T> will be a struct with two fields: T[] Values and int Length)
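
A minimal sketch of what that separation might look like, using the names from this comment (this split was ultimately rejected; see the "Alternative Approaches Considered" section later in the thread):

    public struct DenseVector<T>
    {
        public T[] Values; // one entry per slot
        public int Length; // logical length of the vector
    }

    public struct SparseVector<T>
    {
        public T[] Values;    // explicitly represented values only
        public int[] Indices; // slot index of each explicit value
        public int Length;    // logical length of the vector
    }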

markusweimer commented 6 years ago

> Retaining arrays if possible and tracking 'capacity' and 'logical length' of arrays.

Isn't this something where Span<T> could help? That is also designed to reuse the memory of differently-sized, array-ish accessors, right?

> Serving as a polymorphous 'dense or sparse' vector

The closest thing to that in the BCL is Tensor<T>. Do we have ideas on how to merge VBuffer with that?

ericstj commented 6 years ago

Your DenseVector<T> actually sounds very much like Memory<T>, except Memory<T> can also support an offset and be backed by a native pointer.

Another scenario we have for this is wrapping memory that's owned by someone else. Consider the TensorFlow case. TF will allocate tensors (and unfortunately doesn't seem to provide a hook for callback allocation or pre-allocation of output tensors), and the ownership of these tensors is transferred to the caller. For Python they are able to wrap the output tensor in an ndarray, since ndarray permits wrapping memory: https://docs.scipy.org/doc/numpy-1.13.0/user/c-info.how-to-extend.html#c.PyArray_SimpleNewFromData.

Unfortunately, since the tensor/vector type in ML.NET is VBuffer<T>, and VBuffer<T> only exposes managed arrays, which must live on the managed heap, we cannot do this.

To address this scenario, consider changing the type that ML.NET exposes for tensor/vector data to something that can wrap a native pointer (like Memory/Span), or modify VBuffer itself so that it can represent data backed by a native pointer. I started experimenting with this idea a few months back with Tensor.
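
As a rough sketch of the "wrap native memory" direction, the BCL's MemoryManager<T> can expose a native pointer as Memory<T>; the class and names below are illustrative, not an ML.NET or TensorFlow API:

    using System;
    using System.Buffers;

    // Wraps a natively allocated buffer (e.g., a TensorFlow output tensor) so it
    // can be exposed as Memory<T>/Span<T> without copying.
    internal sealed unsafe class NativeMemoryManager<T> : MemoryManager<T>
        where T : unmanaged
    {
        private readonly T* _pointer;
        private readonly int _length;

        public NativeMemoryManager(T* pointer, int length)
        {
            _pointer = pointer;
            _length = length;
        }

        public override Span<T> GetSpan() => new Span<T>(_pointer, _length);

        public override MemoryHandle Pin(int elementIndex = 0)
            => new MemoryHandle(_pointer + elementIndex);

        public override void Unpin() { }

        // This is where the native tensor would be released when done.
        protected override void Dispose(bool disposing) { }
    }

The Memory property inherited from MemoryManager<T> could then be handed to downstream code without a copy.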

I do think we can try to merge the ideas of VBuffer with Tensor. We may find that both have good ideas and we end up with a hybrid.

KrzysztofCwalina commented 6 years ago

I quickly searched through current ML.NET APIs and it does not seem like there are any APIs that return VBuffer (except some inconsequential factory methods). And so it seems like we could use Span<T> instead of Memory<T>. If we indeed could use Span<T>, I think it would be a non-trivial perf win. Not only because of Span<T> being faster than a wrapper over arrays, but also because it could allow us to avoid heap allocations and pooling in many cases, i.e. we would stackalloc Span<T>.
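
For illustration, a hypothetical snippet of the stackalloc idea (not a proposed API):

    static float SumSmallVector()
    {
        // Small, short-lived buffers can live entirely on the stack: no heap
        // allocation, no pooling, and nothing for the GC to collect.
        Span<float> values = stackalloc float[16];
        values.Fill(1.0f);

        float sum = 0;
        foreach (float v in values)
            sum += v;
        return sum;
    }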

eerhardt commented 6 years ago

Issues

Here are the problems with the current VBuffer design:

  1. Values and Indices arrays may be null, which forces callers to be careful or else get a NullReferenceException.
  2. Length and Count properties are confusing. They seem to mean the same thing, but don't.
  3. The lengths of the Values and Indices arrays have no logical meaning to the vector, and should only be used when checking for capacity to reuse the arrays.
  4. Since Values and Indices are arrays, there is no safety that passing a VBuffer into a method won't modify the values in it.
    • This is the reason we have Span<T> and ReadOnlySpan<T>.

See the Golden Rules section of the Care and Feeding doc as well. The proposal below will either fix or drastically reduce the need for 4 of the 5 rules listed there. (The exception is the "two distinct VBuffers sharing references to their internal arrays" rule.)

Proposal

Here are the proposed design changes I've come up with. Please provide any feedback on the design/names/etc.

  1. Keep Length, but hide/private Count. With proposed change (4) below, users can get the number of explicitly represented values by calling GetValues().Length.
  2. Make VBuffer a readonly struct.
    • It was always designed as readonly; we just didn't have the C# capability before now.
  3. In methods that only actually read values from a VBuffer, change that parameter from ref to in.
    • This guarantees that the method won't change the reference, as was always the intention. We just didn't have the C# capability before now.
  4. Change the public T[] Values and public int[] Indices to public ReadOnlySpan<T> GetValues() and public ReadOnlySpan<int> GetIndices().
    • This change fixes issues (3) and (4) above.
    • The Length of these spans will always be correct. For dense VBuffers, GetIndices() will be empty. For sparse VBuffers, GetValues() and GetIndices() will always return spans that have the same length. (A short read-only usage sketch follows this list.)
  5. The most drastic proposed change is how to actually mutate a VBuffer.
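
To illustrate changes (3) and (4), reading code under the proposed API might look like this (Sum is a hypothetical helper, not part of the proposal):

    public static float Sum(in VBuffer<float> vector)
    {
        // 'in' guarantees this method cannot reassign the caller's buffer.
        // GetValues() has exactly the right length: all slots when dense,
        // only the explicitly represented values when sparse.
        ReadOnlySpan<float> values = vector.GetValues();
        float sum = 0;
        foreach (float value in values)
            sum += value;
        // Non-represented entries of a sparse vector are zero, so they
        // contribute nothing to the sum.
        return sum;
    }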

VBuffer mutation

There are generally 2 types of VBuffer mutation:

  1. Changing the individual elements of the VBuffer - for example, scaling a vector.
  2. Reusing the underlying values and indices arrays to create a new VBuffer - for example, changing the LogicalLength or sparsity/density of the vector.

To solve both of these scenarios, we will introduce a new concept: VBufferEditor<T>.

A VBufferEditor<T> is a lightweight, stack-only* structure that will:

  1. Give access to public Span<T> Values and public Span<int> Indices so code can modify the individual elements in the buffers.
  2. Ensure the Values and Indices buffers have the appropriate capacity.
  3. Create new VBuffer instances using its cached buffers.

(*) This is stack-only so (a) it can cache the Span properties and (b) no one can hold onto an instance of the editor outside of the stack. The full mutation must happen immediately.

The following is a listing of the proposed public API:

    public static class VBufferEditor
    {
        public static VBufferEditor<T> Create<T>(
            ref VBuffer<T> destination,
            int newLogicalLength,
            int? valuesCount = null,
            bool keepOldOnResize = false);
    }

    public ref struct VBufferEditor<T>
    {
        public readonly Span<T> Values;
        public readonly Span<int> Indices;

        public bool CreatedNewValues;
        public bool CreatedNewIndices;

        public VBuffer<T> Commit();
    }

Here are examples of the typical usages from above:

  1. Changing the individual elements of the VBuffer

    public static void ScaleBy(ref VBuffer<Float> dst, Float c)
    {
        var count = dst.GetValues().Length; 
        if (c == 1 || count == 0)
            return;
        var editor = VBufferEditor.Create(ref dst, dst.Length, count);
        if (c != 0)
            CpuMathUtils.Scale(c, editor.Values);
        else // Maintain density of dst.
            editor.Values.Clear();
    }
  2. Reusing the underlying values and indices arrays to create a new VBuffer

    
    public static void ScaleBy(in VBuffer<Float> src, ref VBuffer<Float> dst, Float c)
    {
    ...
        var srcValues = src.GetValues();
        if (src.IsDense)
        {
            // Maintain the density of src to dst in order to avoid slow down of L-BFGS.
            var editor = VBufferEditor.Create(ref dst, src.Length);
            if (c == 0)
                editor.Values.Clear();
            else
                CpuMathUtils.Scale(c, srcValues, editor.Values, src.Length);
            dst = editor.Commit();
        }
        else
        {
            var editor = VBufferEditor.Create(ref dst, src.Length, srcValues.Length);
            src.GetIndices().CopyTo(editor.Indices);
            if (c == 0)
                editor.Values.Clear();
            else
                CpuMathUtils.Scale(c, srcValues, editor.Values, srcValues.Length);
            dst = editor.Commit();
        }
    }

Notice that to create a VBufferEditor you need a ref VBuffer. If you have an in VBuffer parameter in your method, it won't be possible to create a VBufferEditor, since you can't pass an in parameter as a ref parameter. However, this isn't completely fool-proof, because a method could copy the VBuffer to a local variable and then use that local as a ref to create a VBufferEditor; the method could then modify the individual elements of the buffer. However, this is unlikely, and since the person writing the method has control of the method signature, if they really want to modify the buffer, they should change the signature to take a ref parameter.
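
A quick sketch of that loophole (hypothetical code, shown only to illustrate the limitation):

    public static void Sneaky(in VBuffer<float> src)
    {
        // Copying the struct produces a local we *can* take a ref to...
        VBuffer<float> local = src;
        var editor = VBufferEditor.Create(ref local, local.Length);
        // ...and the editor may reuse the same underlying buffers as src,
        // so this mutates data that 'src' also references.
        editor.Values.Clear();
    }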

Alternative Approaches Considered

As part of my prototyping and investigation I considered the following approaches, but they were rejected:

  1. Split VBuffer into DenseVector and SparseVector.
    • This was deemed too much of an architectural redesign across too many parts of ML.NET. For example, public interface IValueMapper only allows for a single InputType. However, if vectors were represented by 2 different types, types like this would need to be changed to handle multiple input types (and output types).
  2. Create a ref struct ReadOnlyVBuffer type that VBuffer is implicitly convertible to, and use it in methods that shouldn't modify the VBuffer in any way. (A sketch of this shape follows the list.)
    • Because of the stack-only restrictions of ref struct, this is actually a poorly performing solution because of how often VBuffers are held in arrays or member variables, or passed through lambda expressions. The VBuffer code is heavily performance-optimized today, using ref parameters everywhere so the values aren't copied on method call. Doing the conversion to ReadOnly is cheap, but it isn't free. The KMeansAndLogisticRegressionBench benchmark showed up to a 10% perf loss with this solution.
  3. Create a regular readonly struct ReadOnlyVBuffer that VBuffer is implicitly convertible to.
    • This solution can be made to work within the existing performance characteristics of VBuffer. But after giving it some thought, adding a completely new type didn't seem necessary now that we have in parameters. Adding a ReadOnly type is extra complexity that I'm not sure outweighs the benefits.
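
For reference, a rough sketch of the rejected ref struct shape from alternative (2); this is illustrative and was never merged:

    // Stack-only read view over a VBuffer. A ref struct cannot be stored in
    // arrays or member variables, or captured by lambdas, which is exactly the
    // restriction that made this perform poorly for ML.NET's usage patterns.
    public readonly ref struct ReadOnlyVBuffer<T>
    {
        private readonly ReadOnlySpan<T> _values;
        private readonly ReadOnlySpan<int> _indices;

        public readonly int Length;
        public bool IsDense => _indices.IsEmpty;

        public ReadOnlyVBuffer(in VBuffer<T> buffer)
        {
            _values = buffer.GetValues();
            _indices = buffer.GetIndices();
            Length = buffer.Length;
        }

        // The conversion copies the VBuffer struct: cheap, but not free.
        public static implicit operator ReadOnlyVBuffer<T>(VBuffer<T> buffer)
            => new ReadOnlyVBuffer<T>(buffer);
    }
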
TomFinley commented 6 years ago

Hello @eerhardt , thank you very much for writing this up. I am strongly in favor of the changes you have outlined for VBuffer, though you know this I'm sure. :) It seems to be as simple as possible, which is quite nice.

Let us therefore talk most about VBufferMutationContext. I also like this idea a lot, but since much of the motivation for this change is, as you say, to simplify potential misuse that we've seen over the years, even to the point of requiring the document you reference, we ought to try to make this as simple as possible as well. For that reason I am wondering if we can get rid of stuff.

    out bool createdNewValues,
    out bool createdNewIndices,
    bool keepOldOnResize = false

I am somewhat curious about these parameters, because I'm wondering if we can somehow do away with them. It seems like keeping old on resize might be unnecessary in the case of VBuffer specifically.

The out bool created... parameters are a bit more tricky. I am supposing, but do not know, that it is meant to service this sort of scenario:

https://github.com/dotnet/machinelearning/blob/d262171c827b70afb616cbe9397e372ee386c6c9/src/Microsoft.ML.Core/Utilities/VBufferUtils.cs#L406

Here we have a strong difference in implementation between the case where new data is created vs. old data reused, since we are modifying things in place. I wonder, however, if this is the only case where the difference in implementation is so stark, since relatively few operations work in place in this fashion. If this is really the only place it is important, I might consider just keeping Densify in the same assembly and treating it as a privileged operation.

Let us now talk about this guy.

    int? maxValuesCapacity = null,

I think what this is meant to accomplish is to act like the max parameter here.

https://github.com/dotnet/machinelearning/blob/106a84cf2dc94b13f39d6f19da9f77c034e74d77/src/Microsoft.ML.Core/Utilities/Utils.cs#L833

However, in this case at least, I think we can simplify the story a lot. The purpose of that argument is to distinguish the case where we anticipate that future allocations over the same operation might grow the buffer even further from the case where we anticipate they will not. So for example: when tokenizing text, we would anticipate that we might see larger sentences/text in the future, so allocating a bit of extra space would not be wasted. Contrariwise, if we have a fixed-size vector type, the getter for that in the dense case will never request a larger size than this, so it would be pointless to ever request more; the parameter provides that hint.

This does I suppose represent a loss of capability, since there are places in the code especially VBufferUtils.cs where I got quite frankly rather too clever about the upper bounds:

https://github.com/dotnet/machinelearning/blob/d262171c827b70afb616cbe9397e372ee386c6c9/src/Microsoft.ML.Core/Utilities/VBufferUtils.cs#L388-L389

But, honestly I'm not sure this is really worth the extra complexity. If that actually winds up making any difference I'll eat my hat. :P

So I might say the only thing that is helpful to distinguish is the case of "I won't need more than this" vs. "I might need more than this." That could be accomplished with a bool, but I think we can do even better and just make the behavior implicit:

  1. In the case where we have an empty buffer (e.g., VBuffer foo = default or somesuch, that is, the first iteration), we can resize to the tight bound of precisely the capacity requested.

  2. In the case where we have an already existing buffer and need to make it larger, we might need to make it larger again down the line, so we can have a more anticipatory allocation (as was the default case here).

https://github.com/dotnet/machinelearning/blob/d262171c827b70afb616cbe9397e372ee386c6c9/src/Microsoft.ML.Core/Utilities/Utils.cs#L833

I think this will address the common usage patterns. This would also allow us to get rid of this parameter, I hope?
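
In code, the implicit growth policy might look something like this sketch (a hypothetical helper, not the actual Utils.EnsureSize):

    static T[] EnsureCapacity<T>(T[] buffer, int requiredCapacity)
    {
        // First use (no existing buffer): allocate the tight bound exactly.
        if (buffer == null || buffer.Length == 0)
            return new T[requiredCapacity];

        // Already big enough: reuse as-is.
        if (buffer.Length >= requiredCapacity)
            return buffer;

        // Growing an existing buffer: we may well grow again later, so
        // allocate anticipatorily (doubling) rather than tightly.
        int newSize = Math.Max(requiredCapacity, buffer.Length * 2);
        return new T[newSize];
    }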

Other Questions

I had some other questions, each rather simpler than the above.

Also how is a VBuffer created in the first place? It of course suffices to create a default VBuffer and then immediately create a mutation context out of it, and there are some great advantages to having that be the way, but are there other ways? Should there be other ways?

When calling VBufferMutationContext<T>.CreateBuffer, it seems to me it would be nice if subsequent calls to it could somehow fail, or at least perform in some way that would not result in "buffer sharing."

Is it considered an error condition for VBufferMutationContext<T>.CreateBuffer to be called with a reference other than the initial input reference object? If we do set that object passed in to the default value, and the ref passed in here is non-default, this seems to me like it might represent a misusage of the pattern, since memory is necessarily then being wasted. So should we throw or disallow this by some other mechanism? And if so, should we just return a VBuffer, rather than use the ref idiom?

TomFinley commented 6 years ago

One thing that occurred to me -- might these pieces of auxiliary information like the out bool parameters happen just as fields on this mutating context? This would seem to obviate the need for the second overload, and those operators that do not care about this operation would just be able to simply ignore those fields.

eerhardt commented 6 years ago

Thanks for the comments, @TomFinley. They are appreciated. I've updated the proposal above as appropriate, and am responding to your questions/comments below:

> It seems like keeping old on resize might be unnecessary in the case of VBuffer specifically.

This option is important when expanding the buffer to make room for new elements. Here are two cases that do this:

https://github.com/dotnet/machinelearning/blob/5e08fa1ea7bfb54f28ed0815cb6413e0068e6dd1/src/Microsoft.ML.Core/Utilities/VBufferUtils.cs#L478-L481

https://github.com/dotnet/machinelearning/blob/bee7f171941007b29671b5435ad135608fcda930/src/Microsoft.ML.Data/Depricated/Vector/VBufferMathUtils.cs#L278-L291

That way the caller doesn't have to do the copying themselves; Array.Resize will do the copying under the covers. Notice that I explicitly chose a different default value from Utils.EnsureSize: VBufferMutationContext uses keepOldOnResize = false, since most of the scenarios don't need to preserve the old values, and it is faster to create a new array than to resize, which copies the elements.
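
For illustration, an append-style edit would opt in explicitly (a hypothetical call against the proposed API; the variable names are made up):

    // Growing dst to hold srcValueCount additional explicit values while
    // keeping the existing elements: keepOldOnResize = true makes the editor
    // copy them (Array.Resize-style) if the buffers must grow.
    var editor = VBufferEditor.Create(ref dst,
                                      newLogicalLength,
                                      dstValueCount + srcValueCount,
                                      keepOldOnResize: true);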

> The out bool created... parameters are a bit more tricky. I am supposing, but do not know, that it is meant to service this sort of scenario:

Correct. There is another place that uses it: https://github.com/dotnet/machinelearning/blob/bee7f171941007b29671b5435ad135608fcda930/src/Microsoft.ML.Data/Depricated/Vector/VBufferMathUtils.cs#L371-L372

So I thought it would be important to preserve this option, somehow.

I just looked, and I haven't gotten to a place yet where I have used createdNewIndices, but I haven't gotten through everything yet. It just seemed symmetrical to have both options.

> might these pieces of auxiliary information like the out bool parameters happen just as fields on this mutating context?

Yes, I can totally make that change. It should clean the API up.

> maxValuesCapacity

Yes, this gets passed into the max parameter of Utils.EnsureSize.

There appears to be a mix of usages here in the existing code.

Some specify the LogicalLength for the max capacity, ex:

https://github.com/dotnet/machinelearning/blob/bee7f171941007b29671b5435ad135608fcda930/src/Microsoft.ML.Data/Depricated/Vector/VBufferMathUtils.cs#L370

and some let it default to Utils.ArrayMaxSize, ex:

https://github.com/dotnet/machinelearning/blob/8e77d0f0ce74cdb8b10fa545830972a4525d5dab/src/Microsoft.ML.Data/Data/RowCursorUtils.cs#L448-L449

Part of me thinks keeping newLogicalLength as the maxCapacity makes sense, especially in the case where you aren't ever going to need to allocate again. That way, if you only need 1 more element, you aren't doubling the array size. But, in the opposite worst case scenario, where each mutation grows the vector by a single element, it will allocate a lot of new arrays, and you might as well have doubled it the first time you needed more capacity.

After some thinking, my gut is telling me to drop the parameter, and let the array double (your proposal above), for now. If we run into problems, we can add an overload/default parameter in the future.

Thoughts on that plan?

> Also how is a VBuffer created in the first place?

I wasn't planning on changing the existing constructors to VBuffer. Let me know if you think that is a problem.

> VBufferMutationContext.CreateBuffer

I think it makes sense to change this to return the buffer instead. After that, we can "poison" the mutation context, and throw an exception if someone calls CreateBuffer again.
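
A sketch of what that poisoning could look like inside the editor (the private fields and the VBuffer constructor shape here are assumptions for illustration):

    public VBuffer<T> Commit()
    {
        if (_committed)
            throw new InvalidOperationException(
                "This editor was already committed; committing twice would produce two VBuffers sharing the same buffers.");
        _committed = true;

        // Hand the cached buffers to exactly one new VBuffer.
        return new VBuffer<T>(_logicalLength, _valuesCount, _values, _indices);
    }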

> If we do set that object passed in to the default value

I'm still not convinced we should do that. I think it is surprising to a caller. And there are some cases where we'd like to still access the VBuffer after creating a mutation context, ex:

https://github.com/dotnet/machinelearning/blob/bee7f171941007b29671b5435ad135608fcda930/src/Microsoft.ML.Data/Depricated/Vector/VectorUtils.cs#L162

It also isn't fool-proof, because someone could copy the VBuffer on the stack, and even if we blanked out the ref being passed in, they would still have access to the buffer with the copy.

Would the above plan to have CreateBuffer return the created VBuffer work for now, or do you still think that we should be blanking out the ref being passed in?