IntelLabs / vdms

VDMS: Your Favorite Visual Data Management System
MIT License

Descriptor Set Optimizations: Batch Inserts #175

Closed ifadams closed 1 week ago

ifadams commented 1 month ago

Is your feature request related to a problem? Please describe. In the current API, we are restricted to inserting one feature vector per query. This is inefficient: it leads both to a proliferation of queries and to one-at-a-time calls to VCL internally.

Describe the solution you'd like Instead, we should be able to insert multiple feature vectors within a single call.

Additional context

The current proposed solution is a tweak to the existing API such that we have optional fields that specify the size of the incoming vector(s) and the number that are packed into the blob.

This should allow legacy applications to continue to use the AddDescriptor call as-is, but modified apps can make use of the new fields and vector packing capability to be able to batch multiple inserts into a single call.
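A rough sketch of what the tweaked call could look like. This is a hypothetical shape, not the final API: the `vector_count` field name and set name are made up for illustration, and the blob packing assumes float32 vectors.

```python
import struct

DIM = 128            # assumed: dimension fixed when the set was created
NUM_VECTORS = 3      # value of the hypothetical new optional field

# Pack three float32 vectors back to back into a single blob.
vectors = [[float(i)] * DIM for i in range(NUM_VECTORS)]
blob = b"".join(struct.pack(f"{DIM}f", *v) for v in vectors)

# Hypothetical batched AddDescriptor query; "vector_count" is an
# illustrative name for the proposed optional field.
query = [{
    "AddDescriptor": {
        "set": "example_set",        # assumed set name
        "vector_count": NUM_VECTORS  # hypothetical optional field
    }
}]

# With an explicit count, the server can validate the blob length:
assert len(blob) == NUM_VECTORS * DIM * 4
```

A legacy client that omits the new field would send exactly the query it sends today, with a single vector in the blob.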

ifadams commented 1 month ago

@s-gobriel @araghuna1

Starting to stub this out and realizing we may still need a tweak to the external API: an optional field where a user can specify the number of incoming vectors that are packed together.

This is because we can otherwise run into ambiguous situations. e.g., a client could send a 256-byte blob to a descriptor set that expects 128 bytes per vector, and we have no way of telling whether that's 2 separate vectors or 1 large one just by math checks.

s-gobriel commented 1 month ago

Agreed. Tweaking the API to include an explicit number of vectors removes any ambiguity and will allow the code to throw errors if vectors with the wrong dimensions are sent to the server.

araghuna1 commented 1 month ago

Ohh.. I thought we had discussed getting the expected vector length (by requiring an additional parameter in the AddDescriptor or earlier AddDescriptorSet call). So yes, that would be a change to the external API.

Are you saying that if we know the expected FV length, it still won't be possible to tell number of vectors in a blob?

s-gobriel commented 1 month ago

You are correct, the vector length (dimension) is a parameter that is set when the index is initiated.

It is possible to implicitly know the number of vectors because the dimension is constant per descriptor set, so simply dividing the blob length by the dimension should yield the number of vectors. However, if a client (in error) adds a vector of length (X*dimension), it will be treated as adding X vectors instead of throwing a dimension-mismatch error.
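To make the failure mode concrete, here's a minimal sketch of the implicit-count inference (the helper function is hypothetical, just illustrating the division check):

```python
def inferred_vector_count(blob_len, bytes_per_vector):
    """Infer the number of packed vectors from blob size alone."""
    if blob_len % bytes_per_vector != 0:
        raise ValueError("blob length is not a multiple of the vector size")
    return blob_len // bytes_per_vector

# The thread's example: a 256-byte blob sent to a set expecting
# 128 bytes per vector. The division "succeeds", so a single
# wrongly-sized vector is silently treated as two valid vectors.
count = inferred_vector_count(256, 128)
assert count == 2  # no dimension-mismatch error is ever raised
```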

ifadams commented 1 month ago

The good news here is that this can be an optional field, so if the number of vectors isn't specified, we can implicitly assume it's only 1 and throw an error if it's not a dimension match.

There are two other snags I'm realizing, and need to rope @cwlacewe in here for her opinion on this as well

one is easy enough to address, the other is going to be problematic (@s-gobriel and @araghuna1 should chime in too)

1) If we include a link block as part of the query, we are implicitly saying "all vectors we are adding within this AddDescriptor call will be connected via the specified link logic". I think we can just document this as known behavior, and the implementation of it shouldn't be too bad. We can do the same for any added properties.

2) More complicated one: it's possible to have references back to the vector(s) as a valid call. This is more problematic. This takes a moment to explain, so bear with me.

Short version (before the long rambling explanation): we probably want to disable use of the _ref tag for batch inserts, as it gets extremely messy very, very quickly. Peculiarities of VDMS and PMGD in combination make this very tricky to leave in. I also want to make it clear that for single-vector inserts, it would work as-is; we'd only disable it for batch inserts.

The VDMS API allows a "_ref" tag, which lets us daisy-chain and refer to prior queries, e.g. for creating links within a single transaction. So I could use a reference of "123" when adding a descriptor/vector, and later link to that FV entity via an explicit AddConnection call within the same transaction.
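The single-insert _ref pattern described above looks roughly like this (the entity class, properties, and connection class are made up for illustration; the field names follow the VDMS JSON API as described in this thread):

```python
# One transaction: add a descriptor with _ref 123, add an entity,
# then link the two with an explicit AddConnection.
transaction = [
    {"AddDescriptor": {
        "set": "example_set",
        "_ref": 123               # reference to the new descriptor
    }},
    {"AddEntity": {
        "class": "Frame",         # illustrative entity class
        "_ref": 124,
        "properties": {"frame_id": 7}
    }},
    {"AddConnection": {
        "class": "describes",     # illustrative connection class
        "ref1": 123,              # links back to the descriptor above
        "ref2": 124
    }},
]

# The connection resolves only because _ref 123 identifies exactly
# one descriptor; a batch insert would break that assumption.
assert transaction[2]["AddConnection"]["ref1"] == transaction[0]["AddDescriptor"]["_ref"]
```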

The problem here is that a reference implicitly refers to the results/actions of a single query. However, if we insert a batch of vectors, we don't want multiple queries using the same reference tag, as it gets very confusing (trust me) and could possibly break things.

There's a temptation to internally say "Okay, fine, we'll do some logic to get the same effect", i.e. use a scheme to make sure they all get unique "_ref" tags, but that just multiplies the problem.

Let's say I insert five vectors in a VDMS-level call with a _ref tag of 123. Internally we turn this into 5 separate insert calls to PMGD. We could do something (metaphorically) like "okay, let's have tags 1231, 1232, 1233, 1234, 1235" to represent those calls. However, if a subsequent call uses that reference to create a link or filter, we again have 5 separate queries to manage internally.... and even worse, if that follow-on query had a reference of its own, we're in a potential explosion, as each of the follow-on queries would also need its own new set of references....
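The fan-out can be sketched in a few lines (the ref-derivation scheme here is purely hypothetical, mirroring the metaphorical "1231...1235" tags above):

```python
def expand_refs(base_ref, n):
    """Hypothetical internal scheme deriving n unique refs from one."""
    return [base_ref * 10 + i for i in range(1, n + 1)]

# One client-level _ref of 123 for a 5-vector batch becomes 5
# internal PMGD refs...
refs = expand_refs(123, 5)
assert refs == [1231, 1232, 1233, 1234, 1235]

# ...and every follow-on query that used _ref 123 must itself be
# expanded 5 ways; if that query declared its own _ref, the
# expansion compounds with each chained query.
followups = [expand_refs(r, 5) for r in refs]
assert sum(len(f) for f in followups) == 25
```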

I realize this is a lot of mess, but I hope it gets across the point that this is something I don't think we can programmatically get our way out of. The only solution, I think, is to say "if you use a batch insert, you cannot make use of references, as they cannot be disambiguated internally".

araghuna1 commented 1 month ago

2. More complicated one: It's possible to have references back to the vector(s) as a valid call.

Excellent catch, Ian! I gave it some thought too, and (as long as we try to address this outside of PMGD) I agree with your assessment that this can get messy. Hence, disabling use of _ref for batch inserts sounds OK to me. Since creating a link the other way around is possible, hopefully any need to create connections to the descriptors added in a batch can still be met (maybe by modifying the order of steps in a transaction). We can revisit if we find a use case that just cannot work without _ref.

araghuna1 commented 1 month ago

Another related complication: if there is a need for unique meta properties per descriptor (we have a use case requiring this, right @cwlacewe?), can this be handled? I suspect handling this need might require more client-visible API changes (like a list of values per property..?)

ifadams commented 1 month ago

RE: the unique metadata properties per descriptor, oof.... yeah.... we could theoretically modify the API to allow effectively a list of property blocks to be included within the add descriptor calls. Need to play with the API definition JSON a bit to see if this is possible/how hard it is.

Though I'm now actually wondering if this might make a decent "batch" interface more generally for images as well, but lets shove that thought further out....

ifadams commented 1 month ago

@s-gobriel @araghuna1 Okay, so continuing to poke at this, a few updates for you guys to ponder based on these conversations.

What I'm thinking is we extend the AddDescriptor API slightly for a "batch mode".

To enable "batch mode" we make use of a new optional "PropertiesList" field (or some other name). This should let us specify multiple unique properties for each FV in a batch, rather than requiring that each FV have the same property duplicated. The presence of this field will implicitly say "We're in batch mode".

Something like [FV1:{prop1:value1, prop2:value2}, FV2....]

By doing it this way, we get a count of vectors to expect (i.e. the length of the list), which we can use to check whether the paired blob is of the correct length.

If we see this field, we can also implicitly disable _ref, throwing an error if someone tries to use the tag.

If the PropertiesList field is not present, everything works just like it already does.
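Putting the pieces together, a batch-mode query might look something like this. "PropertiesList" is the tentative field name from this thread; the property keys/values, set name, and float32 packing are illustrative assumptions:

```python
import struct

DIM = 128  # assumed set dimension

# Per-vector property blocks; keys/values are made up for illustration.
properties_list = [
    {"id": "fv1", "label": "cat"},
    {"id": "fv2", "label": "dog"},
]

query = [{
    "AddDescriptor": {
        "set": "example_set",
        "PropertiesList": properties_list  # presence implies batch mode
    }
}]

# The list length doubles as the expected vector count, so the
# server can validate the paired blob without a separate count field:
blob = struct.pack(f"{2 * DIM}f", *([0.0] * (2 * DIM)))
assert len(blob) == len(properties_list) * DIM * 4
```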

Mind you, this all assumes I can figure out the API definition tweaks.... nothing looks quite like this yet, so I'm doing some guess-and-check here.

ifadams commented 1 week ago

Closing as complete with PR to develop done.