fsprojects / FSharp.AWS.DynamoDB

F# wrapper API for AWS DynamoDB
MIT License

Enhanced support for single-table design #59

Open matti-avilabs opened 1 year ago

matti-avilabs commented 1 year ago

This is somewhat related to the caveat mentioned in the TransactWriteItems section regarding multi-table operations not being supported.

For a single-table design, this limitation also pops up if there are multiple instances of TableContext using the same table name and a need arises to write more than one record type in the same request:

let baseContext : TableContext<BaseSchema> = ...
let contextFoo : TableContext<Foo> = baseContext.WithRecordType<Foo>()
let contextBar : TableContext<Bar> = baseContext.WithRecordType<Bar>()

let fooItems : Foo list = ...
let barItems : Bar list = ...

async {
  // Separate DynamoDB requests are needed here
  let! _ = contextFoo.BatchPutItemsAsync fooItems
  let! _ = contextBar.BatchPutItemsAsync barItems
}

The most straightforward way of resolving the immediate problem would be to make the AttributeValue helper methods on RecordTemplate here public, and from there just use the SDK to make the necessary request.

If there is another approach that doesn't require working with the SDK directly, and maybe solves the problem more generally for transactions as well, I'd be happy to take a look - just let me know.
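
For illustration, building on the contextFoo/contextBar values above, a minimal sketch of that SDK-based approach might look like the following. It assumes the Template's ToAttributeValues helper were made public (the method name is an assumption), and ignores the 25-item batch limit and UnprocessedItems retries for brevity:

open System.Collections.Generic
open Amazon.DynamoDBv2
open Amazon.DynamoDBv2.Model

// Sketch only: ToAttributeValues on a record is assumed to be made public by the proposal above
let putMixedBatch (client : IAmazonDynamoDB) (tableName : string) (fooItems : Foo list) (barItems : Bar list) = async {
    let put attrs = WriteRequest(PutRequest = PutRequest(Item = attrs))
    let writes =
        [ for foo in fooItems -> put (contextFoo.Template.ToAttributeValues foo)
          for bar in barItems -> put (contextBar.Template.ToAttributeValues bar) ]
    let request = BatchWriteItemRequest(RequestItems = Dictionary(dict [ tableName, ResizeArray writes ]))
    let! response = client.BatchWriteItemAsync request |> Async.AwaitTask
    // unprocessed writes would still need to be retried by the caller
    return response.UnprocessedItems }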

bartelink commented 1 year ago

My instinct would be to try to add a BatchWriteItem call which takes a seq<BatchWrite>, enabling arbitrary mixing of Put and Delete calls, but with the final request dispatch and metrics managed by a single API - not dissimilar to the way that TransactWriteItems is implemented (except that it's hard-wired to one TRecord, see below). Something approximately like:

let baseContext : TableContext<BaseSchema> = ...
let contextFoo : TableContext<Foo> = baseContext.WithRecordType<Foo>()
let contextBar : TableContext<Bar> = baseContext.WithRecordType<Bar>()

let fooItems : Foo list = ...
let barItems : Bar list = ...

async {
  let! residual = baseContext.BatchWrite( [contextFoo.Batch.Put fooItems; contextBar.Batch.Put barItems])
}

The baseContext would define where the metrics go, but not the tableName. A BatchWrite module can map the high-level typed requests down to tuples of e.g. table names and attributes, with TableContext.BatchWrite composing them into a single BatchWriteItem request.

It should then be possible to rebase the existing BatchPut/Delete on top of that API.
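
A rough sketch of the shapes being described, purely for illustration (none of these types or members exist in the library today):

open System.Collections.Generic
open Amazon.DynamoDBv2.Model

// Illustrative only: a low-level representation the typed requests could be lowered to
type BatchWrite =
    | Put of tableName : string * items : Dictionary<string, AttributeValue> list
    | Delete of tableName : string * keys : Dictionary<string, AttributeValue> list

// contextFoo.Batch.Put   : seq<Foo> -> BatchWrite                      // lowers typed items to table name + attributes
// baseContext.BatchWrite : seq<BatchWrite> -> Async<BatchWrite list>   // composes one request; returns unprocessed writes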

(It might also make sense to deprecate the original Transact API and provide a .Transact equivalent of each .Batch construct, but that can be a separate PR. If you have any ideas, I have a basic usage in Equinox, but it's only single table and single schema, so it's obviously not going to be the best validation of the flexibility of such an API.)

In general I'd say Sam's stance is not to go down the road of exposing low-level stuff like `AttributeValue` if that can be avoided, and I tend to agree.

Not sure how one would exactly map the residual (unprocessed) ones back to the same request structure in a useful way, though.


I guess at the point where there are multiple tables involved, things get messier - how do you pick the context to issue it through? Something like this would be discoverable in what I imagine is the way most people would find it (dotting in an IDE):

  let! residual = baseContext.BatchWrite( [contextFoo.Batch.Put fooItems; contextBar.Batch.Put barItems])

Removing the ambiguity of the redundant tableName and TRecord implied by baseContext (but way less discoverable) would be something like:

let dynamo : DynamoContext = DynamoContext(client : IAmazonDynamoDB, ?metricsCollector : RequestMetrics -> unit)
let! residual = dynamo.Batch.Write( [contextFoo.Batch.Put fooItems; contextBar.Batch.Put barItems])

samritchie commented 1 year ago

I'm not sure I'm sold on supporting single-table design directly in the library - I can see the appeal, and it’s definitely something I've (briefly) looked at in the past, but my gut feel is that as it stands, this is better done by using a single record type corresponding directly to the table, with your multiple type mappings in a layer on top of that.
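
For illustration, a minimal sketch of that suggestion, with domain types and field names invented purely as an example:

open FSharp.AWS.DynamoDB

// Domain types, invented for the example
type Foo = { FooId : string; Name : string }
type Bar = { BarId : string; Count : int }

// One record type corresponding directly to the table; fields not used by an item kind stay None
type BaseSchema =
    { [<HashKey>] PK : string
      [<RangeKey>] SK : string
      FooName : string option
      BarCount : int option }

// Application-layer mapping between the domain types and the table record
module BaseSchema =
    let ofFoo (f : Foo) = { PK = f.FooId; SK = "FOO"; FooName = Some f.Name; BarCount = None }
    let ofBar (b : Bar) = { PK = b.BarId; SK = "BAR"; FooName = None; BarCount = Some b.Count }
    let tryToFoo (r : BaseSchema) = r.FooName |> Option.map (fun name -> { FooId = r.PK; Name = name })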

However, I'm open to having a discussion about how proper support would work & what should be supported. I don't use single table myself so only have a high-level understanding of how it’s done in practice. Things to consider are:

There’s a lot of behaviour here that would somehow need to be generically baked into the library and communicated to consumers, whereas I think it’s almost certainly going to be clearer to implement this in your own regular F# code (I suspect most people don't need to address most of these) and leave the library to just handle the mechanics of Dynamo comms. If people find they’re implementing the same patterns over & over on their single tables, this is perhaps something that we can look at generalising into a separate helper library?

I realise the body of this issue isn't really about full single-table support - I think there would be value in redesigning Batch Write to properly support multiple tables, not just multiple contexts on a single table. I will create a separate issue.

bartelink commented 1 year ago

The new issue is slightly confusing as posed, in that its title talks about Transact, but the example in the body is doing Batch stuff.

The reason I brought that into this issue is primarily to ensure that Batch and Transact features being added use a similar style where possible for reasons of discoverability.

As and when the other issue yields workable Batch support, supporting multi-Table Transact calls can follow its lead (possibly by also adding a .Transact.<Op> series of helpers to construct TransactWrite DU items vs the ugly use of table.Template.PrecomputeConditionalExpr etc in https://github.com/fsprojects/FSharp.AWS.DynamoDB#transactwriteitems).
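
To make that concrete, a purely hypothetical shape in the same style as the Batch sketch above (none of these members exist today; fooItem and barKey stand in for an item and a TableKey):

  let! residual = baseContext.TransactWrite [ contextFoo.Transact.Put fooItem; contextBar.Transact.Delete barKey ]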

matti-avilabs commented 1 year ago

Thanks for taking the time to consider this. Don't have much to add unfortunately and can certainly understand the reasoning behind keeping lower level stuff internal.

Just to provide some context - we basically have a fork of the library where we use the TableContext API as much as possible, but dip into the RecordTemplate stuff in combination with the SDK for non-standard cases that don't appear to be achievable via TableContext. So from my perspective there seems to be a lot of value in some type of standalone project/module that solves the problem of mapping records to AttributeValue key-values. But I'm not super familiar with the F#/C# ecosystem so perhaps this type of stuff already exists out there.

bartelink commented 1 year ago

It's really useful to have specific cases like that surfaced. Can you perhaps illustrate some of those custom usages (or is it all Batch Puts?)

Also, wrt the batch requests, do you have any views on what would be useful/necessary in terms of returning the results (unprocessed requests)?

One thought I had was potentially to have a .Internal property on TableContext alongside .BatchWrite and .TransactWrite as an explicit way to expose relevant AttributeValue stuff - both as a way to avoid cluttering the top level API, and a way to say "should you really be doing that".

But if you're guaranteeing not to make binary breaking changes down that path, that does change the equation; you obviously need to spend a lot more time thinking about whether that makes future changes problematic.

matti-avilabs commented 1 year ago

My thought here is basically that there seems to be a lot of really useful stuff that would help with building out requests for the SDK that's not directly accessible. Don't really have any concrete suggestion here, just wanted to point out that this stuff may be useful in a vacuum!

An example here might be if I really needed multi table transactions. It feels like most of the heavy lifting has already been done by the library, relieving most of the headache involved with constructing such a request:

let tableCtxFoo : TableContext<Foo> = ...
let tableCtxBar : TableContext<Bar> = ...

let tableKey = TableKey.Combined(...)
let fooUpdate = Amazon.DynamoDBv2.Model.Update(TableName = tableCtxFoo.TableName, Key = tableCtxFoo.Template.ToAttributeValues(tableKey))
let updater = tableCtxFoo.Template.PrecomputeUpdateExpr <@ ... @>
fooUpdate.ExpressionAttributeNames <- toDictionary updater.Names
fooUpdate.ExpressionAttributeValues <- toDictionary updater.Values
fooUpdate.UpdateExpression <- updater.Expression
let conditionalExpr = tableCtxFoo.Template.PrecomputeConditionalExpr <@...@>
fooUpdate.ConditionExpression <- conditionalExpr.Expression

let barUpdate = ...

// each SDK Update has to be wrapped in a TransactWriteItem before being added to the request
let transactReq = Amazon.DynamoDBv2.Model.TransactWriteItemsRequest()
transactReq.TransactItems <-
    ResizeArray [ Amazon.DynamoDBv2.Model.TransactWriteItem(Update = fooUpdate)
                  Amazon.DynamoDBv2.Model.TransactWriteItem(Update = barUpdate) ]

Not a perfect example, but hopefully it illustrates how a lot of the stuff is almost there to go off on your own, but not quite 😄

arnarthor commented 1 year ago

A more concrete example of the usefulness of having the serialisers public is the use of DynamoDB Streams. Currently, if we write to a table using the table context, the data gets automatically serialised with the internal picklers.

Once this same data hits the DynamoDB Stream, we can't deserialise it again because the TableContext serialisation code is all private in the library.
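
As a sketch of the kind of usage this would enable, assuming some public deserialisation entry point on RecordTemplate (OfAttributeValues is an invented name, not an existing member):

open System.Collections.Generic
open Amazon.DynamoDBv2.Model
open FSharp.AWS.DynamoDB

// newImage would come from a stream record's NewImage dictionary
let ofStreamImage (template : RecordTemplate<'TRecord>) (newImage : Dictionary<string, AttributeValue>) : 'TRecord =
    template.OfAttributeValues newImage   // hypothetical public member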

We've been mitigating this by storing Protobuf objects in our tables, but that comes with a bunch of different tradeoffs - the biggest being that objects in the table can't be inspected by humans without a script that fetches items by key, parses the proto model, and pretty-prints it. Not a huge issue, but still a barrier for quick-and-dirty debugging sessions.

I think having the serialisers publicly available in this library would open up a ton of useful features like @matti-avilabs mentions above, but the example of the Dynamo Streams is now limiting our use of the library because of its "proprietary" data format.

samritchie commented 1 year ago

Sorry, I've been meaning to respond to this for a while but it’s taken some time to wrap my head around the best course of action.

My starting position is that the value in a library like this is mostly in its abstractions - being able to get up & running quickly and interacting with a sensible/consistent/logical API surface area, particularly given the underlying Dynamo SDK is so awful. When issues are raised requesting to make internals public, it feels like this is essentially asking "please make the library abstractions leaky so I can implement my own abstractions on top". If this approach is adopted, the long term outcome tends to be that the library becomes much less obvious & more difficult to use, and common tasks and patterns end up being routinely implemented via copy & paste of sample code.

That’s not to say that the issues being raised aren't entirely valid & worthwhile, but I'd much prefer to approach them from the direction of 'the library's abstractions are incomplete' rather than 'the library’s abstractions are not leaky enough' 😉.

With that in mind, can we keep this issue for discussion of what single-table support should look like, and I'll raise a new issue to support streams?

samritchie commented 1 year ago

I've said above that I felt the best way to implement single-table was probably via a single context with a base record of optional fields and application-layer mapping code. However, I acknowledge that there’s a lot of boilerplate there that’s avoidable with the right single-table abstraction. I'd like to understand a bit more about how people use single table in practice to see if/how this could be done.

Note I don't think the library should necessarily support all possible single-table approaches, imo the goal should be an opinionated single-table implementation that allows devs to get a solution up & running quickly.

So some of the questions we'll need to answer (somehow) are:

If there are any good resources out there for designing single tables, or publicly available real-world implementations, it would be great to see them.