fsprojects / FSharp.Azure.Storage

F# API for using Microsoft Azure Table Storage service
MIT License

Batch operations require all records to be of the same type. #4

Open · Bananas-Are-Yellow opened this issue 9 years ago

Bananas-Are-Yellow commented 9 years ago

It looks like inTableAsBatch and inTableAsBatchAsync both take a sequence of Operation<'T>, which means all the records I want to batch-insert have to be of the same type. That happens to be true in your example of inserting 200 Game entities, but it is not true in my situation.

If I understand Azure Table Storage correctly, the requirement for a batch insert is that all entities have the same partition key; the other properties present can differ between entities. Is that correct?

In memory, I have a graph structure:

type Child = ...

type Parent = {
    Children: Child []
    ...
}

I can't store this directly in Azure table storage, so I have to define entity types:

/// represents a parent
type ParentEntity = {
    /// parent guid
    [<PartitionKey>] PartitionKey: string
    /// "" (empty)
    [<RowKey>] RowKey: string
    ...
}

/// represents a reference to a child
type ParentChildEntity = {
    /// parent guid
    [<PartitionKey>] PartitionKey: string
    /// child number
    [<RowKey>] RowKey: string
    /// child guid (PK of child)
    Child: Guid
}

I want to insert one ParentEntity row and many ParentChildEntity rows. These are logically one insert, since they all belong together, and they all have the same partition key.
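
Today, as far as I can see, I can only write these rows one operation at a time. A minimal sketch, assuming a CloudTableClient named tableClient is in scope, that the table is named "Parents", and that parent and children are values built from my in-memory graph (all of these names are illustrative):

open FSharp.Azure.Storage.Table

// Assumed in scope: tableClient (a CloudTableClient), plus the values
// parent : ParentEntity and children : ParentChildEntity [] built from
// the in-memory graph. The table name "Parents" is illustrative.
let inParentsTable entity = entity |> inTable tableClient "Parents"

// One round-trip for the parent row...
let parentResult = parent |> Insert |> inParentsTable

// ...and one round-trip per child row, because Operation<ParentEntity> and
// Operation<ParentChildEntity> are different types and cannot share a batch.
let childResults = children |> Array.map (Insert >> inParentsTable)

That works, but it costs one round-trip per row and loses the atomicity of a batch.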

How should I do this as a single batch?

daniel-chambers commented 9 years ago

You currently can't do this, short of using ITableEntity-derived classes instead of F# record types.
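
Roughly, that workaround looks like the following. It's only a minimal sketch: it redefines your two entity types as classes deriving from TableEntity (which implements ITableEntity), and the table name "Parents" and the function name are illustrative.

open System
open Microsoft.WindowsAzure.Storage.Table
open FSharp.Azure.Storage.Table

// Class versions of the two entity types, replacing the records.
// Deriving both from TableEntity gives them a common base type.
type ParentEntity () =
    inherit TableEntity ()   // parent guid as PartitionKey, "" as RowKey

type ParentChildEntity () =
    inherit TableEntity ()   // parent guid as PartitionKey, child number as RowKey
    member val Child = Guid.Empty with get, set

// Upcasting everything to TableEntity lets both shapes flow through a single
// Operation<TableEntity> sequence, and therefore through shared batches.
let insertFamily tableClient (parentId : string) (childIds : Guid list) =
    seq {
        yield ParentEntity (PartitionKey = parentId, RowKey = "") :> TableEntity
        for i, childId in Seq.indexed childIds do
            yield ParentChildEntity (PartitionKey = parentId,
                                     RowKey = string i,
                                     Child = childId) :> TableEntity
    }
    |> Seq.map Insert
    |> autobatch
    |> List.map (inTableAsBatch tableClient "Parents")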

I'm planning a feature that would allow you to define a discriminated union:

type ParentOrChild = 
    | Parent of ParentEntity
    | Child of ParentChildEntity

which you could then use to insert multiple types:

let entities = [ Parent({ ... }); Child({ ... }) ]
entities |> Seq.map Insert |> autobatch |> List.map insertInMyTableAsBatch

An extra column would be added to your table that would contain "Parent" or "Child" in order to guide the deserializer to use the correct type during querying.
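
For example, the stored rows for your parent/children scenario might end up looking something like this (the discriminator column name is illustrative; nothing about the design is final):

PartitionKey    RowKey    _UnionCase    Child
<parent guid>   ""        "Parent"
<parent guid>   "0"       "Child"       <child guid>
<parent guid>   "1"       "Child"       <child guid>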

But I don't have a timeline to share for that feature at this point.

Bananas-Are-Yellow commented 9 years ago

I think that would be a good enhancement, but based on my other topic about not wanting to duplicate the partition key and row key, you can predict what I'm about to say: likewise, I would not want to store "Parent" or "Child" here, because that duplicates information I already have.

I can already tell which entity type it is from the row key. In this example the partition key is the same for both, to allow batch updates, but in general, in any situation like this, I think the two keys together should encode the union case. So I would like to pattern match on those two keys and tell you which case to deserialize, rather than storing the case in a column.
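
Something like the following is what I have in mind. It is purely hypothetical; no such hook exists in the library today:

// Purely hypothetical: a resolver I would hand to the library so that it can
// pick the target type from the two keys instead of reading a stored column.
let resolveCase (partitionKey : string) (rowKey : string) =
    match rowKey with
    | "" -> typeof<ParentEntity>       // the parent row always has an empty row key
    | _  -> typeof<ParentChildEntity>  // child rows use the child number as row key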

Meanwhile I will use the ITableEntity approach as a workaround.

What is your timeline for enhancements generally? Are you actively working on this package currently, or are you busy with other things? Do you plan to add blob storage (e.g. code for splitting a stream into blocks for upload)?

daniel-chambers commented 9 years ago

I don't see the extra column as an issue at all. Table Storage is insanely cheap, so one extra column that you never see or maintain from a code perspective isn't a problem, in my opinion. However, I'll try to keep your request to customise this process in mind when I design the feature; I should be able to design in an extensibility point that lets you replace the default behaviour.

I work on the project outside of work in my free time, so it gets done when it gets done. :) Blob storage support is further off than this particular feature; I haven't even started thinking about a design for blob storage yet.

Thanks for your feedback! :)

Bananas-Are-Yellow commented 9 years ago

You are right, of course: Table Storage is insanely cheap, and maybe there are other things I should care more about. As a point of reference, the commercial application I'm building is a cloud-hosted CAD system for mechanical design. It will be a freemium offering, with perhaps 90% of people using it for free and 10% paying for additional functionality. I have to pay for the data that the free users create, which is why I am being picky about this topic. I may well be worrying unduly.