fsprojects / FSharp.Azure.Storage

F# API for using Microsoft Azure Table Storage service
MIT License

Adding support for serialized properties #14

Open eiriktsarpalis opened 8 years ago

eiriktsarpalis commented 8 years ago

Hi,

Let me start by saying that this is a great library; we are investigating adopting it for the implementation of MBrace.Azure, to replace direct manipulation of the cumbersome TableEntity API.

I was wondering if it would be possible to add support for serialized properties. For instance

Config.serializer <- JsonSerializer() // register some serializer implementation in global state

type Record =
    {  [<PartitionKey>] PartitionKey : string
       [<RowKey>] RowKey : string
       Name : string
       [<Serialized>] Data : (string * int) [] }

which would store the Data property as a JSON string, depending on the specified serializer implementation. Adding this feature comes with the obvious caveat that queries cannot incorporate such properties, but this is not necessarily a problem. Thoughts?

isaacabraham commented 8 years ago

The biggest issue with storing serialized data in Table storage is the restrictions on the maximum size of individual properties, and on maximum row size in general. The library would ideally want to perform checks against both of these limits before doing an insert or update.

The way that MS recommend dealing with such data (particularly if it is large) is to store a pointer in the table to a blob that contains the serialized data.
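
For illustration, a minimal sketch of that pattern, assuming the classic WindowsAzure.Storage blob API and hypothetical names throughout (the application serializes the payload itself, and the table row keeps only the blob name):

open FSharp.Azure.Storage.Table
open Microsoft.WindowsAzure.Storage

// Hypothetical shape: the table row stores a pointer (a blob name),
// not the serialized payload itself
type RecordWithPointer =
    {  [<PartitionKey>] PartitionKey : string
       [<RowKey>] RowKey : string
       Name : string
       DataBlobName : string }

// Write the serialized payload to the blob that the table row points at
let savePayload (account : CloudStorageAccount) (blobName : string) (json : string) =
    let container = account.CreateCloudBlobClient().GetContainerReference "payloads"
    container.CreateIfNotExists() |> ignore
    container.GetBlockBlobReference(blobName).UploadText json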

eiriktsarpalis commented 8 years ago

True, this bears the assumption that the serialization is small enough to fit in the table. Obviously, the same argument could be made about strings, which can be arbitrarily large.

daniel-chambers commented 8 years ago

I'm agreeable to the idea of supporting some way of serializing non-standard types. As for worrying about the size of the serialized data, I tend to concur with @eiriktsarpalis: the library would handle that the same way it handles large strings, by expecting the user to be responsible (though admittedly the risk of overflowing is greater). Implementing an abstraction for externalizing the data into blob storage opens a very large can of worms, and my gut feel is that it would end up being quite a leaky abstraction, which is probably better off handled explicitly in application code.

I'm interested in use cases around how serializers would be configured. The example in the OP configures a single global serializer for all types. This seems too restrictive... in my head I'm anticipating the need for different serializers for different types. Any thoughts on this front?

eiriktsarpalis commented 8 years ago

Serialization configuration could possibly be encapsulated in a session object; one that also includes the TableClient instance.

Another design direction would be to have two serializer attributes/interfaces, one for binary and one for text-based serialization. The former is obviously better suited for size efficiency, the latter for readability/debugging. For example:

type ITextSerializer =
    abstract Serialize : 'T -> string
    abstract Deserialize : string -> 'T

type IBinarySerializer =
    abstract Serialize : 'T -> byte []
    abstract Deserialize : byte [] -> 'T

type Record =
    {
        [<TextSerialized>] Metadata : (string * int) []
        [<BinarySerialized>] Type : System.Type
    }

As far as serialization sizes are concerned, it is easy for the library to check that the pickles are within the row limit.
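
For what it's worth, the limits are documented (64 KB per property value, 1 MB per entity), so the check could be a sketch along these lines (names hypothetical):

// Hypothetical pre-insert check against the documented Azure Table limits:
// 64 KB per property value, 1 MB for the whole entity
let maxPropertyBytes = 64 * 1024
let maxEntityBytes = 1024 * 1024

let validatePickles (pickles : (string * byte []) list) =
    for name, bytes in pickles do
        if bytes.Length > maxPropertyBytes then
            failwithf "Property '%s' is %d bytes; the per-property limit is %d" name bytes.Length maxPropertyBytes
    let total = pickles |> List.sumBy (fun (_, bytes) -> bytes.Length)
    if total > maxEntityBytes then
        failwithf "Entity payload is %d bytes; the entity limit is %d" total maxEntityBytes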

eiriktsarpalis commented 8 years ago

Btw, I am very keen to actively contribute these changes if you like.

daniel-chambers commented 8 years ago

The attribute idea is good, however I don't see the need for binary/text interfaces. I think we only need a single interface like this:

type ICustomSerializer =    
    abstract Serialize : 'T -> obj
    abstract Deserialize : obj -> 'T

type Record =
    {
        [<CustomSerializer(typeof<MyTextSerializer>)>] Metadata : (string * int) []
        [<CustomSerializer(typeof<MyBinarySerializer>)>] Type : System.Type
    }

Internally, when we're handed a property's value by the Azure SDK, we get it as obj and deal with it; similarly, when we write, we give it to the SDK as obj. So if a serializer is jacked in front of/behind that process and marshals from 'T to obj and back again, I reckon it could slot in pretty cleanly. Thoughts?
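
To make that concrete, a hypothetical JSON-backed implementation of the above interface might look like this (Newtonsoft.Json assumed; MyTextSerializer is just the name from the example above):

open Newtonsoft.Json

// The interface from the example above
type ICustomSerializer =
    abstract Serialize : 'T -> obj
    abstract Deserialize : obj -> 'T

// Hypothetical implementation: marshals any 'T to a JSON string
// (handed to the SDK as obj) and back again
type MyTextSerializer() =
    interface ICustomSerializer with
        member __.Serialize (value : 'T) = JsonConvert.SerializeObject value |> box
        member __.Deserialize (value : obj) : 'T =
            JsonConvert.DeserializeObject<'T> (value :?> string)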

Once we've nutted out a good surface for this, I'm very happy to guide you and accept tested pull requests if you're raring to implement this ASAP. :)

As for serialization sizes, yes, we could measure each property's size. However, I imagine table storage itself will reject properties that are too large; is it worth writing our own code to measure this?

isaacabraham commented 8 years ago

I think as long as you come back with a reasonable error message it's ok - you know as well as I do that Azure table errors are awful by default, particularly when occurring as part of a batch.

The preferable solution would be to check before sending the data to write, I suppose - you'd need to check individual property sizes and ideally the entire row size as well.

eiriktsarpalis commented 8 years ago

@daniel-chambers I like it, it's much simpler that way. It's also easier to create serializer-specific attributes by inheriting the CustomSerializer attribute.
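
Something like this hypothetical pair, say:

// Hypothetical base attribute from the example above, carrying the serializer type
type CustomSerializerAttribute(serializerType : System.Type) =
    inherit System.Attribute()
    member __.SerializerType = serializerType

// A serializer-specific shorthand: [<JsonSerialized>] instead of
// [<CustomSerializer(typeof<MyTextSerializer>)>]
type JsonSerializedAttribute() =
    inherit CustomSerializerAttribute(typeof<MyTextSerializer>)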

daniel-chambers commented 8 years ago

@isaacabraham It's probably worthwhile investigating what exactly table storage returns when properties are too big; if we can leverage that for better errors, great - that'll save us the effort of maintaining the limits inside the library and performing the computation against all properties every time. (What if they increase the limits in the future, etc.?)

daniel-chambers commented 8 years ago

I think we should be able to support serialized properties as values in queries too.

Query.all<Test>
|> Query.where <@ fun t s -> t.SerializedProperty = (1, "test") @>
|> fromSomeTable

The above should be okay, since the value of the comparison expression can be evaluated and serialized, and the serialized value then used as the value in the conditional. However, drilling into a serializable property would not be supported at all:

Query.all<Test>
|> Query.where <@ fun t s -> t.OtherSerializedProperty.Prop = "no!" @>
|> fromSomeTable

In other words, comparisons of entire serializable properties against values should be okay, I reckon, but the property side of the comparison operation must only ever select a property directly on the record type itself; no drilling in allowed.

daniel-chambers commented 8 years ago

A use case for the above is if you want a custom serializer that transforms, for example, a long into a zero-padded string representation of long.MaxValue - myLong (for reverse sorting). That serializes cleanly back and forth to that string and should be transparently usable in queries.
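
For instance, a sketch of that serializer, assuming the ICustomSerializer surface above and non-negative longs (long.MaxValue - myLong overflows for negative input):

// Hypothetical order-reversing serializer: maps a non-negative int64 to a
// zero-padded decimal string of Int64.MaxValue - value, so lexicographic
// order on the stored strings is the reverse of numeric order
type ReverseLongSerializer() =
    interface ICustomSerializer with
        member __.Serialize (value : 'T) =
            let v = unbox<int64> (box value)
            box ((System.Int64.MaxValue - v).ToString "D19")
        member __.Deserialize (value : obj) : 'T =
            unbox (box (System.Int64.MaxValue - int64 (value :?> string)))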

I don't imagine people would use it where they're serializing objects into JSON, though it probably would work to an extent.

isaacabraham commented 8 years ago

Agree re: checking the errors Azure returns etc. - I just remember that they are a pain to deal with, particularly in batches, where the errors are not great.

eiriktsarpalis commented 8 years ago

@daniel-chambers I guess equality-based queries should be ok, but in order to be able to perform comparisons the serializations need to preserve ordering. In other words, a <= b if and only if serialization(a) <= serialization(b). This is often not the case. Consider for instance the string representation of ints, where "-1" < "-2" and "2" > "1000".
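
Both counterexamples are easy to confirm with F#'s default (ordinal) string comparison:

compare "-1" "-2" < 0   // true, even though -1 > -2 numerically
compare "2" "1000" > 0  // true, even though 2 < 1000 numerically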

daniel-chambers commented 8 years ago

That's a good point. My thought on this is that serializer authors can ensure that themselves. Perhaps we could make them put an attribute or a property on the serializer to declare that they support order comparisons. If that declaration is not present, we can block <, > etc. with a helpful exception.

eiriktsarpalis commented 8 years ago

The more I think about this, the more I find exposing such functionality to be problematic. In certain cases, you cannot even assume that serializations preserve equality. Consider objects that use reference equality, or even object graphs in which the same instance recurs. For example:

let a = let x = [1] in (x,x)
let b = ([1], [1])

Clearly, a = b as far as .NET is concerned; their serializations, however, tell a different story:

let pickle x =
    use m = new System.IO.MemoryStream()
    // NetDataContractSerializer preserves object references, so the shared
    // list in a yields a different payload than the two separate lists in b
    let ndc = new System.Runtime.Serialization.NetDataContractSerializer()
    ndc.Serialize(m, x)
    m.ToArray()

pickle a = pickle b // false

I think that any user would be inclined to believe that language-integrated queries ought to reflect the equality and comparison semantics of the materialized .NET object. If this is not the case, confusion almost certainly ensues.

daniel-chambers commented 8 years ago

I see your point that it's dangerous to assume that the serializer generates stable values for comparison purposes. But I'm loath to say that as soon as a serializer is involved, you can no longer query against that property. It's possible to create serializers that are stable (and whose values can be compared).

I think a fair compromise is to block comparison and equality of serialized properties by default, unless the serializer declares support for those two operations (separately), in which case the library can trust that the serializer implementer knows what they're doing. This way we stop people from experiencing unexpected behaviour unless they opt in.
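
For example, something along these lines (member names hypothetical):

// Hypothetical extension of the proposed interface: the library would block
// = and <> unless SupportsEquality is true, and < <= > >= unless
// SupportsComparison is true
type ICustomSerializer =
    abstract Serialize : 'T -> obj
    abstract Deserialize : obj -> 'T
    abstract SupportsEquality : bool
    abstract SupportsComparison : bool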

eiriktsarpalis commented 8 years ago

True. In that case I think the best approach is to use the stable representation directly as the record field's type.

daniel-chambers commented 8 years ago

Are you saying you reckon it'd be better to just not support serialized properties in queries, and that if someone wants that, they need to do it manually rather than use serializers?

eiriktsarpalis commented 8 years ago

Yeah, I think this would enforce a degree of sanity.

daniel-chambers commented 8 years ago

Hmm, well I guess we can leave it out of the first version, and if we find that people have good use cases for it later, we can add it without breaking changes; so that's okay. If nobody complains, then probably nobody wanted it anyway :)