calrissian / mango

Common utilities for rapid application development
Apache License 2.0

"Short Alias" for type encoders #193

Open cjnolet opened 8 years ago

cjnolet commented 8 years ago

The alias for the type encoders takes up quite a large footprint when persisted to disk. Each time I need to encode an integer, I basically need to write out "integer" which is 7 bytes. I propose we create a "short alias" for each encoder that returns a single byte.

eawagner commented 8 years ago

I don't know if this would be a good idea to bake into the current API.

We have done this on other projects like RYA, which uses an unencoded integer to represent types. On another of our projects we needed to use XML schema identifiers for data types (much larger aliases). In each case we just had a utility that created a wrapper for the type encoders, overriding the value returned for the alias, and then created a TypeRegistry in which all the encoders were wrapped.

private static <T, U> TypeEncoder<T, U> changeAlias(final TypeEncoder<T, U> encoder, final String alias) {
    return new TypeEncoder<T, U>() {
        @Override
        public String getAlias() {
            return alias;
        }

        @Override
        public Class<T> resolves() {
            return encoder.resolves();
        }

        @Override
        public U encode(T value) {
            return encoder.encode(value);
        }

        @Override
        public T decode(U value) {
            return encoder.decode(value);
        }
    };
}

In this case you can just make the alias "\u0000" to specify the equivalent for an integer value of 0.
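As a minimal sketch of that suggestion: the `TypeEncoder` interface below is a hypothetical stand-in reduced to the four methods used by the wrapper, and `integerEncoder()` and `aliasBytes` are illustrative helpers, not mango's actual implementations. It only shows that wrapping swaps the alias while encoding and decoding are delegated unchanged.

```java
import java.nio.charset.StandardCharsets;

final class ShortAliasSketch {

    // Illustrative integer encoder (an assumption, not mango's implementation).
    static TypeEncoder<Integer, String> integerEncoder() {
        return new TypeEncoder<Integer, String>() {
            @Override public String getAlias() { return "integer"; }
            @Override public Class<Integer> resolves() { return Integer.class; }
            @Override public String encode(Integer value) { return Integer.toString(value); }
            @Override public Integer decode(String value) { return Integer.valueOf(value); }
        };
    }

    // The wrapper from the comment above: only the alias changes.
    static <T, U> TypeEncoder<T, U> changeAlias(final TypeEncoder<T, U> encoder, final String alias) {
        return new TypeEncoder<T, U>() {
            @Override public String getAlias() { return alias; }
            @Override public Class<T> resolves() { return encoder.resolves(); }
            @Override public U encode(T value) { return encoder.encode(value); }
            @Override public T decode(U value) { return encoder.decode(value); }
        };
    }

    // Size of the alias once written out as UTF-8.
    static int aliasBytes(TypeEncoder<?, ?> encoder) {
        return encoder.getAlias().getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        TypeEncoder<Integer, String> original = integerEncoder();
        TypeEncoder<Integer, String> compact = changeAlias(original, "\u0000");

        System.out.println(aliasBytes(original));               // prints 7
        System.out.println(aliasBytes(compact));                // prints 1
        System.out.println(compact.decode(compact.encode(42))); // prints 42
    }
}

// Hypothetical stand-in for mango's TypeEncoder, reduced to the four methods
// that appear in the wrapper above.
interface TypeEncoder<T, U> {
    String getAlias();
    Class<T> resolves();
    U encode(T value);
    T decode(U value);
}
```

The byte counts assume the alias is persisted as UTF-8; a store that writes the alias differently would see different sizes.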

cjnolet commented 8 years ago

That's quite a lot of work just to provide a minimized type name... I've gotta wrap the entire API and every single type encoder in that case, rather than just adding something new to each encoder to minimize the aliases.

cjnolet commented 8 years ago

I think we should add wrapped encoders for this. The reason we generally need to use these is so that we can persist the type name somewhere... namely Accumulo according to RYA and Accumulo Recipes.

eawagner commented 8 years ago

The reason I would suggest doing this is that there are a lot of reasons to want a different alias: accuracy (XML schema types), conforming to naming standards (Elastic Search schema defs), or better compression (Rya, Recipes Shard Tables). Obviously, we can't meet them all.

The current mechanism is there because it is accurate, not very large, and readable for when that is important. Like you, I have found several instances where the default is not the best solution for a specific problem, but the API provides the means to customize its behavior without complicating it.

It really isn't that much work. The reason I use that utility method is that my type registry definition looks like this for one of our impls:

public static final TypeRegistry<String> MY_TYPES = new TypeRegistry<String>(
        changeAlias(booleanEncoder(), BOOLEAN_ALIAS),
        changeAlias(byteEncoder(), BYTE_ALIAS),
        changeAlias(doubleEncoder(), DOUBLE_ALIAS),
        changeAlias(floatEncoder(), FLOAT_ALIAS),
        changeAlias(integerEncoder(), INTEGER_ALIAS),
        changeAlias(longEncoder(), LONG_ALIAS),
        changeAlias(stringEncoder(), STRING_ALIAS),
        changeAlias(uriEncoder(), URI_ALIAS)
);

Mostly the question becomes how many variations of aliases we support. Currently we indirectly support two: the current alias, and the actual Class, which also has to be unique. Theoretically, you could just use the class as the alias, but the getAlias method is there to allow anyone to customize that behavior to their specific needs.

cjnolet commented 8 years ago

The main reason I don't want this to be a separate change in a separate codebase is because fragmenting these classes means that changes to my local copy will break when updates are made in mango (new types supported, etc...). I want to keep them together.

I would have to disagree with you that the current types are "accurate, not very large, and readable". When I chose the names for the current type aliases, I was not thinking about any of those properties, and specifically I was not thinking about space. Having taken that into account, better names probably would have been int, bool, str, etc., as those are roughly half the size of the current ones.
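As a rough check of the sizes involved, here is a small sketch that compares alias lengths in UTF-8 bytes. Only "integer" is confirmed as a current alias earlier in this thread; "boolean" and "string" are assumed here for illustration, and `AliasSizeSketch` is a hypothetical helper, not part of mango.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class AliasSizeSketch {

    // Number of bytes the alias occupies when written as UTF-8.
    static int utf8Length(String alias) {
        return alias.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes saved per persisted value when longAlias is replaced by shortAlias.
    static int bytesSaved(String longAlias, String shortAlias) {
        return utf8Length(longAlias) - utf8Length(shortAlias);
    }

    public static void main(String[] args) {
        // Long alias -> proposed short alias. Only "integer" comes from the thread;
        // the other long names are assumptions.
        Map<String, String> candidates = new LinkedHashMap<>();
        candidates.put("integer", "int");
        candidates.put("boolean", "bool");
        candidates.put("string", "str");

        for (Map.Entry<String, String> e : candidates.entrySet()) {
            System.out.printf("%s (%d bytes) -> %s (%d bytes), saves %d%n",
                    e.getKey(), utf8Length(e.getKey()),
                    e.getValue(), utf8Length(e.getValue()),
                    bytesSaved(e.getKey(), e.getValue()));
        }
    }
}
```

The saving applies once per persisted value, so it scales with the number of encoded values rather than with the number of types.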

You've mentioned 2 different cases so far that would warrant wanting to have types with different aliases. I propose we support at least those two cases in Mango; that way, when changes are made to the type encoders, we can reflect them in the same codebase and not have users break because they were forced to wrap the classes and don't know about new classes that have been added.

Let's support XML aliases and "minimal" aliases.

eawagner commented 8 years ago

The reason I don't want to support every single one of these use cases is that they are limited, and we are working in an infinite space of what people would want to make their aliases: some standards-based, some for space reasons, some because it matches their UI. I am contending that there is already a mechanism to support this for a given code base. I understand you want to have the latest that mango supports, but it makes things harder to maintain, especially for other users creating their own encoders.

One of the more annoying things with the aliases is that each one has to be unique in a type registry. That is ok, and we guarantee that all the aliases and Classes represented by the encoders are unique. Now if I want to add my own encoder, we would require an implementer to specify a class, alias, minimal alias (maybe as an integer), and XML alias, all of which would need to be unique. They would also need to verify that these are unique across all of mango, including future updates to the default simple and lexicoder type encoders, or they don't get the benefit of just getting the latest and greatest.
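To make the uniqueness constraint concrete, here is a minimal sketch of a registry that rejects duplicate aliases and duplicate classes. `UniqueRegistrySketch`, its `register` method, the `stub` helper, and the reduced `TypeEncoder` interface are all hypothetical and do not reflect how mango's TypeRegistry is actually implemented. Each additional alias kind (minimal, XML) would need one more lookup map and one more check like the ones shown.

```java
import java.util.HashMap;
import java.util.Map;

final class UniqueRegistrySketch<U> {

    private final Map<String, TypeEncoder<?, U>> byAlias = new HashMap<>();
    private final Map<Class<?>, TypeEncoder<?, U>> byClass = new HashMap<>();

    // Registers an encoder, refusing any alias or class that is already taken.
    void register(TypeEncoder<?, U> encoder) {
        if (byAlias.containsKey(encoder.getAlias())) {
            throw new IllegalArgumentException("Duplicate alias: " + encoder.getAlias());
        }
        if (byClass.containsKey(encoder.resolves())) {
            throw new IllegalArgumentException("Duplicate class: " + encoder.resolves().getName());
        }
        byAlias.put(encoder.getAlias(), encoder);
        byClass.put(encoder.resolves(), encoder);
    }

    int size() {
        return byAlias.size();
    }

    // Illustrative encoder that only carries an alias and a class.
    static <T> TypeEncoder<T, String> stub(final String alias, final Class<T> type) {
        return new TypeEncoder<T, String>() {
            @Override public String getAlias() { return alias; }
            @Override public Class<T> resolves() { return type; }
            @Override public String encode(T value) { return String.valueOf(value); }
            @Override public T decode(String value) {
                throw new UnsupportedOperationException("decode is not part of this sketch");
            }
        };
    }

    public static void main(String[] args) {
        UniqueRegistrySketch<String> registry = new UniqueRegistrySketch<>();
        registry.register(stub("integer", Integer.class));
        registry.register(stub("long", Long.class));
        try {
            // Same alias, different class: rejected.
            registry.register(stub("integer", Short.class));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints "Duplicate alias: integer"
        }
    }
}

// Hypothetical stand-in for mango's TypeEncoder, reduced to four methods.
interface TypeEncoder<T, U> {
    String getAlias();
    Class<T> resolves();
    U encode(T value);
    T decode(U value);
}
```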

I do understand wanting to update to the latest and greatest mango type encoders, but it would complicate both the TypeEncoder interface and the TypeRegistry API to support all of these use cases. For instance, currently there is only one decode method based on alias and one method that will look up the alias for a class.

public String getAlias(Object obj) {...}
public Object decode(String alias, U value) {...}

Supporting other alias mechanisms means we would now need to expose these methods:

public String getAlias(Object obj) {...}
public String getMinimalAlias(Object obj) {...}
public String getXMLSchema(Object obj) {...}
public Object decodeFromAlias(String alias, U value) {...}
public Object decodeFromMinimalAlias(String alias, U value) {...}
public Object decodeFromXMLSchema(String alias, U value) {...}

Also, every single implementer of a custom type encoder would have to define each of these methods on the type encoder interface. I would assume that for XMLSchema most people will just make something up anyway, since it makes no sense for their types (e.g. none is defined for ipv6). This just makes the API more complicated to use. I doubt most people would even know why they would need a minimal alias vs. just the simple alias.

All I am getting at is that there is already a relatively easy means to change the aliases. Yes, that means that you might not get the automatic goodness of using the simple type registries out of the box, but if you are worried about space, you should look at not only the alias size, but also the encoding size (see #137) which would require writing a whole new set of encoder implementations anyway.

eawagner commented 8 years ago

BTW, per my last sentence in the previous comment: I am for potentially writing a more compact set of encoders, with more compact aliases and encodings, but not for modifying the existing API.