elodina / go-avro

Apache Avro for Golang
http://elodina.github.io/go-avro/
Apache License 2.0

Proposed API updates/breakages #48

Open crast opened 9 years ago

crast commented 9 years ago

Since you're talking about having master be an API break in #47, I would like to collect a few changes to make this library more idiomatic Go:

Nearly all of these changes, while technically breaking, shouldn't break the vast majority of user code, because most users aren't manipulating schema types or embedding the BinaryEncoder type; they just want to encode to and decode from Avro.

I'm happy to submit PRs for any and all of these changes if you approve of them.

serejja commented 9 years ago

Hi @crast,

First of all, I'd be happy to accept PRs of this kind that stabilize the API from the user's perspective. A couple of notes regarding your points:

  1. Yes, I think we could make BinaryEncoder/Decoder unexported. When I wrote them I wasn't thinking much about whether they should stay exported, so I simply exposed them in case anyone wanted to use them directly. NewBinaryEncoder/Decoder should still be exported though, as they are used in a couple of places.
  2. Totally agree, I already saw you implemented #50 and it looks good to me.
  3. I'm not sure I fully understand the motivation behind this change, so some additional info would be helpful.

One more note: you mentioned semantic versioning with gopkg.in in #47, and I think it's totally OK to go with this approach. However, I think the current state of things deserves to be version 0 with possible API breakages. Once we agree on all the changes described in this issue and #47, we can release v1 and make sure we never make a breaking change without bumping the library version again.

One last thing: I'm going to merge #49 right now; it's a 100% good change. Any additional PRs/issues from you are more than welcome.

Thanks!

crast commented 9 years ago

Thanks for your reply and for merging the PRs.


For point 1: yes, I'd keep NewBinaryEncoder and NewBinaryDecoder exported so people can keep using them; they'd just now return an interface value instead of a concrete type.
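Roughly what I have in mind, as a minimal sketch (the Encoder method set here is abridged and illustrative, not the full current interface):

```go
package avro

import "io"

// Encoder is the exported interface callers program against.
// (Method set abridged for illustration.)
type Encoder interface {
	WriteInt(int32)
	WriteString(string)
}

// binaryEncoder is the now-unexported concrete implementation.
type binaryEncoder struct {
	w io.Writer
}

// NewBinaryEncoder stays exported, but returns the interface
// instead of the concrete type.
func NewBinaryEncoder(w io.Writer) Encoder {
	return &binaryEncoder{w: w}
}

func (e *binaryEncoder) WriteInt(v int32)     { /* zig-zag varint write elided */ }
func (e *binaryEncoder) WriteString(s string) { /* length prefix + bytes elided */ }
```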


For point 3: the chain of logic that led me to this suggestion goes something like this:

  1. Firstly (not really a technical reason), the concrete types for all these schemas are a lot of clutter in the base package's godoc. Moving them to a subpackage keeps them documented, but most people don't work directly with the underlying objects like RecordSchema et al.; they parse a schema into a Schema interface and pass that interface to a decoder/encoder. They usually don't care what's inside; the power users who do can find the concrete types in their own subpackage.
  2. For what it's worth, right now the datum writers and readers do things like type-assert the *RecordSchema concrete type, and similarly for a few others like *RecursiveSchema.
    • This means someone can't just pass an object that happens to implement the Schema interface and expect it to decode; it will probably panic somewhere along the chain.
    • I'm not proposing to change this per se, but if it were a priority it could be handled with superinterfaces: RecordSchema would satisfy an interface with methods to get the underlying list of schema fields, and the readers/writers would assert that interface instead of the concrete type. It would mean creating a handful of new interfaces, but they could be public, documented interfaces (a small sketch of this follows the list).
  3. I see a future where the next avenue to a major decode/encode speedup is building an 'execution plan' for a specific RecordSchema and reflect.Type combination to decode into (see the second sketch after this list).
    • The stdlib encoding/json encoder follows this kind of pattern, building an encoder the first time it sees a new type and caching it.
    • Basically, for every decode into a record, the plan would be a list of function pointers plus other useful metadata, so you save O(n) map lookups in findField and O(n) trips through a switch-statement dispatch per record.
    • The problem I see with getting there right now is that concrete types like RecordSchema are mutable; you can add/remove/modify fields at any time. That makes caching by schema harder: you'd have to walk and hash all the internals every time, which might cost more than the execution plan saves.
    • The quickest solution I see, without changing the ability to build or modify schemas in code, is an 'immutable' schema wrapper that recursively copies a schema into private fields and returns that frozen schema. Something like a function FreezeSchema(s Schema) Schema (or maybe CompileSchema) that returns a special frozen schema; anyone who uses it gets the benefit of the cache for very fast encode/decode.
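To illustrate the superinterface idea from point 2, here's a rough sketch. The FieldedSchema name and the abridged Schema/SchemaField stand-ins are hypothetical; they only approximate the package's real types:

```go
package avro

// Abridged stand-ins for the package's existing Schema interface and
// SchemaField struct, just so this sketch is self-contained.
type Schema interface {
	Type() int
	GetName() string
}

type SchemaField struct {
	Name string
	Type Schema
}

// FieldedSchema is the hypothetical superinterface a datum reader would
// assert instead of the concrete *RecordSchema type.
type FieldedSchema interface {
	Schema
	Fields() []*SchemaField
}

// recordFields shows the assertion the reader would perform: any value that
// exposes its fields can participate in decoding, not just *RecordSchema.
func recordFields(s Schema) ([]*SchemaField, bool) {
	rs, ok := s.(FieldedSchema)
	if !ok {
		return nil, false
	}
	return rs.Fields(), true
}
```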
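And a rough sketch of the FreezeSchema idea plus the plan cache it would enable, continuing the abridged stand-ins from the sketch above; all names and signatures here are hypothetical:

```go
package avro

import (
	"reflect"
	"sync"
)

// Decoder stands in for the package's binary decoder; its methods are elided.
type Decoder interface{}

// frozenSchema wraps a schema that has been recursively copied into private
// state, so its pointer identity can safely serve as a cache key.
type frozenSchema struct {
	Schema
}

// FreezeSchema deep-copies s and returns an immutable wrapper. The deep copy
// itself is elided; the point is that the result is never mutated again.
func FreezeSchema(s Schema) Schema {
	return &frozenSchema{Schema: deepCopySchema(s)}
}

func deepCopySchema(s Schema) Schema { return s } // placeholder for the real recursive copy

// decodePlan is the precomputed "execution plan" for decoding one record
// schema into one concrete Go type: one step per field, in declaration order,
// so the per-record work skips findField map lookups and switch dispatch.
type decodePlan struct {
	steps []func(dst reflect.Value, dec Decoder) error
}

// planKey pairs a frozen schema with the Go type being decoded into.
type planKey struct {
	schema Schema
	target reflect.Type
}

var planCache sync.Map // planKey -> *decodePlan

// planFor returns the cached plan for (schema, target), building it at most once.
func planFor(schema Schema, target reflect.Type) *decodePlan {
	key := planKey{schema: schema, target: target}
	if p, ok := planCache.Load(key); ok {
		return p.(*decodePlan)
	}
	p, _ := planCache.LoadOrStore(key, buildPlan(schema, target))
	return p.(*decodePlan)
}

func buildPlan(schema Schema, target reflect.Type) *decodePlan {
	return &decodePlan{} // walking the fields and resolving struct offsets is elided
}
```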

To your final note, yeah I think 0.x having a series of API breaks before 1.0 makes sense.

serejja commented 9 years ago
  1. This seems to be the easiest thing to do. If you feel that having schema implementations in a subpackage is a good idea and would simplify someone's life, let's go for it.
  2. I actually can't imagine right now why anyone would implement custom schemas that are not part of the Avro specification. However, making the datum writers/readers not depend on concrete schema implementations, with the ability to just say "hey schema, go encode/decode yourself", certainly sounds like a good idea.
  3. We had a case where we had to modify a schema (e.g. receive some data, decode it, update the schema to add some metadata, and then pass it further down the chain), so I think we should definitely let schemas stay mutable/updatable in some way. I personally like the last approach with an immutable wrapper, but I feel I should also ask the other collaborators what they think. I'll get back to this once more today.

crast commented 9 years ago

For 2) Not a custom schema per se, just a type that happens to provide the facilities of RecordSchema plus some added functionality. This could, for example, solve the situation for the person who wanted to store additional properties in the schema as an interface{}. A rough sketch of what I mean follows.
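As a quick illustration (reusing the hypothetical FieldedSchema from the earlier sketch; PropsRecordSchema and Props are made-up names):

```go
package avro

// PropsRecordSchema is a hypothetical user-defined type: it behaves like a
// record schema (it satisfies FieldedSchema via embedding) but also carries
// arbitrary extra properties alongside it.
type PropsRecordSchema struct {
	FieldedSchema                        // the wrapped record schema
	Props         map[string]interface{} // user-defined metadata, e.g. extra JSON properties
}
```

As long as the datum readers/writers assert FieldedSchema rather than *RecordSchema, a value like this could be passed anywhere a record schema is expected and still encode/decode.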

For 3) I think it's valuable for the schema types to remain mutable for code generation, schema building/conversion, and other scenarios. For example, someone might take a list of JSON entities, or records in a CSV file, and convert them to Avro without knowing the schema ahead of time. But this is probably not the most common use case.

In many common use cases, people have a schema they already know and want to encode/decode with it thousands or millions of times, fast. An immutable wrapper factory would make it possible to cache plans for those who want the speedup, and nobody would be forced to use it.

I'm only conjecturing that such an execution-plan-based approach would speed things up; it remains to be seen, and would need a trial run. If it's only a small speedup, it's naturally not worth it.

serejja commented 9 years ago

OK, the second case is much clearer to me now, but this description definitely involves making the datum readers/writers independent of schema implementations, i.e. a schema should know how to decode/encode itself properly. Otherwise it seems like a cool feature to me.

Regarding the third case, I agree that the most common use case is to decode/encode lots of records as fast as possible. I think your proposed approach is definitely worth trying.

I'm not sure I can devote time to help implement this right now because I have lots of other things to do, but if you can propose some kind of roadmap to get there, I will definitely participate.

Thanks!