flipp-oss / deimos

Framework to work with Kafka, Avro and ActiveRecord
Other
59 stars 22 forks source link

Sub-schemas can conflict with each other #153

Closed iMacTia closed 2 years ago

iMacTia commented 2 years ago

I'm trying to understand what's the best way to manage schemas with similar/identical sub-schemas. For example, imagine to have 2 (or more!) separate schemas, both of which use an enum like the following:

{
  "doc": "The unit of the search_radius",
  "name": "search_radius_unit",
  "type": [
    "null",
    {
      "name": "radius_unit",
      "symbols": ["km", "mi"],
      "type": "enum"
    }
  ]
}

Now if I run bundle exec rake deimos:generate_schema_classes, I'll get a different class per schema, each including the sub-schema for the enum:

# Autogenerated Schema for Enum at com.indeed.flex.workers.events.activity.radius_unit
class RadiusUnit < Deimos::SchemaClass::Enum
  # @return ['km', 'mi']
  attr_accessor :radius_unit

  # :nodoc:
  def initialize(radius_unit)
    super
    self.radius_unit = radius_unit
  end

  # @override
  def symbols
    %w(km mi)
  end

  # @override
  def to_h
    @radius_unit
  end
end

Since all schemas are created inside the same namespace (Schemas), that means the two auto-generated Enum schema classes will have the exact Ruby class path (Schemas::RadiusUnit). Effectively, this is causing one to reopen and override the other when Ruby loads both classes.

That is already an issue by itself (needless repetition), but it could get even worse if you consider the case where the two enums are called the same, but not exactly the same. Imagine the second schema has a different symbols value, like "symbols": ["kilometers", "miles"].

Now we end up with 2 ruby classes called exactly the same, but with a different implementation of the symbols method. And since the last loaded class overrides the former(s), that means one of the two classes is now inconsistent with the corresponding schema, causing it to prevent sending messages.

cc @DeeChau

dorner commented 2 years ago

Hmm... I feel like this is something that probably should be configurable. I can see there being value in having a top-level enum constant (e.g. when you know that the enums actually are identical). On the other hand, when they aren't identical, it might be better to generate the enum class inside the top-level class instead and therefore it would be namespaced (so you'd get Schema1::RadiusUnit and Schema2::RadiusUnit.

The problem is that it's kind of hard to configure this behavior on a per-schema basis. Probably to start with we'd want it to be either on or off.

iMacTia commented 2 years ago

I totally appreciate this is a complex issue to solve, and I'm happy to help brainstorm about it. Below I'm pouring my thoughts to spark some discussion about possible ways forward.

The current solution still generates duplicated classes for each time the enum is used, but it also introduces an issue when the classes are different despite being called the same way. Moreover, this is also an issue with top-level classes, as nothing technically stops you from having two schemas called the same, as far as they're under separate namespaces. The principle of least surprise would advocate for defaulting to the namespaced solution, as that would avoid the case where a naming conflict ends up causing an unexpected class override (which would be hard to spot and would be allowed by the Ruby interpreter despite causing issues down the line).

That said, I also understand two possible holdbacks here:

  1. Schema namespaces are actually long by convention (e.g. com.company.sub-company.namespace.class) and having this level of nesting on a Ruby class can be annoying
  2. Backwards compatibility: if we change the default, it would make the update harder.

A possible solution to (1) could be to allow for a common_prefix config, which allows to exclude part of the namespace from the generated classes. So for example, com.company.sub-company.namespace.my-schema today would generate a Schemas::MySchema class, but by configuring common_prefix: 'com.company.sub-company' the generated class would be Schemas::Namespace::MySchema or Namespace::Schemas::MySchema.

Issue (2) is a different issue, but would need to be addressed only if/when we decide to make this new configuration the default, which I suggest doing but would definitely require a v2.0 release. The configuration can be added as non-default in 1.x versions

dorner commented 2 years ago

I like this idea! I'm wondering if we can even make it relatively smarter and just automatically clip everything but the last part of the namespace. At least the way I've been using schemas, it's almost always the last part that's important. But I'm good with the idea of a common prefix - or even a namespace mapping, where you could do something like this:

config.schema_namespace_map = {
  'com.company.subcompany.namespace' => 'Namespace',
  'com.company.subcompany2.namespace' => 'SubCompany2',
  'com.whatever' => nil # no namespace
}
iMacTia commented 2 years ago

We could take a convention-over-configuration approach (latest config override the previous ones):

  1. Strip the whole namespace except for the last bit by default (convention)
  2. Allow to set a common_prefix option
  3. Allow to override the default with config.schema_namespace_map

And this behaviour would be disabled by default, and activated by setting config.schema.namespace_generated_class = true. Then in a future major release you may decide to make this the new default.

If that sounds good, we could have this split into 2 or even three steps, to avoid dropping too many changes at once. I can see points 1 and 2 being done first (together or individually), and then step 3 (namespace mapping) can be introduced later down the line based on feedback.

How does that sound?

dorner commented 2 years ago

That sounds great to me! I agree with the approach.

dorner commented 2 years ago

@iMacTia have you had a chance to look into this yet? We're starting to use this feature more and definitely feeling the pain of how we have it set up now. :) I can grab this but wanted to know if you'd started work yet.

iMacTia commented 2 years ago

@dorner this is still sitting on my todo list, wrestling with a bunch of other stuff I need to prioritize. I need to drive the acquisition of Kafka across the company before I can justify spending more time on this. Although I'd love to work on this, I can't really say when that will happen, so I'm happy with you tackling this if you have some time to work on it 😃

dorner commented 2 years ago

Fixed with #154.