TresAmigosSD / SMV

Spark Modularized View
Apache License 2.0

provide something similar to Java Service Provider API in SMV #1498

Closed AliTajeldin closed 5 years ago

AliTajeldin commented 5 years ago

Problem

We have now run into several problems whose solutions require a way to "register" a set of classes that provide a type of service, and then to query for the list of all such classes. For example, for connection classes, we would like to know all the available connection classes so that we can map the user-specified type to the actual class.

Solution

We need a lightweight solution that is embedded within SMV (so no distributed Consul, ZooKeeper, or other large-scale external systems). The proposal is to use Python decorators (https://realpython.com/primer-on-python-decorators/#decorating-classes) to let users easily declare providers and their associated metadata.

Decorator

SMV will provide the SmvProvider class decorator that users can utilize to declare their provider classes/interfaces. For example:

@SmvProvider("conn", "jdbc")
class SmvJdbcConnectionInfo(SmvConnectionInfo):
    def attributes(self):
        return ["url", ...]

    @abstractmethod  # from abc import abstractmethod
    def url(self):
        pass

SmvProvider takes exactly two arguments: the type of the service being provided and the name of this provider. In the example above, the type is conn and the name is jdbc. Users would be able to ask SMV for the list of all known conn providers, and so on.

Note: for the registration to work, the above code must be "executed", i.e. the enclosing Python module must be imported somewhere. The easiest way to do this is to list all the provider files in a single "index" file and import that index file to force the import of all the provider files. SMV will do that automatically for all its known providers (e.g. SmvHdfsDirConnection).
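A minimal sketch of how such a registering class decorator could work. The `_REGISTRY` dict and the `providers_of_type` helper are hypothetical names introduced for illustration; they are not part of SMV, and the class bodies are left empty.

```python
from collections import defaultdict

# Hypothetical registry: provider type -> {provider name -> class}.
_REGISTRY = defaultdict(dict)

def SmvProvider(provider_type, provider_name):
    """Class decorator that records the class under (type, name)."""
    def register(cls):
        _REGISTRY[provider_type][provider_name] = cls
        return cls  # hand the class back unchanged
    return register

def providers_of_type(provider_type):
    """Return a copy of the name -> class mapping for one provider type."""
    return dict(_REGISTRY.get(provider_type, {}))

class SmvConnectionInfo(object):
    pass

@SmvProvider("conn", "jdbc")
class SmvJdbcConnectionInfo(SmvConnectionInfo):
    pass

@SmvProvider("conn", "hdfs")
class SmvHdfsDirConnection(SmvConnectionInfo):
    pass

print(sorted(providers_of_type("conn")))  # ['hdfs', 'jdbc']
```

Registration happens as a side effect of the `class` statement running, which is why the module holding the decorated class has to be imported before the registry is queried.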

Metadata

Rather than burden the provider registry with metadata, we should create a base provider class for each known provider type and declare abstract methods in the base class to force the concrete classes to provide the required metadata. The provider interface should be kept simple and minimal.

For example, assume we have a provider type X that needs a foo attribute for each concrete provider. Instead of adding an extra foo parameter to the SmvProvider decorator, we should create a BaseX class that declares the foo() method as abstract. So:

WRONG

@SmvProvider("X", "myX", foo=15)
class MyX(object):
  ...

CORRECT

class BaseX(ABC):  # from abc import ABC, abstractmethod
    @abstractmethod
    def foo(self):
        pass

@SmvProvider("X", "myX")
class MyX(BaseX):
    def foo(self):
        return 15
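To illustrate why the abstract base class approach "forces" concrete providers to supply metadata, here is a small sketch using the standard `abc` module. It assumes Python 3 and leaves out the SmvProvider decorator; `IncompleteX` is a hypothetical class added only to show the failure case.

```python
from abc import ABC, abstractmethod

class BaseX(ABC):
    @abstractmethod
    def foo(self):
        """Metadata every X provider must supply."""

class MyX(BaseX):
    def foo(self):
        return 15

class IncompleteX(BaseX):
    pass  # forgot to implement foo

print(MyX().foo())  # 15

try:
    IncompleteX()  # abstract method missing, so instantiation fails
except TypeError as err:
    print("rejected:", type(err).__name__)  # rejected: TypeError
```

Note that the check fires when the class is instantiated, not when it is defined.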

TBD

AliTajeldin commented 5 years ago

@ninjapapa @laneb please take a look at the above before I start the impl.

laneb commented 5 years ago

Couple questions:

AliTajeldin commented 5 years ago

After some discussions with @laneb and @ninjapapa , we will utilize class hierarchy and code inspection to allow users to declare providers of various types rather than use the decorator pattern.

class diagram: [image not available]

provider type

Each provider class in the hierarchy must provide a provider_type method that returns the provider type. The fully qualified name (fqn) of a provider type is the concatenation of the provider types along the parent hierarchy of that provider.

provider query

Users will be able to call SmvProvider.get_providers_by_prefix() to get all known providers whose provider type fqn matches the prefix. For example, to get the list of all connection providers, a user would call get_providers_by_prefix("conn."). If we are interested only in spark model providers, we would call get_providers_by_prefix("model.spark_ml.").

get_providers_by_prefix will return a dictionary of <fqn, provider_klass> of all providers that match the prefix.
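The prefix query itself can be sketched as a simple filter over an fqn-to-class mapping. In this sketch the mapping is passed in explicitly, which is an assumption made to keep the example self-contained; the proposal has the method take only the prefix. The three classes and their fqns are stand-ins, not real SMV providers.

```python
def get_providers_by_prefix(providers, prefix):
    """Return {fqn: provider_klass} for every fqn that starts with prefix."""
    return {fqn: klass for fqn, klass in providers.items()
            if fqn.startswith(prefix)}

# Stand-in classes and fqns, for illustration only.
class JdbcConn(object): pass
class HiveConn(object): pass
class SparkLogit(object): pass

known = {
    "conn.jdbc": JdbcConn,
    "conn.hive": HiveConn,
    "model.spark_ml.logit": SparkLogit,
}

print(sorted(get_providers_by_prefix(known, "conn.")))            # ['conn.hive', 'conn.jdbc']
print(sorted(get_providers_by_prefix(known, "model.spark_ml.")))  # ['model.spark_ml.logit']
```

With plain string matching, the trailing dot in the prefix matters: "conn." would not match a provider whose fqn is exactly "conn".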

provider discovery

Source code introspection will be used to discover known providers. Since all provider classes are directly or indirectly derived from SmvProvider, they will all have the IS_PROVIDER marker attribute.

The following directories will be scanned for providers:

Note: To avoid dynamic loading issues, the code scan is performed every time get_providers_by_prefix is called.
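As a rough sketch of the marker-based check, the function below picks out provider classes from a single, already-imported module. `find_providers` is a hypothetical helper name; a real implementation would also have to walk the scanned directories and import the modules it finds, which is not shown here.

```python
import inspect
import types

class SmvProvider(object):
    IS_PROVIDER = True  # marker inherited by every provider class

def find_providers(module):
    """Return the provider classes reachable as attributes of module."""
    found = []
    for _, obj in inspect.getmembers(module, inspect.isclass):
        # Skip the root class itself; keep anything carrying the marker.
        if getattr(obj, "IS_PROVIDER", False) and obj is not SmvProvider:
            found.append(obj)
    return found

# Build a throwaway module so the sketch runs without touching the filesystem.
demo = types.ModuleType("demo_providers")

class JdbcConn(SmvProvider): pass
class NotAProvider(object): pass

demo.JdbcConn = JdbcConn
demo.NotAProvider = NotAProvider

print([k.__name__ for k in find_providers(demo)])  # ['JdbcConn']
```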

Example Providers

class SmvConnInfoProvider(SmvProvider):
  @staticmethod
  def provider_type(): return "conn"

class SmvJdbcConnInfoProvider(SmvConnInfoProvider):
  @staticmethod
  def provider_type(): return "jdbc"

  @staticmethod
  def attrs(): return [ "url", "driver", ... ]

Note: the provider fqn for SmvJdbcConnInfoProvider will be conn.jdbc.
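One way the fqn could be derived from the hierarchy is to walk the method resolution order from the root down and join each class's own provider_type() with dots. This is only a sketch: `provider_type_fqn` is a hypothetical helper name, and the dot separator and the empty root type are assumptions consistent with the conn.jdbc example.

```python
class SmvProvider(object):
    IS_PROVIDER = True

    @staticmethod
    def provider_type():
        return ""  # assumed: the root contributes nothing to the fqn

    @classmethod
    def provider_type_fqn(cls):
        parts = []
        for klass in reversed(cls.__mro__):
            # Only count classes that define provider_type themselves.
            if "provider_type" in vars(klass):
                ptype = klass.provider_type()
                if ptype:
                    parts.append(ptype)
        return ".".join(parts)

class SmvConnInfoProvider(SmvProvider):
    @staticmethod
    def provider_type(): return "conn"

class SmvJdbcConnInfoProvider(SmvConnInfoProvider):
    @staticmethod
    def provider_type(): return "jdbc"

print(SmvJdbcConnInfoProvider.provider_type_fqn())  # conn.jdbc
```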

The provider API is just for the discovery of classes. It attaches no semantic meaning to the classes it finds. In the example above, it is the responsibility of SmvConnInfoProvider to understand that connection classes provide a static attrs method for determining the config parameters of the connection.

In the case of model discovery, the discovered class is just another SmvGenericModule or some derivative thereof. All the normal methods (run, requiresDS, and so on) would be provided by derived classes.
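For the connection case, client code might use a discovered class's attrs() to decide which config values to read. The sketch below is an illustration under assumptions, not SMV's actual client code: `read_conn_config` and the `smv.conn.<name>.<attr>` key layout are hypothetical, and the attribute list is trimmed to two entries.

```python
class SmvJdbcConnInfoProvider(object):
    @staticmethod
    def provider_type(): return "jdbc"

    @staticmethod
    def attrs(): return ["url", "driver"]

def read_conn_config(provider_klass, conn_name, config):
    """Collect the attributes a provider asks for from a flat config dict."""
    prefix = "smv.conn.{}.".format(conn_name)  # hypothetical key layout
    return {attr: config.get(prefix + attr) for attr in provider_klass.attrs()}

config = {
    "smv.conn.mydb.type": "jdbc",
    "smv.conn.mydb.url": "jdbc:postgresql://host/db",
    "smv.conn.mydb.driver": "org.postgresql.Driver",
}

# In practice the class would come from a lookup such as
# get_providers_by_prefix("conn."); here it is referenced directly.
print(read_conn_config(SmvJdbcConnInfoProvider, "mydb", config))
```

The discovery API never looks at attrs(); only the connection-specific code that consumes the discovered class does.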

AliTajeldin commented 5 years ago

@ninjapapa @laneb ready for re-review.

AliTajeldin commented 5 years ago

depends on #1504

laneb commented 5 years ago
ninjapapa commented 5 years ago

LGTM. Just need to add some sample client code.

AliTajeldin commented 5 years ago

@ninjapapa added example code @laneb :

ninjapapa commented 5 years ago

@AliTajeldin will the jdbc attributes (such as url, driver, etc.) be attributes of SmvJdbcConnInfoProvider or a sub class of it?

AliTajeldin commented 5 years ago

@ninjapapa I updated the "example providers" section to make the example a bit clearer. In summary, the provider interface doesn't know anything about the classes it finds, it is only a discovery api. It is up to the discovered classes to implement whatever additional api they need.

In the case of connection providers, they will use the attrs() method to find out what attributes they need to read from config.

In the case of model classes, they are just another SmvGenericModule that the user can create derived classes (or generated by tool) based on attributes or info provided by the provider.

ninjapapa commented 5 years ago

Let's say we introduce a method called list_data_in_conn. Where should that method belong? More specifically, should SmvJdbcConnInfoProvider implement that method?

AliTajeldin commented 5 years ago

Assuming list_data_in_conn needs connection-specific knowledge, then yes, it has to be implemented by SmvJdbcConnInfoProvider. It can be either a static or an instance method, depending on what information it depends on.

ninjapapa commented 5 years ago

Sounds good. Only the naming convention still leaves some room for discussion. On the connection side, since ConnInfos are relatively simple, it is OK for the direct user interface class (e.g. SmvJdbcConnInfoProvider) to be named "Provider". On the module side, however, naming the user interface classes "provider" will be confusing. If we keep some classes in the old convention and call others "Provider", it will be even more confusing. I suggest using "provider" only in the names of the mixins, and never giving user interface classes a "provider" postfix.

AliTajeldin commented 5 years ago

@ninjapapa makes sense 👍