ninjapapa commented 5 years ago

As part of the SmvGenericModule framework, the input and output sides will be redesigned. At a high level, we will introduce the concept of an SmvConnection; an SmvGenericInput will have an SmvConnection and a table/file pointer within that connection. SmvGenericOutput is similar.

This issue is to create the design document.

AliTajeldin commented 5 years ago

Class Diagram

[Image: class diagram]

From a high level, there are three types of modules:

Input Module: has 0 inputs and 1 output, and no run method.
Output Module: has 1 input and 0 outputs, and no run method.
Generic Module: has N inputs and 1 output.

Models

Model and model exec modules would just become generic modules. In the example above, KerasModel would be used for training, and another generic module would simply require the KerasModel module to do the scoring. There is no need for a special model exec module.

Config

Each connection instance would have a unique name for a given project. We would utilize the current SMV config parser to read the connection configuration under the smv.con namespace. For example:

# my jdbc connection
smv.con.myjdbc.type = jdbc
smv.con.myjdbc.url = jdbc:postgresql://localhost:1000/...
smv.con.myjdbc.schema = myschema

# hdfs dir connection
smv.con.datadir1.type = hdfs
smv.con.datadir1.path = hdfs://namenode:32000/users/me/mydata

# another hdfs dir connection
smv.con.datadir2.type = hdfs
smv.con.datadir2.path = hdfs://namenode2/users/other/data

An instance of SmvInput or SmvOutput will only need to specify the "name" of the connection (e.g. "myjdbc" from the example above). SMV will take care of looking up all the attributes declared by the connection class and creating an instance of the connection class to hand over to the input/output instance.

Each smv.con.xxx.type type would map to a concrete SmvConnection class. For example, smv.con.datadir1.type.hdfs would map to an SmvHdfsDirCon class instance.
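
The lookup described above can be sketched roughly as follows. This is a minimal illustration and not SMV's implementation: the `CONNECTION_TYPES` registry, the `build_connection` helper, the simplified `SmvConnection` base class, and the `["path"]` attribute list for `SmvHdfsDirCon` are all assumptions made for the example. Only the `smv.con.<name>.<attr>` key layout comes from the config shown earlier.

```python
from types import SimpleNamespace


class SmvConnection:
    """Hypothetical base class: subclasses list the attributes they need."""
    attributes = []

    def __init__(self, attrs):
        for name in self.attributes:
            setattr(self, name, getattr(attrs, name, None))


class SmvJdbcConnection(SmvConnection):
    attributes = ["url", "user", "password"]


class SmvHdfsDirCon(SmvConnection):
    attributes = ["path"]  # assumed attribute list


# Assumed registry: value of smv.con.<name>.type -> connection class.
CONNECTION_TYPES = {"jdbc": SmvJdbcConnection, "hdfs": SmvHdfsDirCon}


def build_connection(props, name):
    """Create the connection called `name` from a flat dict of properties."""
    prefix = "smv.con.{}.".format(name)
    conf = {k[len(prefix):]: v for k, v in props.items() if k.startswith(prefix)}
    cls = CONNECTION_TYPES[conf["type"]]
    # Hand over only the attributes the connection class declares.
    attrs = SimpleNamespace(**{a: conf.get(a) for a in cls.attributes})
    return cls(attrs)


props = {
    "smv.con.datadir1.type": "hdfs",
    "smv.con.datadir1.path": "hdfs://namenode:32000/users/me/mydata",
}
con = build_connection(props, "datadir1")
print(type(con).__name__, con.path)
```

An input or output module would then only carry the string "datadir1"; everything else is resolved from the configuration.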

Example Connection Impl

class SmvJdbcConnection(SmvConnection):
    # attributes that SMV looks up under smv.con.<name>.*
    attributes = ["url", "user", "password"]

    def __init__(self, attrs):
        self.url = attrs.url
        self.user = attrs.user
        self.password = attrs.password

    # TBD: either called by SMV or input/output class.
    def open(self):
        return None  # placeholder for the connection object instance.

Example input module type

Base class defined by SMV for all CSV files on HDFS.

import abc

# base SMV class defining an input type
class SmvCsvHdfsInputFile(SmvInput):
    @property
    @abc.abstractmethod
    def relPath(self) -> str: ...

    @property
    @abc.abstractmethod
    def connectionName(self) -> str: ...

    def doit(self):
        ci = self.getConnection(self.connectionName)
        c = ci.open()  # assuming we are handling open/close from input/output directly for now.
        fullPath = c.hdfsDirPath + "/" + self.relPath
        df = open(fullPath)  # pseudocode: read the CSV at fullPath into a DataFrame
        return df

Example input file

A single CSV file on HDFS. This is what a user would create (or generate). Note that there is no way for the user to specify a run method or any processing here. This is pure input.

class MyCsvFile(SmvCsvHdfsInputFile):
    relPath = "myfile.csv"
    connectionName = "datadir1"
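
To make the flow concrete, here is a small self-contained sketch that can be run without SMV or HDFS. It swaps HDFS for a local temporary directory and returns parsed rows instead of a DataFrame. `LocalDirConnection`, `CsvDirInputFile`, and the `connections` dict are hypothetical stand-ins for the real connection and base input classes, not part of the proposed API.

```python
import csv
import os
import tempfile


class LocalDirConnection:
    """Stand-in for a directory connection; holds connect info only."""

    def __init__(self, path):
        self.path = path


class CsvDirInputFile:
    """Stand-in for the base input class: resolves relPath inside a connection."""
    relPath = None
    connectionName = None

    def __init__(self, connections):
        self.connections = connections  # connection name -> connection instance

    def getConnection(self, name):
        return self.connections[name]

    def doit(self):
        con = self.getConnection(self.connectionName)
        fullPath = os.path.join(con.path, self.relPath)
        with open(fullPath, newline="") as f:
            return list(csv.DictReader(f))


class MyCsvFile(CsvDirInputFile):
    relPath = "myfile.csv"
    connectionName = "datadir1"


# Usage: create a sample file, register the connection, read through the module.
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, "myfile.csv"), "w", newline="") as f:
    f.write("id,name\n1,a\n2,b\n")

rows = MyCsvFile({"datadir1": LocalDirConnection(data_dir)}).doit()
print(rows)
```

The user-facing class stays purely declarative: it names a connection and a relative path, and all reading logic lives in the base class.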

TBD

connection management

Where should the management of the actual connection occur (e.g. opening a JDBC connection)? We have a couple of options:

1. Provide open/close methods on the connection class that SMV would call before/after providing the connection to the input/output module.
2. Make the concrete input/output modules handle opening/closing the connections themselves. The connection object above would then really just be the connection info and not the connection itself.

Option 1 above would allow for connection pooling and further optimizations but at the cost of more complexity at the SMV level.
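
The difference between the two options can be illustrated with a rough sketch. Everything here is hypothetical: `FakeConnection` only records calls, and `managed`, `run_option1`, and `run_option2` are illustrative names rather than proposed SMV functions.

```python
from contextlib import contextmanager


class FakeConnection:
    """Hypothetical connection that records open/close calls."""

    def __init__(self):
        self.log = []

    def open(self):
        self.log.append("open")
        return self

    def close(self):
        self.log.append("close")


@contextmanager
def managed(connection):
    """Option 1: the framework opens before the module runs and closes after."""
    handle = connection.open()
    try:
        yield handle
    finally:
        connection.close()


def run_option1(connection, module_fn):
    # The module only ever sees an already opened handle.
    with managed(connection) as handle:
        return module_fn(handle)


def run_option2(connection_info, module_fn):
    # Option 2: the framework passes connect info; the module owns the lifecycle.
    return module_fn(connection_info)


def self_managing_module(info):
    handle = info.open()
    try:
        return "option 2 used " + type(handle).__name__
    finally:
        info.close()


c1, c2 = FakeConnection(), FakeConnection()
r1 = run_option1(c1, lambda handle: "option 1 used " + type(handle).__name__)
r2 = run_option2(c2, self_managing_module)
print(c1.log, c2.log)
```

With option 1 the open/close logic lives in one place, which is where pooling could later be added; with option 2 every module type repeats it, but the framework stays simpler.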

Notes:

The names above are only suggestions and will probably change in the real implementation. For example, we already have an SmvOutput and should come up with a new name to avoid conflict.
If we decide to move connection management inside the input/output rather than inside SMV, we should rename SmvConnection to SmvConnectionInfo to distinguish between connection attributes and the actual connection.

AliTajeldin commented 5 years ago

@ninjapapa ready for review.

ninjapapa commented 5 years ago

I don't understand this part: For example, smv.con.datadir1.type.hdfs would map to an SmvHdfsDirCon class instance.

Another question is about naming: we already have SmvModule, so creating another could be hard and confusing.

ninjapapa commented 5 years ago

Regarding class naming, how about:

SmvGenericModule <- SmvProcessModule
SmvGenericModule <- SmvIoModule

SmvProcessModule <- SmvSparkDfModule
SmvProcessModule <- SmvKerasModel
...
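
Reading each arrow as "base class <- subclass" (an assumption about the notation), the proposed hierarchy would look roughly like this skeleton. The class bodies are left empty on purpose, since the thread does not define them.

```python
class SmvGenericModule:
    """Root of the proposed hierarchy."""


class SmvProcessModule(SmvGenericModule):
    """Modules that process inputs with a run method."""


class SmvIoModule(SmvGenericModule):
    """Input and output modules."""


class SmvSparkDfModule(SmvProcessModule):
    pass


class SmvKerasModel(SmvProcessModule):
    pass


print([cls.__name__ for cls in SmvKerasModel.__mro__[:-1]])
```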

ninjapapa commented 5 years ago

~~Should we have a doit method for both input and output, or a read for input and write for output?~~ Never mind. We do need doit.

AliTajeldin commented 5 years ago

Closing as the design is complete and we have already completed some of the implementation.

TresAmigosSD / SMV

Create design doc for generic input and output modules #1484
