TresAmigosSD / SMV

Spark Modularized View
Apache License 2.0

Create design doc for generic input and output modules #1484

Closed ninjapapa closed 5 years ago

ninjapapa commented 5 years ago

As part of the SmvGenericModule framework, the input and output side will be redesigned. At a high level, we will introduce the concept of an SmvConnection; an SmvGenericInput will have an SmvConnection and a table/file pointer within that connection. SmvGenericOutput is similar.

This issue is to create the design document.

AliTajeldin commented 5 years ago

Class Diagram

[Figure: class diagram]

From a high level, there are three types of modules:

Models

Model and model exec modules would just become generic modules. In the example above, KerasModel would be used for training, and another generic module would simply require the KerasModel module to do the scoring. There is no need for a special model exec module.

Config

Each connection instance would have a unique name for a given project. We would utilize the current SMV config parser to read the connection configuration under the smv.con namespace. For example:

# my jdbc connection
smv.con.myjdbc.type = jdbc
smv.con.myjdbc.url = postgresql://localhost:1000/...
smv.con.myjdbc.schema = myschema

# hdfs dir connection
smv.con.datadir1.type = hdfs
smv.con.datadir1.path = hdfs://namenode:32000/users/me/mydata

# another hdfs dir connection
smv.con.datadir2.type = hdfs
smv.con.datadir2.path = hdfs://namenode2/users/other/data

An instance of SmvInput or SmvOutput will only need to specify the "name" of the connection (e.g. "myjdbc" from the example above). SMV will take care of looking up all the attributes declared by the connection class and creating an instance of the connection class to hand over to the input/output instance.

Each smv.con.xxx.type value would map to a concrete SmvConnection class. For example, smv.con.datadir1.type.hdfs would map to an SmvHdfsDirCon class instance.
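A minimal sketch of the lookup described above, assuming the configuration has already been parsed into a flat dict of property names to values. The helpers `group_connection_props` and `create_connection`, the `CONNECTION_TYPES` registry, and the stub classes are hypothetical names used for illustration; they are not SMV API.

```python
from types import SimpleNamespace

class SmvConnection:
    """Stub base class; each subclass declares the attributes it needs."""
    attributes = []

    def __init__(self, attrs):
        for name in self.attributes:
            # attributes missing from the config are left as None
            setattr(self, name, getattr(attrs, name, None))

class SmvJdbcConnection(SmvConnection):
    attributes = ["url", "user", "password"]

class SmvHdfsDirCon(SmvConnection):
    attributes = ["path"]

# Hypothetical registry: value of smv.con.<name>.type -> connection class
CONNECTION_TYPES = {"jdbc": SmvJdbcConnection, "hdfs": SmvHdfsDirCon}

PREFIX = "smv.con."

def group_connection_props(props):
    """Group flat 'smv.con.<name>.<attr>' keys into {name: {attr: value}}."""
    grouped = {}
    for key, value in props.items():
        if not key.startswith(PREFIX):
            continue
        name, _, attr = key[len(PREFIX):].partition(".")
        if attr:
            grouped.setdefault(name, {})[attr] = value
    return grouped

def create_connection(props, name):
    """Look up the named connection and build an instance of its class."""
    conf = group_connection_props(props)[name]
    cls = CONNECTION_TYPES[conf["type"]]
    # hand over only the attributes the connection class declares
    attrs = SimpleNamespace(**{a: conf.get(a) for a in cls.attributes})
    return cls(attrs)

props = {
    "smv.con.datadir1.type": "hdfs",
    "smv.con.datadir1.path": "hdfs://namenode:32000/users/me/mydata",
    "smv.con.datadir2.type": "hdfs",
    "smv.con.datadir2.path": "hdfs://namenode2/users/other/data",
}
con = create_connection(props, "datadir1")
print(type(con).__name__, con.path)
```

With this shape, an input or output module only carries the connection name; the type-to-class mapping and attribute lookup stay in one place.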

Example Connection Impl

class SmvJdbcConnection(SmvConnection):
  # attribute names SMV looks up under smv.con.<name>.*
  attributes = ["url", "user", "password"]

  def __init__(self, attrs):
    self.url = attrs.url
    self.user = attrs.user
    self.password = attrs.password

  # TBD: either called by SMV or input/output class.
  def open(self):
    return  # connection object instance.

Example input module type

Base class defined by SMV for all CSV files on HDFS.

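import abc

# base SMV class defining an input type
class SmvCsvHdfsInputFile(SmvInput):
  @property
  @abc.abstractmethod
  def relPath(self):
    """File path (string) relative to the connection's directory."""

  @property
  @abc.abstractmethod
  def connectionName(self):
    """Name (string) of the connection to read from."""

  def doit(self):
    ci = self.getConnection(self.connectionName)
    c = ci.open()  # assuming we are handling open/close from input/output directly for now.
    fullPath = c.hdfsDirPath + "/" + self.relPath
    df = open(fullPath)  # placeholder for reading the CSV into a DataFrame
    return df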

Example input file

A single CSV file on HDFS. This is what a user would create (or generate). Note that there is no way for the user to specify a run method or any processing here. This is pure input.

class MyCsvFile(SmvCsvHdfsInputFile):
  relPath = "myfile.csv"
  connectionName = "datadir1"
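To make the flow concrete, here is a self-contained sketch that mirrors the classes above, with a local directory in place of HDFS and Python's csv module in place of a DataFrame reader. `LocalDirConnection`, `CsvDirInputFile`, and the `connections` dict passed to the constructor are assumptions made for illustration, not SMV API.

```python
import abc
import csv
import os
import tempfile

class LocalDirConnection:
    """Stand-in for an HDFS dir connection; only holds a directory path."""
    def __init__(self, path):
        self.path = path

class CsvDirInputFile(abc.ABC):
    """Pure input: subclasses only name a connection and a relative path."""
    def __init__(self, connections):
        # connections: {name: connection instance}, normally built by SMV
        self._connections = connections

    @property
    @abc.abstractmethod
    def relPath(self):
        """File path relative to the connection's directory."""

    @property
    @abc.abstractmethod
    def connectionName(self):
        """Name of the connection to read from."""

    def getConnection(self, name):
        return self._connections[name]

    def doit(self):
        conn = self.getConnection(self.connectionName)
        fullPath = os.path.join(conn.path, self.relPath)
        with open(fullPath, newline="") as f:
            return list(csv.DictReader(f))  # rows as dicts, not a DataFrame

class MyCsvFile(CsvDirInputFile):
    # plain class attributes satisfy the abstract properties
    relPath = "myfile.csv"
    connectionName = "datadir1"

# Usage: write a small CSV into a temp dir and read it back via the module.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "myfile.csv"), "w", newline="") as f:
        f.write("id,name\n1,a\n2,b\n")
    rows = MyCsvFile({"datadir1": LocalDirConnection(d)}).doit()
    print(rows)
```

The user-facing class stays declarative: it names a connection and a path, and all reading logic lives in the base class.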

TBD

connection management

Where should the management of the actual connection occur (e.g. opening a JDBC connection)? We have a couple of options:

Option 1 above would allow for connection pooling and further optimizations but at the cost of more complexity at the SMV level.
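As an illustration of what managing connections at the SMV level could look like, the sketch below opens each named connection at most once and hands the same handle to every input/output that asks for it. `ConnectionManager`, `FakeConnection`, and their methods are hypothetical names, and caching one open handle per name is only a simple stand-in for real pooling.

```python
class FakeConnection:
    """Hypothetical connection that counts how often open() is called."""
    def __init__(self, name):
        self.name = name
        self.open_count = 0

    def open(self):
        self.open_count += 1
        return {"handle_for": self.name}  # placeholder for a real handle

class ConnectionManager:
    """Opens each named connection lazily and reuses the opened handle."""
    def __init__(self, connections):
        self._connections = connections  # {name: connection instance}
        self._handles = {}               # {name: opened handle}

    def get(self, name):
        if name not in self._handles:
            self._handles[name] = self._connections[name].open()
        return self._handles[name]

    def close_all(self):
        # a real implementation would call close() on each handle here
        self._handles.clear()

conn = FakeConnection("myjdbc")
manager = ConnectionManager({"myjdbc": conn})
h1 = manager.get("myjdbc")
h2 = manager.get("myjdbc")
print(h1 is h2, conn.open_count)
```

The extra bookkeeping in the manager is the added complexity mentioned above; the alternative of letting each input/output call open() itself needs no such class but reopens the connection every time.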

Notes:

AliTajeldin commented 5 years ago

@ninjapapa ready for review.

ninjapapa commented 5 years ago

I don't understand this part: For example, smv.con.datadir1.type.hdfs would map to an SmvHdfsDirCon class instance.

Another question is about naming: we already have SmvModule, so it could be confusing to create another.

ninjapapa commented 5 years ago

Regarding class naming, how about:

SmvGenericModule <- SmvProcessModule
SmvGenericModule <- SmvIoModule

SmvProcessModule <- SmvSparkDfModule
SmvProcessModule <- SmvKerasModel
...

ninjapapa commented 5 years ago

Should we have a doit method for both input and output, or a read for input and write for output? Never mind. We do need doit.

AliTajeldin commented 5 years ago

Closing as the design is complete and we have already completed some of the implementation.