Closed ninjapapa closed 5 years ago
At a high level, there are three types of modules:
The model and model-exec modules would just become generic modules. In the example above, KerasModel
would be used for training, and another generic module would simply require the KerasModel
module to do the scoring. No need for a special model-exec module.
Each connection instance would have a unique name for a given project. We would utilize the current SMV config parser to read the connection configuration under the smv.con
namespace. For example:
# my jdbc connection
smv.con.myjdbc.type = jdbc
smv.con.myjdbc.url = postgres://localhost:1000/...
smv.con.myjdbc.schema = myschema
# hdfs dir connection
smv.con.datadir1.type = hdfs
smv.con.datadir1.path = hdfs://namenode:32000/users/me/mydata
# another hdfs dir connection
smv.con.datadir2.type = hdfs
smv.con.datadir2.path = hdfs://namenode2/users/other/data
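As a rough sketch of how the config layout above could be grouped by connection name, here is a small, hypothetical parser (the real SMV config parser differs; `parse_connections` and the property-dict input are illustrative only):

```python
def parse_connections(props):
    """Group flat 'smv.con.<name>.<attr> = value' entries into per-connection dicts."""
    connections = {}
    prefix = "smv.con."
    for key, value in props.items():
        if not key.startswith(prefix):
            continue
        # split "myjdbc.type" into connection name and attribute name
        name, _, attr = key[len(prefix):].partition(".")
        connections.setdefault(name, {})[attr] = value
    return connections

# properties taken from the example above (URL shortened for illustration)
props = {
    "smv.con.myjdbc.type": "jdbc",
    "smv.con.myjdbc.url": "postgres://localhost:1000/db",
    "smv.con.datadir1.type": "hdfs",
    "smv.con.datadir1.path": "hdfs://namenode:32000/users/me/mydata",
}
cons = parse_connections(props)
```

Each entry of `cons` then holds all attributes of one named connection, which is exactly what an input/output module would look up by name.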
An instance of SmvInput
or SmvOutput
will only need to specify the "name" of the connection (e.g. "myjdbc" from the example above). SMV will take care of looking up all the attributes declared by the connection class and creating an instance of the connection class to hand over to the input/output instance.
Each smv.con.xxx.type
value would map to a concrete SmvConnection
class. For example, smv.con.datadir1.type = hdfs
would map to an SmvHdfsDirCon
class instance.
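One way to realize the type-to-class mapping is a simple registry; this is a hypothetical sketch (the registry dict and `make_connection` helper are not part of the design, only the class names are):

```python
class SmvConnection:
    """Minimal stand-in for the SmvConnection base class."""
    def __init__(self, attrs):
        self.attrs = attrs

class SmvJdbcConnection(SmvConnection):
    pass

class SmvHdfsDirCon(SmvConnection):
    pass

# the "type" attribute from smv.con.<name>.type selects the concrete class
CONNECTION_TYPES = {
    "jdbc": SmvJdbcConnection,
    "hdfs": SmvHdfsDirCon,
}

def make_connection(attrs):
    return CONNECTION_TYPES[attrs["type"]](attrs)

con = make_connection({"type": "hdfs", "path": "hdfs://namenode2/users/other/data"})
```

With this shape, adding a new connection type is just another entry in the registry.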
class SmvJdbcConnection(SmvConnection):
    attributes = ["url", "user", "password"]

    def __init__(self, attrs):
        self.url = attrs.url
        self.user = attrs.user
        self.password = attrs.password

    # TBD: either called by SMV or by the input/output class.
    def open(self):
        return ...  # connection object instance
Base class defined by SMV for all CSV files on HDFS.
# base SMV class defining an input type
class SmvCsvHdfsInputFile(SmvInput):
    @abstract relPath: string
    @abstract connectionName: string

    def doit(self):
        ci = self.getConnection(self.connectionName)
        c = ci.open()  # assuming we are handling open/close from input/output directly for now
        fullPath = c.hdfsDirPath + "/" + self.relPath
        df = open(fullPath)
        return df
A single CSV file on HDFS. This would be what a user would create (or generate). Note that there is no way for a user to specify a run
method or any processing here. This is pure input.
class MyCsvFile(SmvCsvHdfsInputFile):
    relPath = "myfile.csv"
    connectionName = "datadir1"
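To check that the pieces above compose, here is a toy, runnable version of the whole flow, with a local temp directory standing in for HDFS. Only the class and attribute names come from the sketches in this issue; the `CONNECTIONS` registry and the open/read behavior are assumptions for illustration:

```python
import os
import tempfile

class SmvHdfsDirCon:
    def __init__(self, attrs):
        self.hdfsDirPath = attrs["path"]
    def open(self):
        return self  # a real implementation would return a filesystem handle

# hypothetical lookup table standing in for SMV's connection registry
CONNECTIONS = {}

class SmvCsvHdfsInputFile:
    def getConnection(self, name):
        return CONNECTIONS[name]
    def doit(self):
        c = self.getConnection(self.connectionName).open()
        fullPath = c.hdfsDirPath + "/" + self.relPath
        with open(fullPath) as f:
            return f.read()  # a real implementation would return a DataFrame

class MyCsvFile(SmvCsvHdfsInputFile):
    relPath = "myfile.csv"
    connectionName = "datadir1"

# set up a local "data dir" and register it under the connection name
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "myfile.csv"), "w") as f:
    f.write("a,b\n1,2\n")
CONNECTIONS["datadir1"] = SmvHdfsDirCon({"path": tmpdir})

content = MyCsvFile().doit()
```

Note that MyCsvFile itself carries no processing logic at all, which matches the "pure input" point above.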
Where should the management of the actual connection occur (e.g. opening a JDBC connection)? We have a couple of options:
1. Add open
/close
methods to the connection class that SMV would call before/after providing the connection to the input/output module.
2. Let the input/output module open/close the connection directly.
Option 1 above would allow for connection pooling and further optimizations, but at the cost of more complexity at the SMV level.
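Option 1 (SMV calling open/close around the module's use of the connection) could be sketched with a context manager; the `FakeJdbcConnection` class and `managed` helper below are hypothetical, purely to show the lifecycle shape:

```python
from contextlib import contextmanager

class FakeJdbcConnection:
    """Stand-in connection that just tracks open/closed state."""
    def __init__(self):
        self.opened = False
    def open(self):
        self.opened = True
        return self
    def close(self):
        self.opened = False

@contextmanager
def managed(conn):
    # SMV opens the connection before handing it to the input/output module...
    handle = conn.open()
    try:
        yield handle
    finally:
        # ...and closes it afterwards; a pool could reuse handles here instead.
        conn.close()

conn = FakeJdbcConnection()
with managed(conn) as h:
    was_open = h.opened  # module sees an opened connection
```

The `finally` clause is where pooling or retry logic would live, which is exactly the extra SMV-level complexity the option trades for.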
We already have SmvOutput
and should come up with a new name to avoid the conflict. Also rename SmvConnection
to SmvConnectionInfo
to distinguish between the connection attributes and the actual connection.
@ninjapapa ready for review.
I don't understand this part:
For example, smv.con.datadir1.type.hdfs would map to an SmvHdfsDirCon class instance.
Another question is on naming: we already have SmvModule
, and it could be hard and confusing to create another.
Regarding class naming, how about:
SmvGenericModule <- SmvProcessModule
SmvGenericModule <- SmvIoModule
SmvProcessModule <- SmvSparkDfModule
SmvProcessModule <- SmvKerasModel
...
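Reading each arrow above as "parent <- child", the proposed hierarchy could be sketched as empty class stubs (this is only an illustration of the naming proposal, not an implementation):

```python
class SmvGenericModule:
    """Common base for everything."""

class SmvProcessModule(SmvGenericModule):
    """Modules that transform data."""

class SmvIoModule(SmvGenericModule):
    """Input/output modules."""

class SmvSparkDfModule(SmvProcessModule):
    """Process module producing a Spark DataFrame."""

class SmvKerasModel(SmvProcessModule):
    """Process module producing a Keras model."""
```

This keeps SmvModule untouched while giving input/output and processing modules a shared SmvGenericModule ancestor.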
Should we have a doit
method for both input and output, or a read
for input and a write
for output?
Never mind. We do need doit
.
Closing as the design is complete and we have already completed some of the implementation.
As part of the
SmvGenericModule
framework, the input and output side will have a redesign. At a high level, we will introduce the concept of SmvConnection
, and an SmvGenericInput
will have an SmvConnection
and a table/file pointer within that connection. SmvGenericOutput
is similar. This issue is to create the design document.