Closed by garyelephant 6 years ago
Detail-level TODOs:
- Exactly-once semantics: acks, idempotence, transactions.
- The return type of BaseInput's getDStream is not generic; this is expected to cause problems when implementing input plugins.
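One way to address the non-generic return type would be to parameterize the input over its record type. A minimal plain-Scala sketch of the idea (no Spark dependency; `GenericInput`, `StringPairInput`, and `LogEventInput` are hypothetical names, and `Seq[T]` stands in for `DStream[T]`):

```scala
// Hypothetical sketch: a type-parameterized input, so plugins are not
// forced into DStream[(String, String)]. Seq[T] stands in for DStream[T].
trait GenericInput[T] {
  def getRecords: Seq[T]
  // hooks mirroring BaseInput: run before/after the output stage
  def beforeOutput(): Unit = ()
  def afterOutput(): Unit = ()
}

// A plugin can keep the old (String, String) record shape...
class StringPairInput extends GenericInput[(String, String)] {
  def getRecords = Seq(("key1", "line one"), ("key2", "line two"))
}

// ...while another is free to emit a richer record type.
case class LogEvent(ts: Long, msg: String)
class LogEventInput extends GenericInput[LogEvent] {
  def getRecords = Seq(LogEvent(0L, "started"))
}
```

With this shape, downstream filters would be compiled against the plugin's concrete record type instead of always receiving string pairs.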
```scala
abstract class BaseInput(config: Config) extends Plugin {

  /**
   * No matter what kind of Input it is, all it has to do is create a DStream to be used later.
   */
  def getDStream: DStream[(String, String)]

  /**
   * Things to do after filter and before output.
   */
  def beforeOutput: Unit = {}

  /**
   * Things to do after output, such as updating offsets.
   */
  def afterOutput: Unit = {}
}
```
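The `afterOutput` hook is where an input could commit offsets only once the output stage has succeeded, which is one building block for the at-least-once / exactly-once semantics mentioned above. A plain-Scala sketch (no Spark or Kafka dependency; `OffsetTrackingInput` and its in-memory offset store are hypothetical):

```scala
// Hypothetical sketch: commit consumed offsets in afterOutput, so a crash
// between read and write replays uncommitted records instead of losing them.
class OffsetTrackingInput {
  private var pendingOffset: Long = 0L   // highest offset handed to the pipeline
  private var committedOffset: Long = 0L // made durable only after output succeeds

  def read(records: Seq[String]): Seq[String] = {
    pendingOffset += records.size
    records
  }

  // called by the pipeline once the output stage has succeeded
  def afterOutput(): Unit = {
    committedOffset = pendingOffset
  }

  def committed: Long = committedOffset
}
```

Exactly-once additionally needs the output side to be idempotent or transactional, as the TODO notes.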
Spark Benchmark: https://github.com/intel-hadoop/HiBench
November 17, 2017
(1) Change to reading the config path from command-line arguments: `--conf spark.driver.extraJavaOptions=-Dconfig.path=application.conf`
(2) spark.submit.deployMode
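A JVM option set via `spark.driver.extraJavaOptions` surfaces on the driver as a system property, so reading the path back is a one-liner. A minimal sketch (the `application.conf` fallback and the `ConfigPathResolver` name are assumptions):

```scala
// Read the config path passed as -Dconfig.path=..., with a default fallback.
object ConfigPathResolver {
  def resolve(default: String = "application.conf"): String =
    sys.props.getOrElse("config.path", default)
}
```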
```scala
val spark = SparkSession
  .builder()
  .appName("SparkApp")
  .master("spark://192.168.60.80:7077")
  .config("spark.submit.deployMode", "cluster")
  .enableHiveSupport()
  .getOrCreate()
```
Configuration parsing
- Parse the Spark common config (spark.*, plus appname and duration).
- Error messages and error locating when parsing the config file.
- Plugin workflow framework.
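A minimal sketch of separating the spark.* settings from the app-level keys in a flat config map (`splitConfig` is a hypothetical helper; treating `appname` and `duration` as app-level keys follows the note above):

```scala
// Hypothetical helper: partition a flat config into spark.* settings
// (to be passed through to SparkConf) and app-level settings.
def splitConfig(conf: Map[String, String])
    : (Map[String, String], Map[String, String]) =
  conf.partition { case (key, _) => key.startsWith("spark.") }
```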
- Finalize the BaseFilter interface definition (key point: filters, including those from other developers, should be auto-registered as UDFs on demand).
- Finalize the BaseInput and BaseOutput interface definitions (considering the use of broadcast variables and accumulators, and the relationship to Spark input/output formats).
- Support multiple inputs and outputs in the pipeline code.
- Support integrating plugins from external developers (Java/Scala).
- Input, Filter, and Output plugin development.
- Simplify the end-to-end workflow of the Input and Output plugins.
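The "auto-register filters as UDFs" idea can be sketched without Spark: each filter exposes a name and a string function, and a registry collects them. `NamedFilter` and `FilterRegistry` are hypothetical stand-ins; in Spark the registry would call `spark.udf.register` instead of keeping its own map.

```scala
import scala.collection.mutable

// Simplified stand-in for BaseFilter: a named string transformation.
trait NamedFilter {
  def udfName: String
  def transform(value: String): String
}

// Hypothetical registry; with Spark this would call spark.udf.register(...).
object FilterRegistry {
  private val udfs = mutable.Map.empty[String, String => String]
  def register(filter: NamedFilter): Unit =
    udfs(filter.udfName) = filter.transform
  def lookup(name: String): Option[String => String] = udfs.get(name)
}

// A third-party filter registers itself the same way a built-in one would.
class LowercaseFilter extends NamedFilter {
  val udfName = "lowercase"
  def transform(value: String): String = value.toLowerCase
}
```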
- Configuration documentation in both Chinese and English.
- A unified documentation format for plugin definitions.