M1 TODO - Githubissues

garyelephant commented 7 years ago

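Configuration parsing
- ~~Parse the common Spark config (spark.*, plus appname and duration)~~
- ~~Report configuration file errors and their location~~
- [Optional] Implement the code for if..else logic [directly related to the plugin pipeline framework]
- [Optional] User-predefined template variables and system environment variable substitution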
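
The variable substitution item could look roughly like the sketch below. It assumes a `${name}` placeholder syntax, which this issue does not specify, and the names `TemplateSketch` and `substitute` are hypothetical. User-defined variables take precedence over environment variables, and unknown placeholders are left untouched.

```scala
import scala.util.matching.Regex

object TemplateSketch {
  // Assumed placeholder syntax: ${name}. The real syntax is not decided in this issue.
  private val Placeholder: Regex = """\$\{([A-Za-z_][A-Za-z0-9_]*)\}""".r

  // Replace each placeholder with the user-defined value if present,
  // otherwise with the environment variable of the same name.
  // Placeholders with no known value are kept as-is.
  def substitute(text: String,
                 userVars: Map[String, String],
                 env: Map[String, String] = sys.env): String =
    Placeholder.replaceAllIn(text, m => {
      val name = m.group(1)
      val value = userVars.get(name).orElse(env.get(name)).getOrElse(m.matched)
      // quoteReplacement stops '$' and '\' in the value from being read as group references.
      Regex.quoteReplacement(value)
    })
}
```

For example, `TemplateSketch.substitute("path=${dir}/out", Map("dir" -> "/tmp"))` would yield `path=/tmp/out`.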
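Plugin pipeline framework
- ~~Finalize the BaseFilter interface (key point: filters, including those written by other developers, are automatically registered as UDFs as needed)~~
- ~~Finalize the BaseInput and BaseOutput interfaces (taking into account the use of broadcast variables and accumulators, and the relationship to Spark input/output formats)~~
- ~~Support multiple inputs and outputs in the pipeline code~~
- Relationship between Serializer and the other plugins
- ~~Integrate plugins from external developers (supported: Java/Scala)~~
- [Optional] Field Reference
- [Optional] Support if..else logic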
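
One common way to integrate Java/Scala plugins from external developers is to instantiate them by class name through reflection. The sketch below illustrates that idea only. The `Plugin` trait here is a hypothetical stand-in, not the project's real `Plugin` type, and it assumes every plugin has a public no-argument constructor.

```scala
object PluginLoaderSketch {
  // Hypothetical stand-in for the project's plugin base type.
  trait Plugin { def name: String }

  // Example plugin that an external developer might ship in their own jar.
  class UppercaseFilter extends Plugin { val name: String = "uppercase" }

  // Load a plugin by fully qualified class name.
  // Returns Left with a message instead of throwing when loading fails.
  def load(className: String): Either[String, Plugin] =
    try {
      Class.forName(className).getDeclaredConstructor().newInstance() match {
        case p: Plugin => Right(p)
        case other     => Left(s"${other.getClass.getName} does not implement Plugin")
      }
    } catch {
      // Covers ClassNotFoundException, NoSuchMethodException, InstantiationException, etc.
      case e: ReflectiveOperationException => Left(s"cannot load $className: $e")
    }
}
```

Returning `Either` rather than throwing lets the caller report a misconfigured plugin name alongside other configuration errors.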
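Input, Filter, and Output plugin development
- ~~Input plugins~~
- Filter plugins
- ~~Output plugins~~
- Functional testing of the Input, Filter, and Output plugins (Spark on YARN [client, cluster] modes, Spark on Mesos, local)
Simplify the end-to-end workflow
- Separate the different build.sbt files
- Take over the whole Spark + waterdrop workflow, while still allowing waterdrop to run as a plain Spark job.
- Installation
- Deployment (3 deployment modes)
- Plugin integration
- Configuration
- Running
Chinese and English documentation
- ~~Unified documentation format for plugin definitions~~
- Complete Chinese documentation (with a focus on plugin docs)
- Complete English documentation (with a focus on plugin docs)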

[Release at this milestone]

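Performance report
- [Optional] Test stability, processing performance, and consistency under large data volumes.
- [Optional] Performance report
- [Optional] Performance tuning (parallelism, filter framework code)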

garyelephant commented 7 years ago

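Detailed TODOs:

- Logging
- User requirement: freely chosen parallelism. Use repartition to increase or decrease the parallelism and the number of output files.
- A unified documentation format for plugin definitions that supports rich UDFs, Text (markdown), and multiple languages
- Split the plugin API out into a separate artifact so that plugins can depend on it easily.
- DStream / Spark Streaming window operations: https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations, #42
- Monitoring: StreamingListener / App monitor
- Performance tuning: input tuning (e.g. the number of DStreams/receivers, see https://spark.apache.org/docs/latest/streaming-programming-guide.html), filter tuning, output tuning
- Input and output cooperate to achieve exactly-once semantics: ack, idempotence, transactions
- Support offline (batch) computation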
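
Of the three exactly-once options listed above, idempotence is the simplest to sketch. If every record carries a stable unique key, replaying a batch after a failure overwrites the earlier writes instead of duplicating them. The example below is an in-memory illustration, not waterdrop code. `Record`, `IdempotentSink`, and the choice of (topic, partition, offset) as the key are all assumptions.

```scala
import scala.collection.mutable

object IdempotentSinkSketch {
  // Hypothetical record shape: (topic, partition, offset) uniquely identifies a record.
  final case class Record(topic: String, partition: Int, offset: Long, value: String)

  // In-memory stand-in for an external store that supports upsert by key.
  class IdempotentSink {
    private val store = mutable.LinkedHashMap.empty[(String, Int, Long), String]

    // Writing the same record twice leaves exactly one entry for it.
    def write(batch: Seq[Record]): Unit =
      batch.foreach(r => store.update((r.topic, r.partition, r.offset), r.value))

    def size: Int = store.size
    def values: Seq[String] = store.values.toSeq
  }
}
```

With such a sink, the input only has to guarantee at-least-once delivery (for instance by committing offsets after the output succeeds), and the duplicates produced by a retry are absorbed by the upsert.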

garyelephant commented 7 years ago

The return type of BaseInput's getDStream is not generic, so problems are expected when implementing input plugins.

abstract class BaseInput(config: Config) extends Plugin {

  /**
   * No matter what kind of Input it is, all you have to do is create a DStream to be used later
   * */
  def getDStream: DStream[(String, String)]

  /**
   * Things to do after filter and before output
   * */
  def beforeOutput: Unit = {}

  /**
   * Things to do after output, such as update offset
   * */
  def afterOutput: Unit = {}

}
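
One possible direction, shown here only as a sketch, is to give the input a type parameter and let each plugin say how its records map to the (String, String) pairs the rest of the pipeline expects. To keep the example compilable without Spark on the classpath, `FakeDStream` stands in for Spark's `DStream`. `GenericInput`, `toKeyValue`, `LogLine`, and `LogInput` are hypothetical names and not part of the project.

```scala
object GenericInputSketch {
  // Placeholder for org.apache.spark.streaming.dstream.DStream (assumption, not Spark's API).
  final case class FakeDStream[T](records: Seq[T]) {
    def map[U](f: T => U): FakeDStream[U] = FakeDStream(records.map(f))
  }

  // T is the record type that a concrete input naturally produces.
  abstract class GenericInput[T] {
    def getDStream: FakeDStream[T]

    // Each input states how one of its records becomes a (key, value) pair.
    def toKeyValue(record: T): (String, String)

    // The pipeline can keep consuming (String, String) regardless of T.
    final def getKeyValueStream: FakeDStream[(String, String)] =
      getDStream.map(toKeyValue)
  }

  final case class LogLine(host: String, message: String)

  class LogInput(lines: Seq[LogLine]) extends GenericInput[LogLine] {
    def getDStream: FakeDStream[LogLine] = FakeDStream(lines)
    def toKeyValue(record: LogLine): (String, String) = (record.host, record.message)
  }
}
```

The trade-off is that every input must implement the conversion, but the filter and output stages do not need to know about input-specific record types.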

garyelephant commented 7 years ago

Spark Benchmark: https://github.com/intel-hadoop/HiBench

garyelephant commented 6 years ago

November 17, 2017

(1) Read it from the command arguments instead: --conf spark.driver.extraJavaOptions=-Dconfig.path=application.conf
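
The `-Dconfig.path=...` option in (1) sets a JVM system property on the driver, so the application can read it back with `sys.props`. A minimal sketch follows. `ConfigPathSketch`, `resolveConfigPath`, and the fallback to `application.conf` are assumptions made for illustration.

```scala
object ConfigPathSketch {
  // Read the config file path from the "config.path" system property.
  // The props parameter defaults to the real JVM properties and exists to ease testing.
  // The fallback value is an assumption, not behaviour stated in this issue.
  def resolveConfigPath(props: scala.collection.Map[String, String] = sys.props,
                        default: String = "application.conf"): String =
    props.get("config.path").filter(_.nonEmpty).getOrElse(default)
}
```

Calling `ConfigPathSketch.resolveConfigPath()` on the driver would then return whatever path was passed on the command line, or the default when the property is missing or empty.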

(2) spark.submit.deployMode

val spark = SparkSession
   .builder()
   .appName("SparkApp")
   .master("spark://192.168.60.80:7077")
   .config("spark.submit.deployMode","cluster")
   .enableHiveSupport()
   .getOrCreate()

apache / seatunnel

M1 TODO #38