wyb commented 4 years ago

Many users who load data into Doris for the first time have a large amount of data (about 10G+), and it is hard to load that much data into Doris at one time using Broker load or Stream load. To resolve this problem, we propose a new solution that loads the data using a Spark cluster.

In Spark load, a Spark cluster preprocesses the data (global dictionary build for bitmap columns, partitioning, sorting, aggregation). This improves Doris load performance for large data volumes and saves Doris computing resources.

Spark load is mainly used for the initial migration from other systems or loading large amounts of data into Doris.
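
To make the preprocessing steps named above more concrete, here is a minimal sketch in plain Python. It is not the Spark implementation: the function names, the `(key, user, amount)` row layout, the modulo bucketing, and the choice of "union of ids plus sum" as the aggregation are all assumptions made for illustration.

```python
from collections import defaultdict

def build_global_dict(values):
    """Map each distinct value to a dense integer id (hypothetical helper).

    A bitmap column stores integers, so non-integer values need an id first.
    """
    return {v: i for i, v in enumerate(sorted(set(values)))}

def preprocess(rows, num_buckets):
    """rows: iterable of (key, user, amount) tuples with integer keys.

    Returns {bucket: [(key, (user_id_set, amount_sum)), ...]} sorted by key.
    """
    rows = list(rows)
    user_ids = build_global_dict(user for _, user, _ in rows)
    buckets = defaultdict(dict)
    for key, user, amount in rows:
        bucket = buckets[key % num_buckets]        # partition (assumed scheme)
        users, total = bucket.get(key, (set(), 0))
        # aggregate rows that share a key: union the ids, sum the amounts
        bucket[key] = (users | {user_ids[user]}, total + amount)
    # sort each bucket by key
    return {b: sorted(agg.items()) for b, agg in buckets.items()}

rows = [(1, "alice", 10), (2, "bob", 5), (1, "bob", 7), (3, "alice", 1)]
result = preprocess(rows, num_buckets=2)
```

The point of doing this work outside Doris is that the output is already partitioned, ordered, and reduced, so the loading side only has to read it.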

                 +
                 | 0. User create spark load job
            +----v----+
            |   FE    |---------------------------------+
            +----+----+                                 |
                 | 3. FE send push tasks                |
                 | 5. FE publish version                |
    +------------+------------+                         |
    |            |            |                         |
+---v---+    +---v---+    +---v---+                     |
|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
+---^---+    +---^---+    +---^---+                     |
    |4. BE push with broker   |                         |
+---+---+    +---+---+    +---+---+                     |
|Broker |    |Broker |    |Broker |                     |
+---^---+    +---^---+    +---^---+                     |
    |            |            |                         |
+---+------------+------------+---+ 2.ETL +-------------v---------------+
|               HDFS              +------->       Spark cluster         |
|                                 <-------+                             |
+---------------------------------+       +-----------------------------+
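
The numbered steps in the diagram can be read as an ordered sequence. The sketch below only models that ordering and which component acts at each step; the `Step` names, the `ACTOR` table, and the `etl_ok` flag are hypothetical and do not come from the Doris code base.

```python
from enum import Enum

class Step(Enum):
    CREATE_JOB = 0        # 0. User creates the Spark load job
    SUBMIT_ETL = 1        # 1. FE submits the Spark ETL job
    RUN_ETL = 2           # 2. Spark reads from and writes back to HDFS
    SEND_PUSH_TASKS = 3   # 3. FE sends push tasks to the BEs
    BE_PUSH = 4           # 4. BEs pull the ETL output through brokers
    PUBLISH_VERSION = 5   # 5. FE publishes the new version

ACTOR = {
    Step.CREATE_JOB: "User",
    Step.SUBMIT_ETL: "FE",
    Step.RUN_ETL: "Spark",
    Step.SEND_PUSH_TASKS: "FE",
    Step.BE_PUSH: "BE",
    Step.PUBLISH_VERSION: "FE",
}

def run_job(etl_ok=True):
    """Return the (step, actor) pairs completed, stopping if the ETL fails."""
    trace = []
    for step in Step:  # Enum iteration follows definition order
        if step is Step.RUN_ETL and not etl_ok:
            break
        trace.append((step.name, ACTOR[step]))
    return trace

trace = run_job()
```

In this reading nothing reaches the BEs until the ETL step has finished, which is why a failed ETL leaves only the first two steps in the trace.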

wyb commented 4 years ago

Design doc

2855 [Proposal] support spark load

2887 [Proposal] Support Spark Convert Doris Segment

3010 Spark load interface

imay commented 4 years ago

You can reference this issue in related PRs and issues.
I will create a project "Spark Load" to track this feature.
You can create an issue for each part of this project.

wangbo commented 4 years ago

Count Distinct Module

3319 Support Java Version HyperLogLog (REVIEWING)

3061 Doris Support Using Hive Table to Build Global Dict (TESTING)

3088 Support Java version 64 bits Integers for BITMAP type (MERGED)

Spark DPP Module

3726 [Spark Load] Rollup Tree Builder

3728 [Spark Load] Using SparkDpp to complete some calculation in Spark Load

wyb commented 4 years ago

Resource manager

3418 [Spark load] Add resource manager (Merged)

Fe schedule job execution

3712 [Spark load][Fe 1/6] Add spark etl job config (Merged)

3718 [Spark load][Fe 2/6] Update push task thrift interface (Merged)

3715 [Spark load][Fe 3/6] Fe create job (Merged)

3819 [Spark load][Fe 4/6] Add hive external table and update hive table syntax in loadstmt (Merged)

3716 [Spark load][Fe 5/6] Fe submit spark etl job (Merged)

3717 [Spark load][Fe 6/6] Fe process etl and loading state job (Merged)

xy720 commented 4 years ago

Be handle push task

3742 [Spark load][Be 1/1] Be handle push task

Other

3878 [Spark load][broker load]Optimize reading parquet format file

apache / doris

[Spark load] Doris support Spark load #3433
