Qihoo360 / XSQL

Unified SQL Analytics Engine Based on SparkSQL
https://qihoo360.github.io/XSQL/
Apache License 2.0

[CORE][ElasticSearch][Mongo] Support discover nested type for ElasticSearch. #70

Closed: beliefer closed this 4 years ago

beliefer commented 4 years ago

What changes were proposed in this pull request?

The ElasticSearch index log contains a type named details. The mapping of details is:

curl -H 'Content-Type: application/json' -XGET 'http://127.0.0.1:9228/log/details/_mapping?pretty'
{
  "log" : {
    "mappings" : {
      "details" : {
        "dynamic" : "strict",
        "_all" : {
          "enabled" : false
        },
        "properties" : {
          "hasLog" : {
            "type" : "boolean"
          },
          "logDetail" : {
            "type" : "nested",
            "dynamic" : "strict",
            "properties" : {
              "confidence" : {
                "type" : "keyword"
              },
              "danger" : {
                "type" : "keyword"
              },
              "dbt" : {
                "type" : "date",
                "format" : "yyyy-MM-dd HH:mm:ss"
              },
              "detectNotify" : {
                "type" : "boolean"
              },
              "virusName" : {
                "type" : "text",
                "analyzer" : "virusinfo"
              }
            }
          },
          "md5" : {
            "type" : "keyword"
          },
          "sha1" : {
            "type" : "keyword"
          },
          "sha256" : {
            "type" : "keyword"
          }
        }
      }
    }
  }
}

The property logDetail uses the nested type. XSQL currently reports the ElasticSearch nested type as nested, a name that is used only for display. For example:

spark-xsql> desc details;
20/01/22 14:49:16 INFO SparkXSQLShell: current SQL: desc details
20/01/22 14:49:16 WARN SparkXSQLShell: hive.cli.print.header not configured, so doesn't print colum's name.
hasLog  boolean NULL
logDetail       nested  NULL
md5     keyword NULL
sha1    keyword NULL
sha256  keyword NULL
Time taken: 0.269 s

The type of column logDetail is displayed as nested, but Spark cannot use this type. We need to convert the ElasticSearch nested type to StructType or ArrayType(StructType) so that queries can reference the sub-properties of a nested field, for example:

select hasLog, logDetail.confidence from details limit 5;
select logDetail.confidence from details group by logDetail.confidence;
select * from details where logDetail.confidence[0]="HEURISTIC";
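
To make the intent of these queries concrete, here is a plain-Python sketch of how field access and indexing behave once logDetail is an array of structs. The rows, the "EXACT" value, and the helper names field_of and element_at_index are made up for illustration; they are not data from the log index and not part of XSQL or Spark.

```python
# Made-up rows for illustration only; not data from the real `log` index.
rows = [
    {"hasLog": True,
     "logDetail": [{"confidence": "HEURISTIC"}, {"confidence": "EXACT"}]},
    {"hasLog": True, "logDetail": [{"confidence": "EXACT"}]},
    {"hasLog": False, "logDetail": []},
]


def field_of(structs, name):
    """Mimic logDetail.<name>: project one field out of every struct in the array."""
    return [struct.get(name) for struct in structs]


def element_at_index(values, index):
    """Mimic values[index]; None stands in for SQL NULL when the index is out of range
    (an assumption of this sketch)."""
    return values[index] if 0 <= index < len(values) else None


# Rough equivalent of:
#   select * from details where logDetail.confidence[0]="HEURISTIC";
matches = [
    row for row in rows
    if element_at_index(field_of(row["logDetail"], "confidence"), 0) == "HEURISTIC"
]

for row in matches:
    print(row)
```

Accessing a field on an array of structs yields an array of that field's values, which is why the third query indexes the result with [0].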

This PR resolves the issue by parsing nested into a data type that Spark can use. After the change:

spark-xsql> desc details;
20/01/22 15:07:09 INFO SparkXSQLShell: current SQL: desc details
20/01/22 15:07:13 WARN SparkXSQLShell: hive.cli.print.header not configured, so doesn't print colum's name.
hasLog  boolean NULL
logDetail       array<struct<confidence:string,danger:string,dbt:string,detectNotify:boolean,virusName:string>> NULL
md5     string  NULL
sha1    string  NULL
sha256  string  NULL
Time taken: 3.768 s
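
The conversion visible in the output above can be approximated with a short sketch. This is not the code from this PR: the function names spark_type and describe are hypothetical, the scalar table is inferred only from the desc output (keyword, text and date all appear as string), and mapping nested to array<struct<...>> and a plain object to struct<...> is an assumption of the sketch.

```python
# Hypothetical sketch: derive Spark SQL type strings from an Elasticsearch mapping.
# The scalar table is inferred from the `desc details` output, not from XSQL's source.
SCALAR_TYPES = {
    "keyword": "string",
    "text": "string",
    "date": "string",      # the output reports dbt as string
    "boolean": "boolean",
}


def spark_type(field):
    """Return a Spark SQL type string for one Elasticsearch field definition."""
    es_type = field.get("type", "object")  # a field with only "properties" is an object
    if es_type in ("nested", "object"):
        members = ",".join(
            f"{name}:{spark_type(sub)}"
            for name, sub in field.get("properties", {}).items()
        )
        struct = f"struct<{members}>"
        # Assumption: a nested field holds a list of objects, so wrap it in an array.
        return f"array<{struct}>" if es_type == "nested" else struct
    return SCALAR_TYPES[es_type]  # raises KeyError for types this sketch does not cover


def describe(properties):
    """Map every top-level property name to its Spark SQL type string."""
    return {name: spark_type(field) for name, field in properties.items()}


# The "properties" section of the `details` mapping shown earlier.
details_properties = {
    "hasLog": {"type": "boolean"},
    "logDetail": {
        "type": "nested",
        "dynamic": "strict",
        "properties": {
            "confidence": {"type": "keyword"},
            "danger": {"type": "keyword"},
            "dbt": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
            "detectNotify": {"type": "boolean"},
            "virusName": {"type": "text", "analyzer": "virusinfo"},
        },
    },
    "md5": {"type": "keyword"},
    "sha1": {"type": "keyword"},
    "sha256": {"type": "keyword"},
}

for name, type_string in describe(details_properties).items():
    print(f"{name}\t{type_string}")
```

The recursion handles nested fields inside nested fields, since each sub-property is converted by the same function.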

Note: This PR references https://github.com/apache/spark/pull/23353

How was this patch tested?

No unit tests (UT) were added.

beliefer commented 4 years ago

@WeiWenda Thank you!