ISG-ICS / cloudberry

Big Data Visualization
http://cloudberry.ics.uci.edu

Elasticsearch Adapter #584

Closed tahoe01 closed 4 years ago

tahoe01 commented 6 years ago

Overview

Develop an adapter to support Elasticsearch as the data search engine of Cloudberry.

Related Evaluation

Issue #568

Codebase

es-adpater

Issue 1: Transaction

Description:

Creating a view table includes four queries in AsterixDB:

Expected behavior

Send only one request to Elasticsearch and have it complete all four queries above sequentially.

Problem

According to an Elastic blog post, Elasticsearch has no native support for transactions (a consecutive sequence of SQL statements executed as a unit).

  1. In our initial implementation, Elasticsearch does not always handle the requests in the order in which we post them. For example, Elasticsearch may handle the requests in the following order:

An error occurs when Elasticsearch handles the third request because the twitter index has been dropped.

  2. Our initial implementation first fetches the selected records from Elasticsearch into Cloudberry and then inserts them into the destination index in Elasticsearch. As a result, this single query requires two requests. In addition, Cloudberry incurs a high overhead when it has to process a large amount of data.

Solution

Loop through a query list. The query list consists of three queries sent in three requests:

Note: In the loop, we call Await.ready() after each request is posted to Elasticsearch, so the next request is not posted until the response to the previous one has arrived. This keeps the three requests in order.
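The blocking loop described in the note can be sketched roughly as follows. This is a minimal illustration, not the adapter's actual code: `fakePost`, `log`, and `runInOrder` are hypothetical names, and `fakePost` only simulates an HTTP POST to Elasticsearch so that the example runs without a cluster.

```scala
import scala.collection.mutable.ListBuffer
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

object SequentialRequests {
  // Records the order in which the simulated server handled requests.
  val log: ListBuffer[String] = ListBuffer.empty[String]

  // Hypothetical stand-in for an asynchronous HTTP POST to Elasticsearch.
  def fakePost(query: String): Future[String] = Future {
    Thread.sleep(20) // pretend the server needs some time
    log.synchronized { log += query }
    s"ack: $query"
  }

  // Post the queries one at a time. Await.ready blocks until the current
  // request has completed (successfully or not), so the next request is
  // never sent before the previous response has arrived.
  def runInOrder(queries: Seq[String],
                 post: String => Future[String],
                 timeout: Duration = 5.seconds): List[Try[String]] =
    queries.toList.map { query =>
      val response = post(query)
      Await.ready(response, timeout)
      response.value.get // defined, because the future has completed
    }
}
```

Calling `SequentialRequests.runInOrder(queries, SequentialRequests.fakePost)` returns one `Try` per query, in posting order. Blocking on every request trades throughput for ordering, which seems acceptable here because the sequence is short.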

Issue 2: Join (More details will be added later)

Description

When a user searches for a keyword on TwitterMap, a join query is sent to join the results with other datasets: state population, county population, and city population. The query results are used by the Normalization button on TwitterMap.

Problem

Performing full SQL-style joins in a distributed system like Elasticsearch is prohibitively expensive. Instead, Elasticsearch offers two forms of join that are designed to scale horizontally. According to the Elasticsearch documentation, even these joins are not recommended unless absolutely necessary, for example when the data contains a one-to-many relationship.

AsterixDB behavior

The JOIN query translated for AsterixDB is shown below. A subquery performs the aggregation first, and its result is then joined with the population table. This greatly reduces the cost of the join, because each aggregated row matches at most one population row (a one-to-one relationship).

select tt.`state` as `state`, tt.`count` as `count`, ll0.`population` as `population`
from (
  select `state` as `state`, coll_count(g) as `count`
  from twitter.ds_tweet t
  where t.`create_at` >= datetime('2018-01-02T00:00:00.000-0800')
    and t.`create_at` < datetime('2018-01-04T00:00:00.000-0800')
    and ftcontains(t.`text`, ['wang'], {'mode':'all'})
    and t.`geo_tag`.`stateID` in [ 37,51,24,11,10,34,42,9,44,48,35,4,40,6,20,32,8,49,12,22,28,1,13,45,5,47,21,29,54,17,18,39,19,55,26,27,31,56,41,46,16,30,53,38,25,36,50,33,23,2 ]
  group by t.geo_tag.stateID as `state` group as g
) tt
left outer join twitter.dsStatePopulation ll0 on ll0.`stateID` = tt.`state`
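To make the aggregate-then-join shape of this query concrete, here is a small in-memory sketch in Scala. It is only an illustration of the technique, not what AsterixDB executes: the `Tweet` and `Row` types and the function name are hypothetical, and the population map stands in for the population dataset.

```scala
object AggregateThenJoin {
  // Hypothetical, simplified record types for illustration only.
  final case class Tweet(stateID: Int, text: String)
  final case class Row(state: Int, count: Int, population: Option[Long])

  def countByStateWithPopulation(tweets: Seq[Tweet],
                                 population: Map[Int, Long]): Seq[Row] = {
    // Inner subquery: group the tweets by state and count each group.
    val counts: Map[Int, Int] =
      tweets.groupBy(_.stateID).map { case (state, group) => state -> group.size }

    // Left outer join: one lookup per aggregated row, not one per tweet.
    // A state without a population entry keeps its row with None.
    counts.toSeq.sortBy(_._1).map { case (state, count) =>
      Row(state, count, population.get(state))
    }
  }
}
```

Because the join runs after the aggregation, its input has at most one row per state, no matter how many tweets matched the keyword.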

Possible solutions

  1. Multi-search: Send two queries in one request using the Multi Search API of Elasticsearch, then merge the responses of the two queries.
  2. Data model denormalization: Add population data to each tweet. For example,
  3. Disable JOIN for the Elasticsearch adapter.
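For the multi-search option, the request body could be assembled along the following lines. This is a sketch under assumptions: the `Search` type and `ndjson` helper are hypothetical, and the index names and query bodies in the usage note are placeholders rather than the adapter's real ones. The Multi Search API (`_msearch`) takes newline-delimited JSON in which each search contributes a header line and a body line, and the whole body ends with a newline.

```scala
object MultiSearchBody {
  // One search: the target index plus a query body serialized as one JSON line.
  final case class Search(index: String, bodyJson: String)

  // Build the newline-delimited body for a Multi Search request:
  // a header line and a body line per search, each terminated by "\n".
  def ndjson(searches: Seq[Search]): String =
    searches.map { s =>
      require(!s.bodyJson.contains("\n"), "each query body must fit on one line")
      s"""{"index":"${s.index}"}""" + "\n" + s.bodyJson + "\n"
    }.mkString
}
```

A call might pass one `Search` for the tweet aggregation (for example a `terms` aggregation on `geo_tag.stateID`) and one for the population index, then send the result with `Content-Type: application/x-ndjson`. The responses come back in the same order as the searches, so the two results can then be merged on the state ID inside Cloudberry.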

Limitation

TODO

Finish by Winter 2019 Week 9:

Finish by Fall 2019:

byhow commented 5 years ago

Continue to work on Ingestion PR.

baiqiushi commented 4 years ago

Done.