ISG-ICS / cloudberry

Big Data Visualization
http://cloudberry.ics.uci.edu
90 stars 82 forks source link

Lookup Functionality #713

Closed baiqiushi closed 5 years ago

baiqiushi commented 5 years ago

The cloudberry middleware is able to get information from multiple datasets in one request. Similar to joins in SQL, multiple datasets can be joined using a joinKey. The JSON request sent to Cloudberry should specify the lookup dataset, the join key and the field that is to be looked up.

Example:

 "dataset":"twitter.ds_tweet",
 "lookup": [
    {
      "joinKey":["geo_tag.stateID"],
      "dataset":"twitter.US_population",
      "lookupKey":["stateId"],
      "select":["population"],
      "as" : ["population"]
    }
   ],
 "filter":[
    {
      "field":"text",
      "relation":"contains",
      "values":["zika","virus"}]
    }
  ],
  "select": {
    "order" : [],
    "limit" : 0,
    "offset" : 0,
    "field" : ["population"]
  }
}

The JSONParser will convert this into a Query object with appropriate Statement objects including the LookupStatement object. The Generator class (AQLGenerator, SQLGenerator, etc.) has a parsequery functions which will take this query object and transform it into appropriate query string for the backend. The lookup statement is parsed by parseLookup function.

The parseLookup function in the AQLGenerator, will generate a for - where - return subquery for the lookup variable every lookup Statement and append it to the list of produced variable which will be then used by more functions to select fields.

Example:

for $t in dataset twitter.ds_tweet
where similarity-jaccard(word-tokens($t.'text'), word-tokens('zika')) > 0.0
and contains($t.'text', "virus")
limit 0
offset 0
return
{ 'text': $t.'text', 'geo_tag.countyID': $t.'geo_tag'.'countyID', 'favorite_count': $t.'favorite_count', 'in_reply_to_user': $t.'in_reply_to_user', 'geo_tag.cityID': $t.'geo_tag'.'cityID', 'coordinate': $t.'coordinate', 'is_retweet': $t.'is_retweet', 'user.id': $t.'user'.'id', 'lang': $t.'lang', 'population': (for $l0 in dataset twitter.US_population
where $t.'geo_tag'.'stateID' /* +indexnl */ = $l0.stateId
return $l0.population)[0], 'in_reply_to_status': $t.'in_reply_to_status', 'user_mentions': $t.'user_mentions', 'create_at': $t.'create_at', 'user.status_count': $t.'user'.'status_count', 'geo_tag.stateID': $t.'geo_tag'.'stateID', 'retweet_count': $t.'retweet_count', 'id': $t.'id', 'geo': $t, 'hashtags': $t.'hashtags'}

Ongoing Work

The current implementation of the parseLookup generates a query that performs lookup on the secondary dataset for every record in the primary dataset. However, we might want to first aggregate the dataset on some field and then perform a lookup. This will reduce the number of lookups and improve performance.

To implement this, we need to add lookup JSON inside the group JSON. This will allow us to perform lookup after the aggregation has been applied in the parseGroupBy function. The new group statement will be something as follows

"group": {
    "by": [{
        "field": "geo.state",
        "as": "state"
    }],
    "aggregate": [{
        "field": "*",
        "apply": "count"
    }, {
        "field": "population",
        "apply": "sum",
        "as": "population"
    }],
    "lookup": [{
        "joinKey":"geo.state",
          "dataset":"US_population",
          "lookupKey":"city",
          "select":"population"
    }]
}

The parseGroupBy function should generate a statement that performs lookup after groupby.

295 #194

baiqiushi commented 5 years ago

Query API description added into Wiki page: Query Cloudberry