When implement the hashtag tendency chart moudle #558, I want to write a query which only giving result of a specific hashtag, but it failed.
I add a filter contains hashtagName to the query: filter.push( { field: "hashtags", relation: "contains", values: [hashtagName] }. The datatype of hashtags is bag, and according to the document of cloudberry, it should support the filter contains.
But the server give an error:
And check the source code AsterixQueryGenerator.scala: 177, we find that the "contains" filter of datatype "bag" is not implemented in Cloudberry now.
Solution
For task #558, temporary solution is using javascript to do the filter job, but this solution is not possible in big dataset. Therefore, we need to solve this problem, and then continue to work on issue #558.
For this issue, the solution is to modify cloudberry, so that it can support filter of string datatype after unnest is applied to bag datatype. There are two tasks:
In the current Cloudberry, the datatype of "bag" does not change after unnest it. We should change the datatype to "string" after unnest the "bag" datatype.
If we use unnest: [{ hashtags: "tag" }], the query generated by Cloudberry to AsterixDB using "unnest0" as the name of new column instead of "tag". Therefore, we should modify the query generator to let it rename the new column as specified.
(Note: after reading source code, I find that this is not a issue, cloudberry will help to change specified name "tag" to "unnest0" when generating query. )
Experiment
The json send to cloudberry is
The query sent to AsterixDB by cloudberry is select `month` as `month`,coll_count(g) as `count` from twitter.ds_tweet_b5c0b187fe309af0f4d35982fd961d7e t unnest t.`hashtags` `unnest0` where not(is_null(t.`hashtags`)) and t.`create_at` >= datetime('2017-01-24T08:00:00.000Z') and t.`create_at` < datetime('2018-01-04T08:00:00.000Z') and t.`geo_tag`.`stateID` in [ 37,51,24,11,10,34,42,9,44,48,35,4,40,6,20,32,8,49,12,22,28,1,13,45,5,47,21,29,54,17,18,39,19,55,26,27,31,56,41,46,16,30,53,38,25,36,50,33,23,2 ] and `unnest0`="love" group by get_interval_start_datetime(interval_bin(t.`create_at`, datetime('1990-01-01T00:00:00.000Z'), year_month_duration("P1M") )) as `month` group as g;
The query result is
To Do
[x] Get familiar with the cloudberry source code.
[x] Change the datatype to string after unnest the bag datatype.
[x] Rewrite query, and solve problem in issue #558 "hashtag".
Background
When implement the hashtag tendency chart moudle #558, I want to write a query which only giving result of a specific hashtag, but it failed.
I add a filter contains
hashtagName
to the query:filter.push( { field: "hashtags", relation: "contains", values: [hashtagName] }
. The datatype of hashtags is bag, and according to the document of cloudberry, it should support the filter contains.But the server give an error:
And check the source code AsterixQueryGenerator.scala: 177, we find that the "contains" filter of datatype "bag" is not implemented in Cloudberry now.
Solution
For task #558, temporary solution is using javascript to do the filter job, but this solution is not possible in big dataset. Therefore, we need to solve this problem, and then continue to work on issue #558.
For this issue, the solution is to modify cloudberry, so that it can support filter of string datatype after unnest is applied to bag datatype. There are two tasks:
unnest: [{ hashtags: "tag" }]
, the query generated by Cloudberry to AsterixDB using "unnest0" as the name of new column instead of "tag". Therefore, we should modify the query generator to let it rename the new column as specified. (Note: after reading source code, I find that this is not a issue, cloudberry will help to change specified name "tag" to "unnest0" when generating query. )Experiment
The json send to cloudberry is
The query sent to AsterixDB by cloudberry is
select `month` as `month`,coll_count(g) as `count` from twitter.ds_tweet_b5c0b187fe309af0f4d35982fd961d7e t unnest t.`hashtags` `unnest0` where not(is_null(t.`hashtags`)) and t.`create_at` >= datetime('2017-01-24T08:00:00.000Z') and t.`create_at` < datetime('2018-01-04T08:00:00.000Z') and t.`geo_tag`.`stateID` in [ 37,51,24,11,10,34,42,9,44,48,35,4,40,6,20,32,8,49,12,22,28,1,13,45,5,47,21,29,54,17,18,39,19,55,26,27,31,56,41,46,16,30,53,38,25,36,50,33,23,2 ] and `unnest0`="love" group by get_interval_start_datetime(interval_bin(t.`create_at`, datetime('1990-01-01T00:00:00.000Z'), year_month_duration("P1M") )) as `month` group as g;
The query result is
To Do
Get familiar with the cloudberry source code.Change the datatype to string after unnest the bag datatype.Rewrite query, and solve problem in issue #558 "hashtag".