@mparkhe any suggestions?
Hi,
Assumption: I am guessing you have used the column names “dim” and “mid” interchangeably. The reason I make this assumption is that CUBE with an INNER clause mandates that the data is partitioned on those INNER columns. More on this later.
Cubert is a physical language, so your script below translates to two map-reduce jobs.
JOB 1: partition the input data by the unique combination of the {“dim”, “session_id”} columns and store it as RUBIX blocks.
JOB 2: load the blocks partitioned on the INNER dimensions {“dim”, “session_id”} and compute cubes on the various dimension columns (the dimensions you have specified are log_date, app_name, app_platform, is_new, ...).
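For concreteness, here is a minimal sketch of what JOB 1 (the block generation job) could look like. The path, schema, block size, and job name are hypothetical placeholders, and the BLOCKGEN and STORE ... USING RUBIX spellings follow the examples in the Cubert documentation, so treat this as an illustration rather than a drop-in script:

JOB "blockgen on inner dimensions"
        REDUCERS 10;
        MAP {
                // hypothetical input path and schema
                data = LOAD "input_path" USING TEXT("schema": "STRING log_date, STRING app_name, STRING app_platform, INT is_new, STRING dim, STRING session_id, INT pv, INT bounce");
        }
        // shuffle command: partition the data into RUBIX blocks on the INNER columns
        BLOCKGEN data BY SIZE 1000000 PARTITIONED ON dim, session_id;
        STORE data INTO "blocks_path" USING RUBIX("overwrite": "true");
END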
The CUBE shuffle macro with INNER columns mandates that the data is partitioned on those columns, else the Cubert compiler will error out (throwing a PostConditionException). Using the COUNT_DISTINCT aggregation with CUBE exploits the “partitioned additive” aggregation property (http://linkedin.github.io/Cubert/concepts/grouping_sets.html?highlight=cube#additive-and-partitioned-additive-aggregations). This cube macro will get rewritten like so:
MAP {
        data = LOAD ...
        partial_cubes = CUBE data BY $list_of_dimensional_columns INNER dim, session_id
                AGGREGATES SUM(pv) AS pv,
                        COUNT_DISTINCT(dim) AS dim,
                        COUNT_DISTINCT(session_id) AS session_id,
                        SUM(bounce) AS bounce
                GROUPING SETS ...
}
SHUFFLE partial_cubes PARTITIONED ON dim, session_id; // uses combiner with aggregate SUM
REDUCE {
        cubes = GROUP partial_cubes BY $list_of_dimensional_columns
                AGGREGATES SUM(pv) AS pv,
                        SUM(dim) AS dim,
                        SUM(session_id) AS session_id,
                        SUM(bounce) AS bounce;
}
Notice that the reducer is summing up the partial cube counts. Said another way, what you are actually computing is COUNT_DISTINCT(dim, session_id) as a combination, and hence the two columns are reported with the same value.
To compute the count-distinct cubes for session_id and dim independently, you need to look at the correlation between the “dim” and “session_id” columns. If they are not related in any way, you will have to do this calculation separately.
One way of achieving this would be:
JOB 1: blockgen partitioned on dim
JOB 2: compute count_distinct(dim)
JOB 3: blockgen partitioned on session_id
JOB 4: compute count_distinct(session_id)
Note: the additive cubes on “pv” and “bounce” could be clubbed with either of the count-distinct cubes. If you need to bring the data together you could use a join, though arguably that join will run over a much smaller dataset.
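A rough sketch of JOB 1 and JOB 2 under this plan; JOB 3 and JOB 4 would mirror them with session_id in place of dim. As before, the paths, schema placeholder, block size, and job names are hypothetical, and the spellings are assumptions based on the Cubert docs and the rewrite shown above:

// JOB 1: blockgen partitioned on dim only
JOB "blockgen on dim"
        MAP {
                data = LOAD "input_path" USING TEXT("schema": "...");
        }
        BLOCKGEN data BY SIZE 1000000 PARTITIONED ON dim;
        STORE data INTO "blocks_by_dim" USING RUBIX("overwrite": "true");
END

// JOB 2: count_distinct(dim) cube; the additive SUM(pv) and SUM(bounce)
// cubes are clubbed into this job
JOB "cube on dim"
        MAP {
                data = LOAD "blocks_by_dim" USING RUBIX;
        }
        // valid here because the blocks are partitioned on the INNER column
        cubes = CUBE data BY log_date, app_name, app_platform, is_new INNER dim
                AGGREGATES COUNT_DISTINCT(dim) AS uv, SUM(pv) AS pv, SUM(bounce) AS bounce
                GROUPING SETS (log_date, app_name, app_platform), (log_date, app_name, app_platform, is_new);
        STORE cubes INTO "cubes_by_dim" USING AVRO("overwrite": "true");
END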
A potential optimization would be to compute the partial cubes (as described in the MR job above) for count_distinct(session_id) and count_distinct(dim) in two separate Cubert MR jobs, and then write a third Cubert job to bring it all together with a reduce-side join followed by a group-by.
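A sketch of that third job, assuming the two partial-cube jobs store their outputs as RUBIX blocks partitioned on the dimension columns, and assuming a Pig-style spelling for Cubert’s sort-merge JOIN operator (an assumption; check the operator reference for the exact form). Here pv and bounce are clubbed with the dim cube only, so the column names do not collide after the join:

JOB "combine partial cubes"
        MAP {
                // partial cubes: {dims, pv, uv, bounce} and {dims, visits}
                uv_cube = LOAD "partial_cubes_by_dim" USING RUBIX;
                visits_cube = LOAD "partial_cubes_by_session_id" USING RUBIX;
                // assumed sort-merge join of the co-partitioned blocks
                combined = JOIN uv_cube BY $list_of_dimensional_columns, visits_cube BY $list_of_dimensional_columns;
        }
        SHUFFLE combined PARTITIONED ON $list_of_dimensional_columns;
        REDUCE {
                // final group-by sums the partial counts into full cubes
                cubes = GROUP combined BY $list_of_dimensional_columns
                        AGGREGATES SUM(pv) AS pv, SUM(uv) AS uv, SUM(visits) AS visits, SUM(bounce) AS bounce;
        }
END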
Hope this helps resolve your query.
mani
Thank you very much, @mparkhe! Your explanation is very detailed and totally makes sense. What I need is to compute count_distinct(mid) and count_distinct(session_id) separately : )
By the way, if I had too many count_distinct(field) fields to be computed separately, I would need to apply BLOCKGEN to each of those fields, which means a lot of BLOCKGEN pre-processing. Would this slow down the whole process? Thanks again!!!
I think this issue is resolved, thanks~~
Hi, @mvarshney Maneesh Varshney:
Recently I have been researching Cubert.
I translated a SQL query into a Cubert script using the CUBE operator and GROUPING SETS.
Why do count_distinct(mid) and count_distinct(session_id) have the same result value after computing the cube grouping sets?
e.g.:
count_distinct(mid)    count_distinct(session_id)
500                    500
200                    200
Can anyone tell me whether I'm writing the script incorrectly?
Here is part of my code:
JOB1
Map{ data = LOAD xxx USING TEXT }
END
JOB2
Map {
data = LOAD "" USING RUBIX
}
cubes = CUBE data BY
    columns...
    INNER dim, session_id
    AGGREGATES SUM(pv) AS pv,
        COUNT_DISTINCT(mid) AS uv,
        COUNT_DISTINCT(session_id) AS visits,
        SUM(bounce) AS bounce
    GROUPING SETS
        (log_date, app_name, app_platform),
        (log_date, app_name, app_platform, is_new) ......