LinkedInAttic / Cubert

Fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop
http://linkedin.github.io/Cubert/
Apache License 2.0

COUNT_DISTINCT computed same value of two dimension key #6

Closed OopsOutOfMemory closed 9 years ago

OopsOutOfMemory commented 9 years ago

Hi @mvarshney (Maneesh Varshney),

Recently I have been looking into Cubert.

I translated a SQL query into a Cubert script using the CUBE operator with GROUPING SETS.

Why do COUNT_DISTINCT(mid) and COUNT_DISTINCT(session_id) produce the same result values after computing the cube grouping sets?

e.g.:

COUNT_DISTINCT(mid)    COUNT_DISTINCT(session_id)
500                    500
200                    200

Can anyone tell me whether I'm writing the script incorrectly?

Here is part of my code:

    JOB1
        MAP { data = LOAD xxx USING TEXT }
        BLOCKGEN data BY SIZE 1000000 PARTITIONED ON mid, session_id;
        STORE data INTO "/cubert/temp" USING RUBIX("overwrite":"true");
    END

    JOB2
        MAP {
            data = LOAD "" USING RUBIX
        }
        CUBE data BY
            columns...
            INNER dim, session_id
        AGGREGATES SUM(pv) AS pv,
            COUNT_DISTINCT(mid) AS uv,
            COUNT_DISTINCT(session_id) AS visits,
            SUM(bounce) AS bounce
        GROUPING SETS
            (log_date, app_name, app_platform),
            (log_date, app_name, app_platform, is_new) ......

OopsOutOfMemory commented 9 years ago

@mparkhe any suggestions?

mparkhe commented 9 years ago

Hi,

Assumptions: I am guessing you have used the column names "dim" and "mid" interchangeably. The reason I make this assumption is that CUBE with an INNER clause mandates that the data is partitioned on those INNER columns. More on this later.

Cubert is a physical language, meaning your script above translates directly to two MapReduce jobs:

JOB 1: partition the input data by unique combinations of the {"dim", "session_id"} columns and store them as RUBIX blocks.
JOB 2: load the blocks partitioned on the INNER dimensions {"dim", "session_id"} and compute cubes on the various dimension columns (the dimensions you have specified are log_date, app_name, app_platform, is_new, ...).

A CUBE shuffle macro with INNER columns mandates that the data is partitioned on those columns; otherwise the Cubert compiler will error out (throwing a PostConditionException). Using the COUNT_DISTINCT aggregation with CUBE exploits the "Partitioned Additive" aggregation property (http://linkedin.github.io/Cubert/concepts/grouping_sets.html?highlight=cube#additive-and-partitioned-additive-aggregations). This cube macro will get rewritten like so:

    MAP {
        data = LOAD …
        partial_cubes = CUBE data BY $list_of_dimensional_columns
            INNER dim, session_id
            AGGREGATES SUM(pv) AS pv,
                COUNT_DISTINCT(dim) AS dim,
                COUNT_DISTINCT(session_id) AS session_id,
                SUM(bounce) AS bounce
            GROUPING SETS …
    }
    SHUFFLE partial_cubes PARTITIONED ON dim, session_id; // uses combiner with aggregate SUM
    REDUCE {
        cubes = GROUP partial_cubes BY $list_of_dimensional_columns
            AGGREGATES SUM(pv) AS pv,
                SUM(dim) AS dim,
                SUM(session_id) AS session_id,
                SUM(bounce) AS bounce;
    }

Notice that the reducer is summing up the partial cube counts. Said another way, what you are actually computing is COUNT_DISTINCT(dim, session_id) as a combination, and hence the two columns are reported as the same value.
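To see concretely why the two columns converge, here is a toy Python model (not Cubert code; the sample data, block count, and hash partitioner are all made up for illustration) of the per-block COUNT_DISTINCT followed by the reduce-side SUM shown above:

    from collections import defaultdict
    import zlib

    # Toy click log where, as in real traffic, each session_id belongs to
    # exactly one mid: 2 distinct mids, 5 distinct sessions, 5 distinct pairs.
    rows = [("m1", "s1"), ("m1", "s2"), ("m1", "s3"),
            ("m2", "s4"), ("m2", "s4"), ("m2", "s5")]

    def block_of(key, num_blocks=8):
        # Deterministic stand-in for the hash partitioner behind BLOCKGEN.
        return zlib.crc32(repr(key).encode()) % num_blocks

    def summed_partial_distinct(rows, partition_key, count_key):
        """Per-block COUNT_DISTINCT, then a reduce-side SUM of the partials."""
        blocks = defaultdict(set)
        for row in rows:
            blocks[block_of(partition_key(row))].add(count_key(row))
        return sum(len(s) for s in blocks.values())

    mid, session, pair = (lambda r: r[0]), (lambda r: r[1]), (lambda r: r)

    # Partitioned on the counted column itself ("partitioned additive"):
    # every distinct value lives in exactly one block, so the sum is exact.
    print(summed_partial_distinct(rows, mid, mid))          # 2 (exact)
    print(summed_partial_distinct(rows, session, session))  # 5 (exact)

    # Partitioned on the (mid, session_id) pair, as in the script above:
    # a mid is re-counted in every block its sessions scatter into, so the
    # mid sum is inflated from the true 2 toward the distinct-pair count (5),
    # which is exactly what the session sum already reports here.
    print(summed_partial_distinct(rows, pair, mid))
    print(summed_partial_distinct(rows, pair, session))

Both sums drift toward the number of distinct (mid, session_id) pairs, which is why the question's output shows identical values (500/500, 200/200) for the two aggregates.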

To compute COUNT_DISTINCT cubes for session_id and dim independently, you need to look at the correlation between the "dim" and "session_id" columns. If they are not related in any way, you will have to do the calculations separately.

One way of achieving this would be:

JOB 1: blockgen partitioned on dim
JOB 2: compute COUNT_DISTINCT(dim)
JOB 3: blockgen partitioned on session_id
JOB 4: compute COUNT_DISTINCT(session_id)

Note: the additive cubes on "pv" and "bounce" could be clubbed with either of the cube jobs. If you need to bring the data together you could use a join, though arguably that will be over a much smaller dataset. A sketch of this plan follows below.
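As a sketch of that plan (again a hypothetical Python model with invented sample rows and column positions, not actual Cubert jobs), each distinct count is computed over data partitioned on its own column, and the two result cubes are joined on the grouping columns at the end:

    from collections import defaultdict

    # rows: (log_date, app_name, app_platform, mid, session_id)
    rows = [
        ("2015-06-18", "app", "ios",     "m1", "s1"),
        ("2015-06-18", "app", "ios",     "m1", "s2"),
        ("2015-06-18", "app", "android", "m2", "s3"),
        ("2015-06-18", "app", "android", "m2", "s3"),
    ]

    group_key = lambda r: r[:3]  # (log_date, app_name, app_platform)

    def distinct_per_group(rows, value):
        """Model of one blockgen+cube pair: because the data would be
        partitioned on `value`, summing per-block distincts is exact,
        so the result equals a direct per-group distinct count."""
        seen = defaultdict(set)
        for r in rows:
            seen[group_key(r)].add(value(r))
        return {k: len(v) for k, v in seen.items()}

    # JOB 1 + JOB 2: blockgen on mid, then cube COUNT_DISTINCT(mid) AS uv.
    uv = distinct_per_group(rows, lambda r: r[3])
    # JOB 3 + JOB 4: blockgen on session_id, then COUNT_DISTINCT(session_id) AS visits.
    visits = distinct_per_group(rows, lambda r: r[4])
    # (The additive SUM(pv)/SUM(bounce) aggregates are omitted; they could
    # ride along with either pass.)

    # Final join on the grouping columns brings the two cubes together.
    for k in uv:
        print(k, {"uv": uv[k], "visits": visits[k]})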

A potential optimization would be to compute the partial cubes (as described in the MR job above) for COUNT_DISTINCT(session_id) and COUNT_DISTINCT(dim) in two separate Cubert MR jobs, and then write a third Cubert job that brings it all together with a reduce-side join followed by a GROUP BY.

Hope this helps resolve your query.

mani


OopsOutOfMemory commented 9 years ago

Thank you very much, @mparkhe! Your explanation is very detailed and totally makes sense. What I need is to compute COUNT_DISTINCT(mid) and COUNT_DISTINCT(session_id) separately :)

By the way, what if I had many COUNT_DISTINCT(field) fields that need to be computed separately? I would have to apply BLOCKGEN for each field, which means a lot of BLOCKGEN pre-processing. Would that slow down the whole process? Thanks again!

OopsOutOfMemory commented 9 years ago

I think this issue is resolved, thanks~~