deva-rajan / hamake

Automatically exported from code.google.com/p/hamake
0 stars 0 forks source link

<foreach batch_size="..."> feature suggestion, with implementation #45

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
I'm working on using hamake to drive hive jobs (via <exec> and shell scripts, 
for now) and needed to be able to run
<foreach> jobs in batches for efficiency's sake.  In particular, this is useful 
in combination with hive's dynamic
partition inserts.

The attached patch is but one possible implementation of this functionality, 
selected mostly because it requires no
significant structural changes to the codebase.  To achieve this goal it 
extends the functionality of
Utils.replaceVariables() to allow for multi-valued (String[]) variables, and 
then uses this functionality in
Foreach.execute() to implement a new <foreach batch_size="..."> attribute.

Specifically, I have renamed the original function to 
Utils.replaceVariablesMultiValued() and then created a new
Utils.replaceVariables() that matches the original signature but that delegates 
its implementation to
Utils.replaceVariablesMultiValued().  The original implementation (now in 
Utils.replaceVariablesMultiValued()) has
been adjusted to detect String[] variable values and then return a list 
containing as many complete expansions of
the input string as there are elements in those String[] values.  An exception 
is thrown if the lengths of multiple
expanded String[] values do not match.

<foreach> now optionally takes a positive integer batch_size= attribute.  Not 
specifying the attribute is
tantamount to specifying batch_size="1".  If the batch_size is set to zero then 
there will be at most one task
execution, and it will include all input files newer than their corresponding 
output file(s).  If the batch size is
set to one (or omitted), the behavior is identical to the historical behavior: 
one task will be executed for each
input file that is newer than its corresponding output file.  If the batch size 
is set greater than one, then
ceil(N/M) tasks will be executed, where N is the number of input files newer 
than thier corresponding output
file(s), and M is the batch size.

A number of other implementation alternatives and alterations exist, from 
refactoring such that Tasks are directly
aware of batching to modifying the proposed implementation by allowing a 
concatFunc to be passed directly to
Utils.replaceVariables().  However, both of these examples and likely many 
other possibilities would have
tradeoffs and ramifications that extend beyond what I needed to achieve.

I'm posting this here in case it might be helpful for anyone else.  The 
attached patch is against trunk@372.

-peter

Original issue reported on code.google.com by petenewc...@gmail.com on 18 Jul 2011 at 12:58

Attachments:

GoogleCodeExporter commented 8 years ago
Vladimir please review. At first glance the patch looks OK.

Peter, thank you for the patch.

Original comment by kroko...@gmail.com on 18 Jul 2011 at 1:51

GoogleCodeExporter commented 8 years ago
Peter, I've reviewed and applied the patch. Later on, I will update project's 
documentation.

Thanks,
Vladimir

Original comment by v...@codeminders.com on 24 Jul 2011 at 5:57

GoogleCodeExporter commented 8 years ago
Thanks, Vladimir!

Don't know if I botched the original patch file or if there was bad interaction 
with other patches, but it seems that in hamakefile.xsd, the batch_size= 
attribute was added to <fold> instead of <foreach>.  The RNG version is correct.

Probably not necessary, but I've attached the patch required to fix.

-peter

Original comment by petenewc...@gmail.com on 25 Jul 2011 at 5:19

Attachments:

GoogleCodeExporter commented 8 years ago
Vladimir, have you beed able to commit fixed patch?
If yes, please close this ticket.

Original comment by kroko...@gmail.com on 29 Jul 2011 at 6:31

GoogleCodeExporter commented 8 years ago
Thank you, Peter. I've applied the last patch and updated documentation

Original comment by v...@codeminders.com on 1 Aug 2011 at 6:52