QCDIS / NaaVRE

BSD 3-Clause "New" or "Revised" License

Customize batch size in splitter #832

Open gpelouze opened 9 months ago

gpelouze commented 9 months ago

We want to allow users to configure either the batch size or the total number of splits generated by the splitter.
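
To make the batch-size option concrete, here is a minimal sketch of what it could look like. The name `split_by_batch_size` and its parameters are hypothetical and not part of NaaVRE; the sketch only assumes the splitter receives a plain Python list.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_by_batch_size(items: Sequence[T], batch_size: int) -> List[List[T]]:
    """Hypothetical helper: cut `items` into consecutive batches.

    Every batch holds `batch_size` elements, except possibly the last one,
    which holds whatever remains.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    # Step through the input in strides of `batch_size` and slice each batch
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


batches = split_by_batch_size(list(range(7)), 3)
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

With a fixed batch size the number of splits grows with the input, whereas fixing the number of splits lets the batch size grow instead; which one users need probably depends on whether memory per task or the number of parallel tasks is the constraint.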

gpelouze commented 4 months ago

We can use code from Berend: https://gist.github.com/gpelouze/7dcc95a4beb78276f324ff1ab3591a46

# New splitter with semaphore 
def rewrite_list_nested(in_list, semaphore):
    """
    B. Wijers 
    23-02-2024
    b.c.wijers (at) uva.nl
    for Lifewatch - NaaVRE

    Splitter
    input:
        in_list: list with elements
        TYPE: list
        semaphore: number of nested lists to produce
        TYPE: int
    output:
        list of nested lists; contains `semaphore` nested lists,
        or fewer if in_list has fewer elements than that
        TYPE: List[list]
    Notes:
    Will attempt to make an even split.
    If the elements cannot be split evenly, the remaining
    elements are distributed one each across the first
    nested lists, so the sizes of any two nested lists
    differ by at most 1 element.
    """
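    # Guard against zero or negative values, which would otherwise raise
    # ZeroDivisionError or may loop forever
    if semaphore < 1:
        raise ValueError("semaphore must be a positive integer")
    out_list = []
    # Get the length of the list
    in_list_len = len(in_list)
    # Expecting integers so we can use floor division
    # Determine the size of each even split
    worker_chunk_size = in_list_len // semaphore
    # Determine how many elements could not be split evenly
    leftovers = in_list_len % semaphore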
    while in_list:
        # Set leftover to 0
        leftover = 0
        # If we could not make even-split
        if leftovers:
            # Can not exceed one element addition per split
            leftover = 1
            # Update the number of leftovers
            leftovers -= leftover
        # Add the nested list. If there was a leftover, add it
        out_list.append(in_list[:worker_chunk_size+leftover])
        # Update the input list and remove the elements we added to 
        # output list
        in_list = in_list[worker_chunk_size+leftover:]
    return out_list
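
For the "total number of splits" option, the distribution rule described in the docstring can also be sketched with `divmod`. This is an illustrative variant, not the gist's code: the name `split_into_n` is hypothetical, and it is written to return the same nested lists as the function above for positive split counts.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_n(items: Sequence[T], n_splits: int) -> List[List[T]]:
    """Hypothetical helper: split `items` into at most `n_splits` lists.

    The first `len(items) % n_splits` lists receive one extra element,
    so the sizes of any two lists differ by at most 1.
    """
    if n_splits < 1:
        raise ValueError("n_splits must be a positive integer")
    base, extra = divmod(len(items), n_splits)
    out: List[List[T]] = []
    start = 0
    for i in range(n_splits):
        # The first `extra` lists take one leftover element each
        size = base + (1 if i < extra else 0)
        if size == 0:
            # Fewer elements than requested splits: stop instead of
            # emitting empty lists
            break
        out.append(list(items[start:start + size]))
        start += size
    return out


print(split_into_n(list(range(10)), 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(split_into_n(list(range(2)), 5))   # [[0], [1]]
```

Tracking a `start` index instead of re-slicing the remaining input on every iteration avoids copying the tail of the list repeatedly, which may matter for long inputs.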