Feature Request: Binary Search for leftmost or rightmost element

LouisJenkinsCS commented 6 years ago

If I have an array of duplicates, I would like to be able to choose whether or not I want to be able to deterministically search for either the first occurring or last occurring element I am searching for. For example, the below example provides unsatisfactory results.

use Search;

var arr1 = [1,2,3,4,5];
var arr2 = [1,2,3,3,3,4,5];
var arr3 = [1,2,3,3,3,3,3,4,5];

writeln(binarySearch(arr1, 3));
writeln(binarySearch(arr2, 3));
writeln(binarySearch(arr3, 3));

TIO

Produces the output...

(true, 3)
(true, 4)
(true, 5)

I would like to be able to easily specify, perhaps via a parameter, if I want the leftmost, rightmost, or any index that matches the element specified. Something like...

enum DuplicateAction {
   FIRST, LAST, ANY
}
proc binarySearch(
   Data: [?Dom], val, comparator: ?rec = defaultComparator, in lo = Dom.low, 
   in hi = Dom.high, param action : DuplicateAction = DuplicateAction.ANY
);

bradcray commented 6 years ago

I think this is a reasonable request, though I wouldn't make the argument a param as it's a case that seems unlikely to benefit from compile-time code specialization.

KING-SID commented 5 years ago

Hey @bradcray I see that you've marked this issue with the labels 'good first issue' and 'easy/straightforward'. I've spent some time learning Chapel through the docs on the main website and I'd like to start contributing to Chapel, could you guide me on how I can take on this issue?

bradcray commented 5 years ago

Hi @KING-SID:

I expect the implementation of this to be fairly straightforward. If you haven't found the code for the current binarySearch() yet, it lives here: https://github.com/chapel-lang/chapel/blob/master/modules/packages/Search.chpl#L160 Note that there are a few overloads at present, one for arrays with strided domains (e.g., 1..n by 2), one specialized to those without strided domains (e.g., 1..n... the common case, I'd guess), and one that generates an error message. I'm expecting it to be fairly easy to add a new optional argument to the existing procedures and get the different behaviors Louis would expect for the test cases he shows above.

I suspect that the most challenging aspect of this issue will be designing (and getting buy-in for) how the binarySearch() interface should change, simply because design is challenging by nature, and getting buy-in can sometimes be a challenge when people have different visions of how an interface should look. For example, what should the new argument be named? Should it be an enum as proposed above, or some other type? (e.g., an integer? a pair of booleans?). But this is a challenge worthy of taking on, otherwise nothing would ever improve.

How I normally approach (or suggest approaching) design questions like this is:

1) Look at other standard binary search library routines to see if you can find inspiration or a strong precedent to follow. For example, do the binary search libraries in Python, Java, C++, etc. have the ability to make decisions like this? If so, are there things about their design that you find attractive or unattractive? Do they all do the same thing (i.e., is there a standard approach that's common across languages), or do they differ? Can you find rationale for their choices, or feedback from users indicating whether the choices were considered attractive or not?

2) Once you have an idea (or a small number of ideas) about what the interface should look like, I'd put together a new issue with a title like "Design: proposal for selecting between duplicate binary search results", linking it to this issue by mentioning this one by number and/or mentioning that one here. Show your proposed change to the interface, give advantages or disadvantages of it relative to other approaches you considered, and ask for comments or feedback. Upon opening the issue, you can either mail the chapel-developers list asking for feedback or comments, or tag the team using the tag chapel-lang/core-contributors. I can help make sure people provide feedback on the design once we have one. [This could also be done as comments on this issue, but I typically find it cleaner to start a discussion around design from a fresh issue rather than an existing one].

Note that in saying this, I don't want to suggest that the design proposed here is bad or inappropriate in any way (that I'm aware of*), just that it pays to spend some time making sure the design has been reviewed and has buy-in before releasing it. This helps minimize the chances that it will have to change in the future if we realize that something's wrong or suboptimal about it, which would then likely break users' code.

Related to that, note that an advantage of adding an extra optional argument as Louis proposed here is that it should not break existing code since the argument doesn't need to be specified in codes that are already happy with the current interface.

Hope this is helpful. Let me know if you have further questions I can answer.

* = though, as noted above, I wouldn't make the argument a param as this will effectively make clones of the procedure for each value, which I don't think is necessary from an efficiency perspective and would potentially increase compile times.

KING-SID commented 5 years ago

Hey @bradcray Noted, thank you. I will start working on it! I'll let you know if any questions do pop up

KING-SID commented 5 years ago

So I will be tackling each task outlined above at a time. Task 1

Look at other standard binary search library routines to see if you can find inspiration or a strong precedent to follow. For example, do the binary search libraries in Python, Java, C++, etc. have the ability to make decisions like this? If so, are there things about their design that you find attractive or unattractive? Do they all do the same thing (i.e., is there a standard approach that's common across languages), or do they differ? Can you find rationale for their choices, or feedback from users indicating whether the choices were considered attractive or not?

I found some binary search methods in both Python and C++.

C++ In C++, they have 3 related functions: binary_search, lower_bound and upper_bound. binary_search simply checks if there is an instance but does not return an index of the instance. lower_bound can be used to find the first instance when duplicates exist (FIRST) upper_bound can be used to the last instance when duplicates exist (LAST)

Rationale: I believe that their general binary_search method does not return an index and is simply for checking whether or not there is an instance of the item is because this index is very random and unpredictable. It isn't really useful (in most cases) when you don't know where the pointer would be landing and I doubt that programmers would let computer decide what to select as they may instead prefer having a certain level of surety on the matter. It isn't predictable so it isn't very useful. Therefore, you either want the leftmost or the rightmost instance as that is predictable.

Python In Python, they have the bisect library with relevant functions being bisect, bisect_left and bisect_right. bisect and bisect_right are the same and they can be used to find the last instance when duplicates exist (LAST) bisect_left can be used to find the first instance when duplicates exist. there is no function for ANY instance for Python.

Rationale: I believe that they do not have a function to return any instance for the same argument above and as their bisect_left , bisect_rightand bisect functions also already check whether or not the item is in the list, having an ANY function would be unnecessary. So although they have three methods, it is either the leftmost instance or the rightmost instance that is returned.

For the cases above, they all had different functions. I think what would make Chapel's binarySearch stand out and powerful is if we did not approach it as C++ and Python by creating different functions, but instead we overloaded the function with optional parameters (as @LouisJenkinsCS said). Personally, I feel like it would look much cleaner this way given that all binarySearch is possible with just the one method called in different 'ways'.

Now here's the tricky part, should we even have the ANY option? Based on what I found after looking into Python and C++, it seems that in general, the ANY option is redundant. There are two arguments to disclude the ANY option:

The other binarySearch options already check if there is an instance of the item (2)
The index returned by the ANY option is random which may not be very useful. (Hence the reason for this issue in the first place)

Note: Chapel's current binarySearch methods already return a boolean based on whether it was found or not.

Now here are my arguments in support of the ANY option:

The philosophical argument: What about the cases for which a random index is required? Why are we trying to predetermine what the user should be able to do and not to do with Chapel? What if we created the ANY option and left it up to the users to decide whether they wanted a random index or not - give them the power to decide what they want to do with Chapel.
A practical argument: The ANY option would stop the code at the first instance of the item found whilst searching the list - this is different from it being the first instance of the item in the list. In this case, it is quite handy as it will be faster than the other methods and so is useful when the user simply just wants to quickly check whether the item exists in the list.

This would need to be a community decision. Hence, after putting together some designs, I will create a new issue referencing this one in which I will have all the details for each design laid out.

What are your thoughts on this?

LouisJenkinsCS commented 5 years ago

I am strongly in favor of the ANY option as it allows more data-parallel optimization. For example, imagine we have a Block or Cyclic distributed array, which would be expensive to shift to the left-most or right-most index if it happens to cross locale boundaries. One implementation could be to perform multiple binary-searches, one for each Block or strided domain.

coforall loc in Locales do on loc {
   coforall tid in 0..#here.maxTaskPar {
      // Setup per thread-per-locale stuff
      for i in subdomain(arr, loc.id, tid) {
         // Perform thread-per-locale binary search...
      } 
   }
}

This way it becomes embarrassingly parallel and is still correct; if a thread has a block or strided range that is out of bounds, it just stops; if there are multiple such ranges, the first to succeed can report their success. Stopping all other threads is another problem, but still an improvement.

bradcray commented 5 years ago

Nice analysis @KING-SID! I agree with your proposal to do this with one routine rather than several and tend to agree with Louis (and, reading between the lines, I think you as well?) that including the ANY option seems preferable (and, specifically, I would make it the default because it seems the cheapest to compute by definition).

I think you've correctly identified the next step as being to put together an issue proposing the interface you think we should implement and getting community comment on it (which I can help with).

KING-SID commented 5 years ago

Thanks @bradcray! Yup, I do! I will set up the issue in a bit once I have come up with designs. Do you think I should still propose 'control' options as well - without the ANY option and different methods?

LouisJenkinsCS commented 5 years ago

@KING-SID I think that's something to propose in the design, i.e proposing what both versions would look like.

KING-SID commented 5 years ago

@LouisJenkinsCS Alright cool then that's what I'll do

bradcray commented 5 years ago

I'd suggest making your issue focus on your proposed solution rather than a complete spanning of all possible designs. That said, I would put in some brief descriptions of other approaches and rationales for why you (we) don't think we want to take that approach. I just wouldn't give it equal time as the proposed approach itself.

KING-SID commented 5 years ago

The design I will propose includes adding an additional optional parameter for either the FIRST occurrence or the LAST occurrence. Not including this optional parameter will result in the binary search running as it presently does (basically the ANY option).

LouisJenkinsCS commented 5 years ago

I prefer having a default argument with where clauses to perform pattern matching.

// Default action: ANY
proc binarySearch(data, val, cmp, lo, hi, param direction = ANY) where direction == ANY
proc binarySearch(data, val, cmp, lo, hi, param direction = ANY) where direction == FIRST
proc binarySearch(data, val, cmp, lo, hi, param direction = ANY) where direction == LAST
// Non-Compile-Time version...
proc binarySearch(data, val, cmp, lo, hi, direction = ANY)

That way it can be invoked like such...

use Search;

var arr = [1,2,3,3,3,3,3,4,5];

writeln(binarySearch(arr, 3)); // Default to ANY (param)
writeln(binarySearch(arr, 3, Direction.FIRST));
writeln(binarySearch(arr, 3, Direction.LAST));

bradcray commented 5 years ago

I still don't understand what benefit you're hoping to get from the param, Louis. It seems incredibly unlikely to me that the specialization resulting from a param is going to cause any significant execution-time performance benefit, at the cost of compile-time code duplication and (in the proposal above) a more complicated implementation (four routines rather than one).

LouisJenkinsCS commented 5 years ago

Hm, I guess you're right. I was placing more emphasis on using pattern-matching where clauses than I should; personally I'd rather have it looks like this.

// Private function that performs highly-parallel binarySearch that implements 'ANY', likely similar
// to how I described above.
proc _binarySearch(...);
// Front-end user-facing public 'wrapper' function.
proc binarySearch(..., param direction) where direction == ANY {
   return _binarySearch(...);
}
proc binarySearch(..., param direction) where direction == FIRST {
   var (ix, found) = _binarySearch(...);
   if found {
      // Find index of left-most duplicate and return
   } else {
      // Return error
   }
}
proc binarySearch(..., param direction) where direction == LAST {
   var (ix, found) = _binarySearch(...);
   if found {
      // Find index of right-most duplicate and return
   } else {
      // Return error
   }
}
proc binarySearch(..., direction = Direction.ANY) {
   select direction {
      when Direciton.ANY do return binarySearch(..., Direction.ANY); // Param
      when Direction.FIRST do return binarySearch(..., Direction.FIRST); // Param
      when Direction.LAST do return binarySearch(..., Direction.LAST); // Param
   }
}

Yes you have 4 user-facing functions with 1 backend function that prevents code reuse, but I just prefer separating the logic into specific functions like this; the 3 param functions build on the main algorithm and just perform their own specific modifications and return that; plus this allows it to be more modular/extendible.

bradcray commented 5 years ago

This still seems overengineered to me, particularly if the user is only seeing one of these four functions in the public interface / documentation. I think the overlap between the three cases is much larger than the difference and find it hard to imagine a need to extend it much more than this.

KING-SID commented 5 years ago

Hey @LouisJenkinsCS given that setting the 'direction' is optional (optional argument), then do we really need to define an ANY ? Could not the optional arguments only pick between FIRST and LAST because ANY would be the default case if no direction was specified?

Please bear with my pseudo-code, I am still familiarising myself with Chapel, but this explains the design I am working on

binarySearch(array, value, first)
// first occurrence

binarySearch(array, value, last)
// last occurrence

binarySearch(array, value)
//any occurrence

LouisJenkinsCS commented 5 years ago

@bradcray

As I said, it's more a preference towards pattern-matching and modularity.

@KING-SID

Imagine the case where you want either FIRST, LAST, or ANY based on some run-time variable or compile-time configuration.

// Can be changed to Direction.FIRST or Direction.LAST
config param onDuplicate = Direction.ANY;
// ...
binarySearch(..., onDuplicate);

The above is legal, and if by 'optional' you mean it has a default value, then Direction.ANY is that default value. If you however wanted to invoke ANY vs FIRST or LAST, you'd need to branch based on which function you wanted (which feels unnecessary)

// Can be changed to Direction.FIRST or Direction.LAST
config param onDuplicate = Direction.ANY;
// ...
var ret : (eltType, bool);
if onDuplicate == ??? then
   ret = binarySearch(...);
else
   binarySearch(..., onDuplicate);

LouisJenkinsCS commented 5 years ago

One more thing in favor of having a more modular interface is that you may want to add more cases, such as serial.

enum Direction {
   ANY = 0,
   ANY_SERIAL = 1,
   FIRST = 2,
   FIRST_SERIAL = 3,
   LAST = 4,
   LAST_SERIAL = 5
}

In this case the user may want to say 'do this serially', which can be useful for not oversubscribing your system when you need each thread to perform a binary search.

bradcray commented 5 years ago

Chapel has a serial keyword... why not just rely on that rather than complicating the interface?

LouisJenkinsCS commented 5 years ago

1) There are optimizations you would make for sequential operations, such as not needlessly divide-and-conquer when you only have one thread, especially if you divide-and-conquer yourself. 2) It can result in deadlock when you assume things about how many tasks you can spawn at once, such as having barriers based on here.maxTaskPar or anything really.

proc parallelFn(x) {
    var customBarrier : atomic int;
    customBarrier.write(here.maxTaskPar);
    coforall tid in 1..here.maxTaskPar {
        var val = customBarrier.fetchSub(1);
        writeln("Task ", tid, " waiting with val ", val);
        customBarrier.waitFor(0);
        writeln("Task ", tid, " finished waiting...");
    }
}

serial do parallelFn(1);

TIO

Note: This can possibly be fixed if serial do writeln(here.maxTaskPar) always returned 1 since that is all that is available, but even if it did, it would just add extra code and logic.

Related note: Does serial transfer over when you do an on statement? I.E if you do serial do on Locales[1] do writeln(here.maxTaskPar), should this return 1 or should it be parallel?

bradcray commented 5 years ago

@LouisJenkinsCS: I'm really, really confused and am concerned that you're making this issue way more confusing and involved than I ever intended when I applied the easy / straightforward / good first issue labels. What kind of divide and conquer are you imagining applying to a binary search? ("The number we're looking for is less than the current value, so I'll go look productively in the lower half while you needlessly look in the upper half"?) Why would you ever use a custom barrier pattern in a divide and conquer or binary search algorithm anyway? I continue to feel like you're completely overengineering the solution that's necessary here and taking this issue way off track.

LouisJenkinsCS commented 5 years ago

Your response of 'Chapel has a serial keyword... why not just rely on that rather than complicating the interface?' made me think that this would be the suggested solution for constraining any parallel algorithm to running serially. My counter-example was directed more towards that assumption, but regardless you are right, this is getting off topic.

dgarvit commented 5 years ago

@bradcray @LouisJenkinsCS I have devised a solution in #12768. Kindly review.

bradcray commented 5 years ago

I won't have time to review this this week but will try to take a look next week. That said, I thought that @KING-SID was running with this issue, so also wants to make sure that he doesn't feel his toes have been stepped on. That said, it's been a month with no news... @KING-SID, any further progress here? If not, would you be interested in helping to review what @dgarvit has done?

soumyalahiri commented 4 years ago

@bradcray it's been almost a year since the last PR on this issue and it's still open. If no one is working on it right now, I'd like to look into this issue.

bradcray commented 4 years ago

I'm not aware of any need to start afresh on this issue independently of PR #12768. I think where things are stalled are (1) my not having the cycles to invest on the PR and (2) lack of discussion on the design (e.g., what should the enum values be named?). I think I'd suggest starting into a new effort rather than getting tangled up in this one, which is completely my fault for being stalled.

chapel-lang / chapel

Feature Request: Binary Search for leftmost or rightmost element #10761