Bug in `lookup_contactbyname` algorithm

The algorithm that suggests contacts has a weakness.

When we search for a contact such as "Max Mustermann", we get a wrong probability if there are other contacts with partially matching but shorter names, such as "Max" or "Mustermann".

The problem is that the number of matches of all single name mutations with the same contact has no influence on the probability so far.

Also, I think that squaring the $new_probability should be done before comparing it to the old $probability. Otherwise, there could be a small but unintended shift in each iteration.

I did my best to find a better algorithm and if you like it, I can do a PR.

Here is my code for testing:

<?php

/**
 * Fake api function for testing.
 */
function fake_civicrm_api3($entity, $action, $params)
{
    switch ($params['name']) {
        case "Max":
            return [
                "values" => [
                    ["id" => 1, "sort_name" => "Mustermann, Max"],
                    ["id" => 2, "sort_name" => "Mustardman, Max"],
                    ["id" => 3, "sort_name" => "Max"],
                ]
            ];
        case "Mustermann":
            return [
                "values" => [
                    ["id" => 1, "sort_name" => "Mustermann, Max"],
                    ["id" => 4, "sort_name" => "Mustermann, Marc"],
                    ["id" => 5, "sort_name" => "Mustermann"],
                ]
            ];
        case "Marie":
            return [
                "values" => [
                    ["id" => 6, "sort_name" => "Zimmermann, Marie"],
                    ["id" => 7, "sort_name" => "Meier, Marie"],
                ]
            ];
        default:
            return ["values" => []];
    }
}

/**
 * Original algorithm
 */
function _original_civicrm_api3_banking_lookup_contactbyname_api($name_mutations, $params) {
    $contacts_found = array();
    // query quicksearch for each combination
    foreach ($name_mutations as $name_mutation) {
        $result = fake_civicrm_api3('Contact', 'getquick', array('name' => $name_mutation));
        foreach ($result['values'] as $contact) {
            // get the current maximum similarity...
            if (isset($contacts_found[$contact['id']])) {
                $probability = $contacts_found[$contact['id']];
            } else {
                $probability = 0.0;
            }

            // now, we'll have to find the maximum similarity with any of the name mutations
            $compare_name = strtolower($contact['sort_name']);
            foreach ($name_mutations as $name_mutation) {
                $new_probability = 0.0;
                similar_text(strtolower($name_mutation), $compare_name, $new_probability);
                //error_log("Compare '$name_mutation' to '".$contact['sort_name']."' => $new_probability");
                $new_probability /= 100.0;
                if ($new_probability > $probability) {
                    // square value for better distribution, multiply by 0.999 to avoid 100% match based on name
                    $probability = $new_probability * $new_probability * 0.999;
                }
            }

            $contacts_found[$contact['id']] = $probability;
        }
    }

    return $contacts_found;
}

/**
 * New and improved algorithm
 */
function _new_civicrm_api3_banking_lookup_contactbyname_api($name_mutations, $params) {
    $contacts_found = [];
    $occurrences = [];
    // query quicksearch for each combination
    foreach ($name_mutations as $name_mutation) {
        $result = fake_civicrm_api3('Contact', 'getquick', array('name' => $name_mutation));
        foreach ($result['values'] as $contact) {
            $occurrences[$contact['id']] = isset($occurrences[$contact['id']]) ? $occurrences[$contact['id']] + 1 : 1;
            // get the current maximum similarity...
            if (isset($contacts_found[$contact['id']])) {
                $probability = $contacts_found[$contact['id']];
            } else {
                $probability = 0.0;
            }

            // now, we'll have to find the maximum similarity with any of the name mutations
            $compare_name = strtolower($contact['sort_name']);
            foreach ($name_mutations as $name_mutation) {
                $new_probability = 0.0;
                similar_text(strtolower($name_mutation), $compare_name, $new_probability);
                //error_log("Compare '$name_mutation' to '".$contact['sort_name']."' => $new_probability");
                $new_probability /= 100.0;
                // square value for better distribution, multiply by 0.999 to avoid 100% match based on name
                $new_probability = $new_probability * $new_probability * 0.999; // Do this before comparison with the old $probability
                if ($new_probability > $probability) $probability = $new_probability;
            }
            $contacts_found[$contact['id']] = $probability;
        }
    }

    $max_occurrence = max($occurrences);
    foreach ($contacts_found as $contact_id_found => $contact_probability) {
        $contacts_found[$contact_id_found] = $contact_probability * $occurrences[$contact_id_found] / $max_occurrence;
    }
    return $contacts_found;
};

$name_mutations = ["Max", "und", "Marie", "Mustermann"];

$results_original = _original_civicrm_api3_banking_lookup_contactbyname_api($name_mutations, []);
$results_new = _new_civicrm_api3_banking_lookup_contactbyname_api($name_mutations, []);

foreach ($results_original as $contact_id => $probability_original) {
    $probability_new = $results_new[$contact_id];
    print("[$contact_id => ['probability_original' => $probability_original], ['probability_new' => $probability_new]],\n");
}

Result:

[1 => ['probability_original' => 0.63936], ['probability_new' => 0.63936]],
[2 => ['probability_original' => 0.4091904], ['probability_new' => 0.2045952]],
[3 => ['probability_original' => 0.999], ['probability_new' => 0.4995]],
[6 => ['probability_original' => 0.26859259259259], ['probability_new' => 0.1342962962963]],
[7 => ['probability_original' => 0.20640495867769], ['probability_new' => 0.17283737024221]],
[4 => ['probability_original' => 0.59112426035503], ['probability_new' => 0.29556213017751]],
[5 => ['probability_original' => 0.999], ['probability_new' => 0.4995]],

Project60 / org.project60.banking

Bug in `lookup_contactbyname` algorithm #403

Contact	Original Probability	New Probability	Ranking old	Ranking new
"Mustermann, Max"	0.63936	0.63936	2	1
"Mustardman, Max"	0.4091904	0.2045952	4	4
"Max"	0.999	0.4995	1	2
"Zimmermann, Marie"	0.26859259259259	0.1342962962963	6	6
"Meier, Marie"	0.20640495867769	0.17283737024221	5	5
"Mustermann, Marc"	0.59112426035503	0.29556213017751	3	3
"Mustermann"	0.999	0.4995	1	2