[KYC Match] Scoring - GitHub Issues

ToshiWakayama-KDDI commented 1 month ago

Problem description

Consider a scoring feature for KYC Match. (Spun off from Issue #65, item No. 1, as per Action Item #13.03.)

KevScarr commented 1 month ago

Hi @ToshiWakayama-KDDI, linking to an earlier thread with a good discussion of the 'score' concept: [#46].

I would summarise and propose the following, where 'attribute' stands for any field in the existing KYC specification:

When a response is "attributeMatch: 'false'", we include an extra response field "attributeScore: 70".
Example:
- when "familyNameAtBirthMatch: 'false'" is returned, a new response field "familyNameAtBirthScore: 70" is included
Rules:
- Numeric attributes are not scored, e.g. birthdate (distance scores wouldn't make sense)
- A score is returned only when the response "attributeMatch" is 'false'
- The score value is a whole-number percentage from 0 to 100 (0 = no match, 100 = exact match)
- For consistency, the Jaro-Winkler distance algorithm is recommended, as used by other operators that are live today (applied after normalisation).
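
As a rough illustration of the rules above, the sketch below derives a 0 to 100 score from a textbook Jaro-Winkler similarity. The function names, the normalisation step (trimming, case folding, accent stripping) and the Winkler parameters (prefix weight 0.1, prefix capped at 4 characters) are assumptions made for illustration; the thread does not define them.

```python
import unicodedata


def normalise(text: str) -> str:
    """Hypothetical normalisation: trim, case-fold and strip accents."""
    decomposed = unicodedata.normalize("NFKD", text.strip().casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0.0, 1.0]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        start = max(0, i - window)
        end = min(i + window + 1, len2)
        for j in range(start, end):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if not matched1[i]:
            continue
        while not matched2[k]:
            k += 1
        if s1[i] != s2[k]:
            out_of_order += 1
        k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (Winkler modification)."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * prefix_weight * (1.0 - sim)


def attribute_score(provided: str, on_file: str) -> int:
    """Whole-number score from 0 (no similarity) to 100 (exact match)."""
    return round(jaro_winkler(normalise(provided), normalise(on_file)) * 100)
```

With rounding, a very long and nearly identical pair could round up to 100; whether such cases should be capped at 99 is not settled in this thread.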

HuubAppelboom commented 1 month ago

Hi @KevScarr, why not provide the score as well when "attributeMatch" is "true" but there is a small difference (probably a spelling mistake on either side)? Or do you propose returning "true" only when the score is 100%?

KevScarr commented 1 month ago

@HuubAppelboom I would suggest that 'true' equates to an exact match, i.e. a score of 100. For close matches, i.e. when a score is returned, let the consuming service judge whether the match is close enough to proceed (their use cases will drive their error tolerance).
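
A minimal sketch of how a consuming service might apply its own tolerance under this proposal. The function name and the default threshold of 90 are hypothetical; each consumer would pick a threshold that suits its use case.

```python
from typing import Optional


def accept_attribute(match: str, score: Optional[int] = None,
                     threshold: int = 90) -> bool:
    """Decide whether the consumer treats an attribute as matching.

    match     -- "true", "false" or "not_available", as returned by the API
    score     -- optional 0..100 score, expected only when match is "false"
    threshold -- hypothetical consumer-chosen tolerance
    """
    if match == "true":
        # 'true' means an exact match, so no score is needed.
        return True
    if match == "false" and score is not None:
        # Close match: the consumer decides whether it is close enough.
        return score >= threshold
    # "not_available", or "false" without a score.
    return False
```

A consumer with a stricter use case could call `accept_attribute("false", 95, threshold=98)` and reject the same response that a more tolerant consumer accepts.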

GillesInnov35 commented 1 month ago

Hi @HuubAppelboom, @KevScarr, I understand that an optional score might be added alongside the boolean attribute (true / false / not_available), which is mandatory whenever the attribute is provided in the request. In that case, I wonder whether the boolean attribute is still useful. At Orange, the response contains only a score as the match result, and the consumer has to decide. Gilles

KevScarr commented 1 month ago

@GillesInnov35 @HuubAppelboom Fair point. I was purely thinking about a customer of the service migrating from the previous version to this one, where backward compatibility would be important. I'd say the score is only provided when a boolean false is returned; outside of that condition it offers little value. For Orange: do you still respond with a not-available indicator? And can you share which algorithm you're using (Jaro-Winkler?)

GillesInnov35 commented 1 month ago

Yes, sure, Kevin. Backward compatibility will be an important point, but as KYC Match version 1.0.0 has not been published, I wonder whether it is a problem. Maybe it is. To answer your question:

- The matching algorithm implemented by the French MNOs is based on the Jaro-Winkler distance.
- The score is a value between 0 and 100: the higher the score, the more similar the strings. The value 100 means an exact match and the value 0 means there is no similarity.
- The score "-1" is a special value: it indicates that the requested value was not found by the MNO.

Thanks a lot for your active contribution. Regards, Gilles

KevScarr commented 1 month ago

Makes sense. So you would return a '-1' when the attribute wasn't available for checking, hence no requirement to have the boolean field in your current response.

If no MNO has implemented the current version then it's a fair shout to move towards a score only approach.
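
To make the relationship between the two styles concrete, here is a small sketch that maps a score-only result (with -1 meaning the value was not found) onto the tri-state match values discussed earlier. The function name is hypothetical, and treating only a score of 100 as "true" follows the suggestion made earlier in this thread rather than any agreed rule.

```python
def score_to_match_result(score: int) -> str:
    """Map a score-only result onto "true" / "false" / "not_available"."""
    if score == -1:
        # Special value: the MNO could not find the requested attribute.
        return "not_available"
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    # Assumption: only an exact match (100) is reported as "true".
    return "true" if score == 100 else "false"
```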

HuubAppelboom commented 1 month ago

@KevScarr @GillesInnov35 We may need to think of an approach that can be extended further. For example, I think it may be a good idea to indicate whether the data is unverified or has been verified by the MNO. That way we can achieve a larger market reach by also including unverified attributes, and the CSP can then decide whether to use that attribute or not.

ToshiWakayama-KDDI commented 1 month ago

Hi @GillesInnov35 , @HuubAppelboom , @KevScarr , all,

Thank you for your prompt comments/discussion, which I did not expect actually.

I should have informed you that there is a KYC Match scoring enhancement proposal in the API Backlog WG. Once we have received the proposal, we should continue our scoring discussion taking it into account. We should wait for it, but I don't think it will take long.

I will update the status.

Best regards, Toshi

ToshiWakayama-KDDI commented 1 month ago

Hi @GillesInnov35 , @HuubAppelboom , @KevScarr, all,

Our implementation is based on v0.1.0, and actually we do not need the scoring feature, so we would insist that the KYC Match API should work without scoring. As I understand it, that is the original OGW scope, and it is also important for an OGW global API. In addition, as we all know, we have already put our efforts into v0.1.0, so I believe we should keep our initial design and consider backward compatibility as much as possible.

Thanks, Toshi

HuubAppelboom commented 1 month ago

As a suggestion for how to add the score and other information to the API response while maintaining backward compatibility and leaving room for extension, we could add an extra string (when applicable) to the response for attributes where a score is relevant.

For example, attributeMatch will keep the values "true", "false" and "not_available" (as today), and we add an extra field "attributeMatchInfo" containing items like "score=89 unverified" to signal that the Jaro-Winkler score is 89 but that the source data has not been verified by the MNO. When we have additional metadata, it can be added in future.

So for example you will get:

givenNameMatch : false
givenNameMatchInfo : score=95 verified
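
A consumer would need to parse such an info string. The sketch below assumes the format is a space-separated list of tokens, with "score=<n>" plus either "verified" or "unverified", and that unknown tokens are ignored so that metadata can be added later. The exact grammar is not defined in this thread, so the function name and token rules are assumptions.

```python
def parse_match_info(info: str) -> dict:
    """Parse a hypothetical "<attribute>MatchInfo" string.

    Returns {"score": int or None, "verified": bool or None}.
    """
    result = {"score": None, "verified": None}
    for token in info.split():
        if token.startswith("score="):
            result["score"] = int(token[len("score="):])
        elif token == "verified":
            result["verified"] = True
        elif token == "unverified":
            result["verified"] = False
        # Any other token is ignored, leaving room for future metadata.
    return result
```

For the example above, `parse_match_info("score=95 verified")` yields a score of 95 with the verified flag set.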

GillesInnov35 commented 1 month ago

Hi @ToshiWakayama-KDDI, all, thanks for your comment. I had a look at the API Backlog issue/PR opened by @jgarciahospital on the API Enhancement Proposal "KYC-Match Scoring". It is in line with our current discussion on how to add match score information, so it is interesting. I'm afraid it will be difficult to offer backward compatibility if we have to replace a simple attribute with an object structure after version 0.1.0. This is just my point of view, to be discussed. For example:

BR Gilles

claraserranosolsona commented 3 weeks ago

Hi all,

As mentioned in last week's meeting:

1) Telefonica has implemented v0.1.0, therefore we would need backward compatibility in v0.2.0.

2) This would be in line with the proposal of keeping the current true/false/not_available response and, in the case of false, adding a score. For example:

• Keep current attributes -> "attributeMatch": true/false/not_available
• If false, add additional parameters -> "attributeScore": X%

From a technical perspective, this should keep backward compatibility: in OAS3 there is a keyword called “additionalProperties” which indicates whether an object (our response in this case) may contain additional, undocumented properties. The default value of “additionalProperties” is true, so in CAMARA we assume it is true. Clients should therefore be ready to receive additional properties. It would be worth checking this.

3) However, changing a simple attribute to an object structure would break backward compatibility, so it is not an option for us.

4) OK to proceed with the following rules proposed for the score:

• Numeric attributes are not scored, e.g. birthdate
• A score is returned only when the response "attributeMatch" is 'false'
• The score value is a whole-number percentage from 0 to 100 (0 = no match, 100 = exact match)
• The Jaro-Winkler distance algorithm is used (after normalisation has been applied).
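
The rules above can be sketched as a response builder. Everything named here is hypothetical: the "<attribute>Score" field naming follows the example given earlier in the thread, the exclusion set follows the attribute table proposed elsewhere in this thread, and the default scorer is a difflib-based stand-in (not Jaro-Winkler) that is passed as a parameter so an operator could swap in its own algorithm.

```python
from difflib import SequenceMatcher

# Attributes marked "No" in the attribute table proposed in this thread.
NO_SCORE = {"idDocument", "postalCode", "houseNumberExtension",
            "birthdate", "gender"}


def stand_in_score(a: str, b: str) -> int:
    """Placeholder scorer based on difflib's ratio, NOT Jaro-Winkler."""
    # int() truncates, so two different strings never reach 100.
    return int(SequenceMatcher(None, a, b).ratio() * 100)


def build_response(requested: dict, on_file: dict,
                   scorer=stand_in_score) -> dict:
    """Keep the tri-state match fields and add a score only on 'false'."""
    response = {}
    for name, value in requested.items():
        stored = on_file.get(name)
        if stored is None:
            response[f"{name}Match"] = "not_available"
            continue
        # Minimal stand-in normalisation: trim and case-fold.
        a, b = value.strip().casefold(), stored.strip().casefold()
        if a == b:
            response[f"{name}Match"] = "true"
            continue
        response[f"{name}Match"] = "false"
        if name not in NO_SCORE:
            response[f"{name}Score"] = scorer(a, b)
    return response


def read_v010(response: dict) -> dict:
    """A v0.1.0-style client that reads only the *Match fields it knows."""
    return {k: v for k, v in response.items() if k.endswith("Match")}
```

Because the score fields are purely additive, the `read_v010` client sees exactly the fields it saw before, which is the backward-compatibility property point 2 relies on.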

Regards, Clara

GillesInnov35 commented 3 weeks ago

Hi all, thanks Clara for this detailed summary. If we must address backward compatibility because v0.1.0 is already deployed, I agree with you that we should add new optional score attributes. Do you think we have time to come up with a design based on the OAS3 specification that avoids a long list of attributes? BR Gilles

KevScarr commented 2 weeks ago

Building on Issue #96, we should follow the same design convention (define once, use many):

ScoreMatchResult:
    type: integer
    description: Attribute comparison score as a percentage for string comparisons
    example: 85
    minimum: 0
    maximum: 100

KYC_MatchResponse:
    type: object
    properties:
        idDocumentMatch:
            $ref: '#/components/schemas/MatchResult'

        nameMatch:
            $ref: '#/components/schemas/MatchResult'
            $ref: '#/components/schemas/ScoreMatchResult'

        givenNameMatch:
            $ref: '#/components/schemas/MatchResult'
            $ref: '#/components/schemas/ScoreMatchResult'

ScoreMatchResult would appear for all attribute fields except the following, as they are numeric, enum or ID based:

idDocumentMatch
streetNumberMatch
birthdateMatch
genderMatch

When a field is numeric-only in a particular country, then, as per the summary above, the score wouldn't be returned.

KevScarr commented 1 week ago

I've taken the attributes from the current version of the specification and, following the rules, given an initial view of which attributes can fully support a 'score' concept. It would be good to reach a common view across as many countries as possible, as that will make updating the YAML spec straightforward.

Attribute	Optional Score Available	Comment
idDocumentMatch	No	It’s an ID number.
nameMatch	YES
givenNameMatch	YES
familyNameMatch	YES
nameKanaHankakuMatch	???	Are these fields in next release?
nameKanaZenkakuMatch	???	Are these fields in next release?
middleNamesMatch	YES
familyNameAtBirthMatch	YES
addressMatch	YES
streetNameMatch	YES
streetNumberMatch	YES	Is this houseName in some countries / assumption yes
postalCodeMatch	No	Being out by one letter can be a different place.
regionMatch	YES
localityMatch	YES
countryMatch	YES
houseNumberExtensionMatch	No	It’s numeric, not relevant.
birthdateMatch	No	It’s numeric, not relevant.
emailMatch	YES
genderMatch	No	It’s an enum type.

Some fields will be all numeric in some countries and a mixture in others. The table above captures which match attributes in the “KYC_MatchResponse” can support a ScoreMatch.

@ToshiWakayama-KDDI Should the nameKana*Match attributes also have scores in this next version of the specification (ie will these attributes remain here or be in an extension)?

fernandopradocabrillo commented 1 week ago

Building on Issue #96, we should follow the same design convention (define once, use many):

ScoreMatchResult:
    type: integer
    description: Attribute comparison score as a percentage for string comparisons
    example: 85
    minimum: 0
    maximum: 100

KYC_MatchResponse:
    type: object
    properties:
        idDocumentMatch:
            $ref: '#/components/schemas/MatchResult'

        nameMatch:
            $ref: '#/components/schemas/MatchResult'
            $ref: '#/components/schemas/ScoreMatchResult'

        givenNameMatch:
            $ref: '#/components/schemas/MatchResult'
            $ref: '#/components/schemas/ScoreMatchResult'

Hi @KevScarr, I agree with the proposal to create a common schema for the response objects, but I don't fully understand what the final result is here. As far as I know, in OAS3 we cannot use two $ref entries at the same level.

From TEF, our proposal is mainly focused on not losing backward compatibility, as we are already integrated with clients, so the design could be simpler:

     idDocumentMatch:
         $ref: '#/components/schemas/MatchResult'
     idDocumentScoreMatch:
         $ref: '#/components/schemas/ScoreMatchResult'

We can document that the ScoreMatch properties will only be returned if the related Match property is false.
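
The documented rule could also be checked mechanically. The sketch below assumes the sibling naming shown above ("<attribute>Match" paired with "<attribute>ScoreMatch") and string match values; the function name and error messages are hypothetical.

```python
def check_score_fields(response: dict) -> list:
    """Return a list of rule violations for the *ScoreMatch fields."""
    problems = []
    suffix = "ScoreMatch"
    for key, value in response.items():
        if not key.endswith(suffix):
            continue
        # e.g. "nameScoreMatch" pairs with "nameMatch".
        match_key = key[: -len(suffix)] + "Match"
        if response.get(match_key) != "false":
            problems.append(f"{key} present but {match_key} is not 'false'")
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 0 <= value <= 100:
            problems.append(f"{key} must be an integer from 0 to 100")
    return problems
```

An empty list means the response respects both the "only when false" rule and the 0 to 100 range.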

GillesInnov35 commented 1 week ago

Hi @fernandopradocabrillo, I think it works well with the allOf keyword.

allOf:
        - $ref: '#/components/schemas/MatchResult'
        - $ref: '#/components/schemas/ScoreMatchResult'

To be confirmed, I suppose. BR Gilles

GillesInnov35 commented 1 week ago

Hi @fernandopradocabrillo, you're right. My proposal below can't be applied.

allOf:
        - $ref: '#/components/schemas/MatchResult'
        - $ref: '#/components/schemas/ScoreMatchResult'

I agree with yours, given that backward compatibility is expected. Gilles
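
One way to see why the allOf combination cannot be applied: allOf requires a value to satisfy every listed schema at once. Assuming MatchResult is a string enum ("true", "false", "not_available") and ScoreMatchResult is an integer from 0 to 100, no single value can be both. The predicates below are hand-written stand-ins for those two schemas, not a real schema validator.

```python
MATCH_VALUES = {"true", "false", "not_available"}


def is_match_result(value) -> bool:
    """Stand-in for the MatchResult schema (assumed to be a string enum)."""
    return isinstance(value, str) and value in MATCH_VALUES


def is_score_match_result(value) -> bool:
    """Stand-in for the ScoreMatchResult schema (integer, 0..100)."""
    is_int = isinstance(value, int) and not isinstance(value, bool)
    return is_int and 0 <= value <= 100


def satisfies_all_of(value) -> bool:
    """allOf semantics: the value must satisfy every subschema."""
    return is_match_result(value) and is_score_match_result(value)


# Every value that is valid for one schema fails the other.
candidates = ["true", "false", "not_available", 0, 85, 100]
none_valid = not any(satisfies_all_of(v) for v in candidates)
```

This is why separate sibling properties, one per schema, are the workable option here.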

ToshiWakayama-KDDI commented 6 days ago

Hi @KevScarr , all,

"@ToshiWakayama-KDDI Should the nameKana*Match attributes also have scores in this next version of the specification (ie will these attributes remain here or be in an extension)?"

Thank you for asking me about this. We would prefer to have scores for the nameKanaHankakuMatch and the nameKanaZenkakuMatch attributes in this next version.

Sorry for the late reply, as I needed to discuss this internally.

BR Toshi

ToshiWakayama-KDDI commented 5 days ago

Hi @KevScarr , @fernandopradocabrillo , @GillesInnov35 , @claraserranosolsona , all

I have a question to clarify the way of scoring.

It seems that the Jaro-Winkler distance algorithm will be used for scoring string-type attributes (after normalisation has been applied). However, I think it should be up to each operator to choose how to calculate the score.

The reason is that, even though the Jaro-Winkler distance algorithm could be used as the common approach in Europe, it is unclear whether it can be used for other languages or, if it can, whether it is the best-suited algorithm for them. That is my concern, and we ourselves are not sure about using the Jaro-Winkler distance algorithm for the Japanese language.

So, is it OK for each operator to choose how to calculate the score, or are there any other thoughts?

Thanks, Toshi KDDI

GillesInnov35 commented 4 days ago

Hi @ToshiWakayama-KDDI, all, I don't really know whether this algorithm works for all languages, but it should (to be confirmed). I think we should validate a unique algo to have the same specifications and the same rules for all KYC Match API providers and avoid specific implementation.

BR Gilles

ToshiWakayama-KDDI commented 4 days ago

Hi @Gilles, Thanks for your comments.

"I think we should validate a unique algo to have the same specifications and the same rules for all KYC Match API providers and avoid specific implementation."

I agree with this sentence. However, as the Jaro-Winkler algorithm has not been proven effective for languages other than European ones, it would not be a good idea to specify it as the mandatory algorithm. If specific algorithms are needed in the KYC Match API spec, then, for example, Jaro-Winkler could be the recommendation for European languages, while the algorithm for other languages should be TBD.

Would this be a possible way forward?

BR Toshi

camaraproject / KnowYourCustomer

[KYC Match] Scoring #85