Open seadowg opened 6 hours ago
Very interested in thoughts from @chrissyhroberts here! @tobiasmcnulty has already built a working version of this for a specific use case (https://github.com/caktus/kafis), and I think adding a generalized version would be useful for others.
Yes, so in the README I talk about A) enrolment, B) verification and C) identification.
The existing features cover enrolment and verification. This function would be identification.
There's already an example of how to do this with the CLI, but on-device identification would be great. My guess is that it would probably be unfeasible at scale, because Android devices have limited multithreading and the database of reference templates would be big.
The obvious solution would be to partition the problem to reduce the set of references. This would be like using a choice filter.
Let's say you start with a db of 10,000 reference templates.
You aren't going to be able to compare the index template against all 10,000 in a reasonable time.
So you filter on region, gender, age group and so on, each time reducing the reference set. After a few parameters are used you'll be down to tens of references, then you run the comparison.
The issue will be when people move district, move into a different age group as they get older, and so on.
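To make the filtering idea above concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `Reference` fields, the `identify` helper and the `toy_score` stand-in scorer are illustrations only and are not part of the app or the CLI. A real implementation would call the actual fingerprint matcher instead of `toy_score`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Reference:
    # Hypothetical metadata stored alongside each reference template.
    person_id: str
    region: str
    gender: str
    age_group: str
    template: bytes


def toy_score(a: bytes, b: bytes) -> float:
    """Stand-in scorer (NOT a fingerprint matcher): share of equal bytes."""
    if not a or not b:
        return 0.0
    same = sum(x == y for x, y in zip(a, b))
    return same / max(len(a), len(b))


def identify(
    probe: bytes,
    references: Iterable[Reference],
    score: Callable[[bytes, bytes], float],
    threshold: float,
    **filters: str,
) -> Optional[Tuple[str, float]]:
    """Filter the reference set on metadata, then score only the survivors.

    Returns (person_id, score) for the best candidate at or above the
    threshold, or None if nothing qualifies.
    """
    candidates = [
        ref for ref in references
        if all(getattr(ref, field) == value for field, value in filters.items())
    ]
    best: Optional[Tuple[str, float]] = None
    for ref in candidates:
        s = score(probe, ref.template)
        if s >= threshold and (best is None or s > best[1]):
            best = (ref.person_id, s)
    return best


refs = [
    Reference("p1", "north", "f", "18-30", b"AAAA"),
    Reference("p2", "north", "m", "18-30", b"AABB"),
    Reference("p3", "south", "f", "18-30", b"AAAA"),
]

# Only p1 survives the filter, so a single comparison is run.
hit = identify(b"AAAB", refs, toy_score, 0.7, region="north", gender="f")

# The caveat from above: if the person is now recorded under another
# district, the filter removes their reference and the match is missed.
miss = identify(b"AAAA", refs, toy_score, 0.7, region="east")
```

One way to soften the "moved district" problem would be to fall back to a wider filter (or none) when the narrow one returns nothing, at the cost of more comparisons.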
Very important feature
@chrissyhroberts Thanks for the quick response. Just to clarify, our CLI tool is more aimed at a deduplication context rather than identification per se.
That said, identification is an interesting thought and could certainly be accommodated in the Android app with much of the same code.
We observed around 240,000 matches per second on server hardware, so while the match rate would certainly be less on an Android device, it could still be quite tolerable, depending on the population size of course.
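For a rough sense of scale, the back-of-envelope arithmetic can be written out as below. The 240,000 matches per second is the server figure quoted above; the device slowdown factor is purely a placeholder assumption, not a measurement, and the sketch assumes one template comparison per reference.

```python
SERVER_MATCHES_PER_SECOND = 240_000  # figure quoted above, server hardware


def identification_seconds(n_references: int, matches_per_second: float) -> float:
    """Time to compare one probe template against every reference."""
    return n_references / matches_per_second


def deduplication_seconds(n_templates: int, matches_per_second: float) -> float:
    """Time to compare every template with every other one: n * (n - 1) / 2 pairs."""
    pairs = n_templates * (n_templates - 1) // 2
    return pairs / matches_per_second


def device_rate(slowdown: float) -> float:
    """Hypothetical device throughput, given an ASSUMED slowdown factor."""
    return SERVER_MATCHES_PER_SECOND / slowdown


# Identification of one probe against 10,000 references at the server rate:
print(identification_seconds(10_000, SERVER_MATCHES_PER_SECOND))  # roughly 0.04 s

# Full deduplication of 10,000 templates at the server rate:
print(deduplication_seconds(10_000, SERVER_MATCHES_PER_SECOND))  # 208.3125 s

# Same identification with a made-up 20x slowdown for a phone:
print(identification_seconds(10_000, device_rate(20)))  # still under a second
```

The point is only that identification grows linearly with the population while deduplication grows quadratically, which is why the two use cases have very different tolerances.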
Currently, if you have sets of templates (the left and right thumbs, for example) from individuals and want to find duplicates within them, you'd need to write a script that compares pairs of templates using `match`. Without a lot of work to run this in parallel, it would be very slow, as every set has to be compared to every other set.
Keppel's CLI could help with this by providing a command to find matches within a set of data and execute the comparisons in parallel to speed things up. Something like:
The input CSV could have columns `id`, `template_1`, `template_2` etc. and the output CSV would have `id_1`, `id_2`, `score_1`, `score_2` etc. The `-t` and `-p` arguments would allow customising the matching threshold (limiting which pairs get output) and the number of threads used to execute comparisons in parallel, respectively.
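As a rough illustration of what such a command would do internally, here is a Python sketch of the pairwise comparison using the input and output columns described above. It is not Keppel code: `find_duplicates` and `toy_score` are hypothetical names, `toy_score` is a stand-in for the real matcher, and requiring every score in a pair to reach the threshold is an assumption (the proposal only says the threshold limits which pairs get output).

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from typing import Callable, Dict, List


def toy_score(a: str, b: str) -> float:
    """Stand-in scorer (NOT a fingerprint matcher): share of equal characters."""
    if not a or not b:
        return 0.0
    same = sum(x == y for x, y in zip(a, b))
    return same / max(len(a), len(b))


def find_duplicates(
    rows: List[Dict[str, str]],
    score: Callable[[str, str], float],
    threshold: float,
    workers: int,
) -> List[Dict[str, object]]:
    """Compare every row with every other row, template column by column."""
    if not rows:
        return []
    template_cols = sorted(c for c in rows[0] if c.startswith("template_"))

    def compare(pair):
        left, right = pair
        scores = [score(left[c], right[c]) for c in template_cols]
        return left["id"], right["id"], scores

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(compare, combinations(rows, 2))
        # ASSUMPTION: a pair is output only if all of its scores pass.
        return [
            {"id_1": a, "id_2": b,
             **{f"score_{i}": s for i, s in enumerate(scores, start=1)}}
            for a, b, scores in results
            if all(s >= threshold for s in scores)
        ]


# Toy input in the proposed CSV layout; real templates would not look like this.
data = "id,template_1,template_2\na,AAAA,BBBB\nb,AAAB,BBBB\nc,ZZZZ,YYYY\n"
rows = list(csv.DictReader(io.StringIO(data)))
duplicates = find_duplicates(rows, toy_score, threshold=0.7, workers=2)
```

Note that in Python a thread pool only speeds things up if the scorer releases the GIL (native code, for example); with a pure-Python scorer like the toy one here, `ProcessPoolExecutor` would be the thing to swap in.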