LSHTM-ORK / ODK_Biometrics


Add ability to efficiently find duplicates in a large dataset to the CLI #50

Open seadowg opened 6 hours ago

seadowg commented 6 hours ago

Currently, if you have sets of templates (the left and right thumbs, for example) from individuals and want to find duplicates within that data, you'd need to write a script that compares pairs of templates using match. Without a lot of work to parallelise those comparisons, this would be very slow, as every set has to be compared against every other set.

Keppel's CLI could help with this by providing a command to find matches within a set of data and execute the comparisons in parallel to speed things up. Something like:

```
keppel pmatch -i input.csv -o output.csv -t 40 -p 16
```

The input CSV could have columns id, template_1, template_2, etc., and the output CSV would have id_1, id_2, score_1, score_2, etc. The -t and -p arguments would allow customising the matching threshold (limiting which pairs get output) and the number of threads used to execute comparisons in parallel, respectively.
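
For illustration, here's a rough sketch of how the parallel pairwise pass could work. This is not Keppel's actual API: TemplateMatcher is a hypothetical stand-in for whatever scoring call the CLI would use, and CSV reading/writing is omitted.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors

// Hypothetical stand-in for whatever scoring call the Keppel CLI exposes;
// assumed to return a similarity score for two templates (higher = more similar).
fun interface TemplateMatcher {
    fun score(a: ByteArray, b: ByteArray): Int
}

// One input CSV row: an id plus its templates (template_1, template_2, ...).
data class Record(val id: String, val templates: List<ByteArray>)

// One output CSV row: the two ids plus per-template scores (score_1, score_2, ...).
data class Match(val id1: String, val id2: String, val scores: List<Int>)

fun findDuplicates(
    records: List<Record>,
    matcher: TemplateMatcher,
    threshold: Int,
    threads: Int
): List<Match> {
    val pool = Executors.newFixedThreadPool(threads)
    try {
        // One task per unordered pair of records: N * (N - 1) / 2 comparisons in total.
        val tasks = mutableListOf<Callable<Match?>>()
        for (i in records.indices) {
            for (j in i + 1 until records.size) {
                val a = records[i]
                val b = records[j]
                tasks += Callable {
                    // Compare template_1 with template_1, template_2 with template_2, etc.
                    val scores = a.templates.zip(b.templates).map { (ta, tb) -> matcher.score(ta, tb) }
                    // Keep the pair only if at least one template clears the threshold.
                    if (scores.any { it >= threshold }) Match(a.id, b.id, scores) else null
                }
            }
        }
        // invokeAll blocks until every comparison has finished.
        return pool.invokeAll(tasks).mapNotNull { it.get() }
    } finally {
        pool.shutdown()
    }
}
```

Generating all the pairs up front keeps the worker tasks independent, so the -p argument maps directly onto the thread pool size.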

seadowg commented 6 hours ago

Very interested in thoughts from @chrissyhroberts here! @tobiasmcnulty has already built a working version of this for a specific use case (https://github.com/caktus/kafis), and I think adding a generalized version would be useful for others.

chrissyhroberts commented 5 hours ago

Yes, so in the README file I talk about:

A) enrolment, B) verification and C) identification

The existing features cover enrolment and verification. This function would be identification.

There's already an example of how to do this with the CLI, but on-device identification would be great. My guess is that it would probably be pretty infeasible at scale, because Android devices have limited multithreading and the database of reference templates would be big.

The obvious solution would be to partition the problem to reduce the set of references. This would be like using a choice filter.

Let's say you start with a database of 10,000 reference templates.

You aren't going to be able to compare an index template against 10,000 references in a reasonable time.

So you filter on region, gender, age group and so on, each time reducing the reference set. After a few parameters you'll be down to tens of references, and then you run the comparison.

The issue will be when people move district, or change age group because they've got older, and so on.
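
A minimal sketch of that partitioning idea, assuming hypothetical demographic fields on each reference record (the field names are illustrative, not part of Keppel):

```kotlin
// Hypothetical reference record: the demographic fields are illustrative only.
data class Reference(
    val id: String,
    val region: String,
    val gender: String,
    val ageGroup: String,
    val template: ByteArray
)

// Choice-filter-style narrowing: apply cheap attribute filters first so that only
// tens of references are left for the (expensive) biometric comparisons.
fun candidates(
    references: List<Reference>,
    region: String,
    gender: String,
    ageGroup: String
): List<Reference> =
    references.filter {
        it.region == region && it.gender == gender && it.ageGroup == ageGroup
    }
```

The narrowed list would then go to the biometric comparison step; the caveat above about people changing district or age group applies to whichever fields are used for filtering.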

Very important feature

tobiasmcnulty commented 2 hours ago

@chrissyhroberts Thanks for the quick response. Just to clarify, our CLI tool is aimed more at a deduplication context than at identification per se.

That said, identification is an interesting thought and could certainly be accommodated in the Android app with much of the same code.

We observed around 240,000 matches per second on server hardware, so while the match rate would certainly be lower on an Android device, it could still be quite tolerable, depending on the population size of course.
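
As a rough back-of-envelope illustration (assuming a hypothetical set of 10,000 individuals with one template each): deduplication needs 10,000 × 9,999 / 2 ≈ 50 million pairwise comparisons, which is about 3.5 minutes at 240,000 matches per second, whereas identifying one new individual against the same set is only 10,000 comparisons. Even at, say, a tenth of the server match rate, a device could finish an identification query in well under a second.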