ACEMS / ACEMS-ABS2017

This repository houses details for the ACEMS + ABS Collaborative Research Workshops held on October 17th, 2017.

Statistically efficient linkage validation #3

Open brubinstein opened 6 years ago

brubinstein commented 6 years ago

Different communities validate linkages in a variety of ways: for example, by examining likelihood under estimated parameters, or the coefficients of variables under linear models (like Fellegi-Sunter). Or one might take an independent set, annotate it somehow with "ground truth" (perhaps through some expensive process, acceptable due to limited scale), and evaluate some kind of accuracy statistic, perhaps precision/recall (or similarly sensitivity/specificity). A frequentist might like this sample statistic to be close to the population version, but achieving this is challenging when datasets contain large numbers of records: the match/non-match classes grow incredibly imbalanced.
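To make the imbalance point concrete, here is a rough back-of-the-envelope sketch. It assumes one-to-one linkage between two datasets (each record matches at most one record in the other), which is an assumption for illustration and not something stated above; `match_fraction` is a hypothetical helper name.

```python
def match_fraction(n_a: int, n_b: int) -> float:
    """Upper bound on the fraction of record pairs that are true matches,
    assuming one-to-one linkage between datasets of size n_a and n_b."""
    max_matches = min(n_a, n_b)   # at most one match per record in the smaller set
    total_pairs = n_a * n_b       # every cross-dataset pair is a candidate
    return max_matches / total_pairs


for n in (1_000, 100_000, 10_000_000):
    bound = match_fraction(n, n)
    print(f"{n:>12,} records per dataset: match fraction <= {bound:.1e}")
```

Under that assumption the match fraction shrinks like 1/n, so a uniformly drawn sample of pairs will contain almost no matches once the datasets are large.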

In some recent work, an RHD student, Neil Marchant (who is incidentally now interning in ABS-MD through AMSI), and I looked at adaptive stratified importance sampling to help with the sampling piece. You'd like to quickly figure out which pairs of records (in a two-dataset setting) you should be sampling for annotation, so that you don't have to label an inordinate number of them to obtain good estimates of population parameters like precision/recall/sensitivity/specificity. We prove some asymptotic results for the resulting estimator in the VLDB 2017 paper, and have released the ideas as a Python package, OASIS, on PyPI (like CRAN for Python).
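As a rough illustration of the general idea (not the OASIS algorithm or its API), the sketch below compares uniform sampling with a fixed, score-biased importance sampler for estimating recall on a synthetic pool. It is static and unstratified, whereas the method described above is adaptive and stratified. Everything here is an assumption for illustration: the synthetic data, the 0.6 decision threshold, the `scores ** 4` bias, the 10% uniform mixture, and the helper name `estimate_recall`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic pool of candidate record pairs (illustrative assumptions only).
n_pairs = 200_000
labels = rng.random(n_pairs) < 0.001          # hidden ground truth, ~0.1% matches
scores = np.where(labels,
                  rng.beta(5, 2, n_pairs),    # matches tend to score high
                  rng.beta(1, 8, n_pairs))    # non-matches tend to score low
preds = scores > 0.6                          # the linkage decisions being validated

# Population recall: known here only because the data are synthetic.
true_recall = (labels & preds).sum() / labels.sum()

# Two proposal distributions over pairs.
uniform = np.full(n_pairs, 1.0 / n_pairs)
bias = scores ** 4                            # favour pairs that look like matches
biased = 0.1 * uniform + 0.9 * bias / bias.sum()  # uniform mixture bounds the weights


def estimate_recall(budget, proposal):
    """Self-normalised importance-sampling estimate of recall.

    Draws `budget` pairs from `proposal`, "annotates" them by looking up
    the hidden labels, and reweights each draw by uniform / proposal.
    """
    idx = rng.choice(n_pairs, size=budget, replace=True, p=proposal)
    w = (1.0 / n_pairs) / proposal[idx]       # importance weights
    y = labels[idx]                           # oracle annotations
    denom = np.sum(w * y)                     # weighted count of sampled matches
    if denom == 0:
        return float("nan")                   # no matches sampled at all
    return float(np.sum(w * y * preds[idx]) / denom)


budget = 2_000
print("true recall:        ", true_recall)
print("uniform estimate:   ", estimate_recall(budget, uniform))
print("biased estimate:    ", estimate_recall(budget, biased))
```

With a uniform proposal, a budget of a couple of thousand labels is expected to contain only a handful of matches, so the estimate is very noisy or undefined. The biased proposal spends most of the labelling budget on pairs that are likely to be matches and corrects for that bias through the weights.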

jesse-jesse commented 6 years ago

I know of academics from Uni Adelaide and QUT who are working on record linkage, as well as people in the ATO interested in this space. It would be nice to find some crossover between the different people working in this area.

ngmarchant commented 6 years ago

Apart from the work that Ben described above, I'm also interested in hearing about Bayesian approaches to data linkage. I'm currently investigating this topic as part of my internship with ABS-MD.

mroughan commented 6 years ago

We have interests that can roughly be broken up into three categories: (i) privacy and linkage; (ii) linkage that is more than pairwise, using global operators and graph algebras; and (iii) statistical inference on linked data.

We have a student starting, hopefully before the end of the year if visas can be sorted, on some combination of these topics, funded through the D2D CRC. Her exact topic and direction will be sorted once she starts.