NickCrews / mismo

The SQL/Ibis powered sklearn of record linkage
https://nickcrews.github.io/mismo/
GNU Lesser General Public License v3.0

Add Leipzig affiliations raw dataset #34

Closed OlivierBinette closed 4 months ago

OlivierBinette commented 6 months ago

I think the Affiliations dataset from the Leipzig University group could be a good addition, especially for testing embedding-based or LLM-based methods. The ground truth labeling is not perfect, but that's part of what we have to deal with in entity resolution (ER).

This PR adds the CSV dataset and its preparation script to the _datasets folder.

Here is a sample from the CSV:

record_id,label_true,affiliation
7927,7927,", IBM Almaden Research Center, 650 Harry Road, CA 95120, San Jose, USA"
7930,7930,", IIT Bombay"
7987,7987,", University of California, San Diego, USA"
5613,5613,"28msec Inc., Zurich, Switzerland"
9530,5613,"28msec, Inc."
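For anyone who wants to poke at the sample, here is a minimal sketch of how the rows could be grouped into ground-truth clusters using only the Python standard library. It assumes that records sharing a `label_true` value refer to the same entity, which is how the last two rows of the sample read. The `group_by_label` helper and the inlined `SAMPLE` string are illustrative only and are not part of this PR or of mismo's API.

```python
import csv
import io
from collections import defaultdict

# The sample rows from above, inlined so the sketch is self-contained.
SAMPLE = '''record_id,label_true,affiliation
7927,7927,", IBM Almaden Research Center, 650 Harry Road, CA 95120, San Jose, USA"
7930,7930,", IIT Bombay"
7987,7987,", University of California, San Diego, USA"
5613,5613,"28msec Inc., Zurich, Switzerland"
9530,5613,"28msec, Inc."
'''


def group_by_label(csv_text):
    """Map each label_true value to the affiliation strings that carry it.

    Hypothetical helper: assumes label_true is a cluster identifier.
    """
    clusters = defaultdict(list)
    # DictReader handles the quoted affiliation field, including embedded commas.
    for row in csv.DictReader(io.StringIO(csv_text)):
        clusters[row["label_true"]].append(row["affiliation"])
    return dict(clusters)


clusters = group_by_label(SAMPLE)
for label, affiliations in clusters.items():
    print(label, affiliations)
```

Under that assumption, the two "28msec" rows end up in the same cluster (label 5613), while the other three records are singletons.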

NickCrews commented 4 months ago

Thank you @OlivierBinette! Sorry this took so long to get to. This looks great!