I think the Affiliations dataset from the Leipzig University group could be a good addition, especially for testing embedding-based or LLM-based methods. The ground-truth labeling is not perfect, but noisy labels are part of what we have to deal with in entity resolution (ER).
This PR adds the CSV dataset and its preparation script to the _datasets folder.
Here is a sample from the CSV:
record_id,label_true,affiliation
7927,7927,", IBM Almaden Research Center, 650 Harry Road, CA 95120, San Jose, USA"
7930,7930,", IIT Bombay"
7987,7987,", University of California, San Diego, USA"
5613,5613,"28msec Inc., Zurich, Switzerland"
9530,5613,"28msec, Inc."
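
For anyone who wants to try the file quickly, here is a minimal sketch of reading it and deriving the ground-truth matching pairs. It assumes that records sharing a `label_true` value refer to the same entity, which is how the sample above reads (9530 points at 5613). The `true_pairs` helper and the inline `SAMPLE` string are illustrative only and are not part of this PR.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations

# The five sample rows from above, inlined so the sketch is self-contained.
SAMPLE = '''record_id,label_true,affiliation
7927,7927,", IBM Almaden Research Center, 650 Harry Road, CA 95120, San Jose, USA"
7930,7930,", IIT Bombay"
7987,7987,", University of California, San Diego, USA"
5613,5613,"28msec Inc., Zurich, Switzerland"
9530,5613,"28msec, Inc."
'''


def true_pairs(rows):
    """Return the set of (record_id, record_id) pairs that share a label_true.

    Assumption: label_true is a cluster identifier, so any two records with
    the same value are a true match.
    """
    clusters = defaultdict(list)
    for row in rows:
        clusters[row["label_true"]].append(row["record_id"])

    pairs = set()
    for members in clusters.values():
        # Sort so each pair has one canonical ordering.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


# csv.DictReader handles the quoted affiliation strings with embedded commas.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
pairs = true_pairs(rows)
print(pairs)
```

On the real file, replacing `io.StringIO(SAMPLE)` with an open file handle should be the only change needed. Note that the ids are kept as strings here; cast them to `int` if numeric ids are preferred.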