INK-USC / shifted-label-distribution

Source code for paper "Looking Beyond Label Noise: Shifted Label Distribution Matters in Distantly Supervised Relation Extraction" (EMNLP 2019)
https://arxiv.org/abs/1904.09331
Apache License 2.0

label distribution estimation #2

Open ShellingFord221 opened 4 years ago

ShellingFord221 commented 4 years ago

Hi, in your paper, for the estimation of the test distribution p(r|Dm), you use "maximum likelihood estimation on a held-out clean dev set, which is a 20% sample from test set". I wonder how you do this in your code? Thanks!

ShellingFord221 commented 4 years ago

Also, why do you use maximum likelihood estimation? I think the easiest way is just to count the number of instances of each class in the held-out clean dev set, since the distribution of the 20% sample can represent the distribution of the whole test set. Thanks!

cherry979988 commented 4 years ago

Hello @ShellingFord221 . Thank you for your questions!

https://github.com/INK-USC/shifted-label-distribution/blob/3cf2b7ced3b2e18234db405f6014f049c4830d71/Neural/eva.py#L102

Around these lines we load a clean dev set (cdev_dset) and store its distribution in the variable cdev_lp.
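
For anyone landing here later, a minimal sketch of that estimation step, assuming labels are integer class ids and that the `lp` suffix of `cdev_lp` stands for log-probabilities (the helper name and usage below are illustrative assumptions, not the repository's exact code):

```python
import numpy as np

def estimate_label_log_probs(labels, num_classes, smoothing=1e-8):
    """Estimate p(r) on a clean dev set by counting class labels.

    Returns log-probabilities, matching the assumed meaning of the
    `lp` suffix in `cdev_lp`; `smoothing` avoids log(0) for classes
    that never appear in the 20% dev sample.
    """
    counts = np.bincount(np.asarray(labels), minlength=num_classes).astype(float)
    probs = (counts + smoothing) / (counts + smoothing).sum()
    return np.log(probs)

# Hypothetical usage with the clean dev set loaded above:
# cdev_lp = estimate_label_log_probs([ex.label for ex in cdev_dset], num_classes)
```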

For a discrete random variable, maximum likelihood estimation gives the empirical class frequencies as the estimated distribution. That is what we do here, so it is equivalent to the counting method you mentioned.
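
Concretely, here is the standard derivation (a textbook result, not specific to this repo) showing that the MLE for a categorical distribution is exactly the vector of class frequencies:

```latex
% Log-likelihood of class counts n_1, ..., n_K under p = (p_1, ..., p_K):
\ell(p) = \sum_{k=1}^{K} n_k \log p_k
  \quad \text{subject to} \quad \sum_{k=1}^{K} p_k = 1.
% Stationarity of the Lagrangian gives n_k / p_k = \lambda for every k,
% and the normalization constraint forces \lambda = \sum_k n_k = N, so
\hat{p}_k = \frac{n_k}{N},
% i.e. the maximum likelihood estimate is the empirical class frequency.
```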

I hope this answers your question :-)