Closed kasparvonbeelen closed 3 years ago
@fedenanni @mcollardanuy @BarbaraMcG I tried to fix most of the apparent bugs, even though this entailed quite some choices that require some careful review.
binarize
function. Imagine, for example, that our target period is 1800-1850, but sense Y has only quotations from an earlier (or later) period. In this case, the sense is "alive" but there are no observations within this date range, which would result in empty train,dev,test sets? What I did is allow the training data to contain senses outside of the defined date range, but filter observation for the test set (in this case we can use information outside of our target period to build sense embedding for the target period).Hi @kasparvonbeelen,
107 #116 should be straightforward. For #116 I removed all lemmas that only have one sense
Thanks! I could have a look if you want.
112 is a bit trickier. I was a bit more careful with applying time filters to quotations in
binarize
function. Imagine, for example, that our target period is 1800-1850, but sense Y has only quotations from an earlier (or later) period. In this case, the sense is "alive" but there are no observations within this date range, which would result in empty train,dev,test sets? What I did is allow the training data to contain senses outside of the defined date range, but filter observation for the test set (in this case we can use information outside of our target period to build sense embedding for the target period).- Question: maybe filtering quotations by date doesn't make much sense when selecting and binarizing data. We should select all quotations for senses that are current in the target period? This would be easy to fix, just not sure what's the best solution?
So at the moment we are using all quotations of the senses that are within our target period when training, and testing only on quotations from the target period, did I understand correctly? (This sounds good to me)
Good morning @mcollardanuy !
Thanks! I could have a look if you want.
If you can? They are just a few lines.
So at the moment we are using all quotations of the senses that are within our target period when training, and testing only on quotations from the target period, did I understand correctly? (This sounds good to me)
Correct! It also allows us to have more training data.
Hi @kasparvonbeelen all changes look very good!
I was wondering about your question above and having had a look at some examples I can see it's tricky, as filtering on quotation means that we're reducing the test set quite dramatically for some of the senses (e.g. 900 quotations to 90, and maybe biased towards certain senses). So I was wondering if we could have line 285 as an option for our experiments (but maybe for the future, not for now).
@mcollardanuy I was wondering the same. It's just that for these experiments, we care a lot about the time a quotation was written, and if we relax this condition in line 285 we can't really know? But I agree it's sensible to make it optional.
I agree with both of you that it would be interesting to try both options and see what the results say
Closes #107 Closes #116 Closes #112