Living-with-machines / TargetedSenseDisambiguation

Repository for the work on Targeted Sense Disambiguation
MIT License
1 stars 0 forks source link

urgent bugfixes #125

Closed kasparvonbeelen closed 3 years ago

kasparvonbeelen commented 3 years ago

Closes #107 Closes #116 Closes #112

kasparvonbeelen commented 3 years ago

@fedenanni @mcollardanuy @BarbaraMcG I tried to fix most of the apparent bugs, even though this entailed quite some choices that require some careful review.

mcollardanuy commented 3 years ago

Hi @kasparvonbeelen,

  • 107 #116 should be straightforward. For #116 I removed all lemmas that only have one sense

Thanks! I could have a look if you want.

  • 112 is a bit trickier. I was a bit more careful with applying time filters to quotations in binarize function. Imagine, for example, that our target period is 1800-1850, but sense Y has only quotations from an earlier (or later) period. In this case, the sense is "alive" but there are no observations within this date range, which would result in empty train,dev,test sets? What I did is allow the training data to contain senses outside of the defined date range, but filter observation for the test set (in this case we can use information outside of our target period to build sense embedding for the target period).

  • Question: maybe filtering quotations by date doesn't make much sense when selecting and binarizing data. We should select all quotations for senses that are current in the target period? This would be easy to fix, just not sure what's the best solution?

So at the moment we are using all quotations of the senses that are within our target period when training, and testing only on quotations from the target period, did I understand correctly? (This sounds good to me)

kasparvonbeelen commented 3 years ago

Good morning @mcollardanuy !

Thanks! I could have a look if you want.

If you can? They are just a few lines.

So at the moment we are using all quotations of the senses that are within our target period when training, and testing only on quotations from the target period, did I understand correctly? (This sounds good to me)

Correct! It also allows us to have more training data.

mcollardanuy commented 3 years ago

Hi @kasparvonbeelen all changes look very good!

I was wondering about your question above and having had a look at some examples I can see it's tricky, as filtering on quotation means that we're reducing the test set quite dramatically for some of the senses (e.g. 900 quotations to 90, and maybe biased towards certain senses). So I was wondering if we could have line 285 as an option for our experiments (but maybe for the future, not for now).

kasparvonbeelen commented 3 years ago

@mcollardanuy I was wondering the same. It's just that for these experiments, we care a lot about the time a quotation was written, and if we relax this condition in line 285 we can't really know? But I agree it's sensible to make it optional.

BarbaraMcG commented 3 years ago

I agree with both of you that it would be interesting to try both options and see what the results say