cpllab / syntaxgym-core

DEPRECATED: Command-line tool and Python API for targeted syntactic evaluation of language models
MIT License

Test suite metrics #4

Closed AnneBeyer closed 3 years ago

AnneBeyer commented 3 years ago

Are you working on adding other metrics? I'd like to be able to compute the mean surprisal for regions that may differ in length. So far, my approach is to change the hard-coded metric in predictions.py (line 171), but that is not very elegant and does not work for the * version (aggregate over all regions). Do you have any ideas on how to solve this in a less hacky way?
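To make the difference concrete, this is roughly the aggregation I have in mind (an illustrative sketch only, not the actual predictions.py code; the surprisal values are made up):

```python
from statistics import mean

def region_metric(token_surprisals, metric="sum"):
    """Aggregate a region's per-token surprisals into one value (sketch)."""
    if metric == "sum":
        return sum(token_surprisals)
    if metric == "mean":
        return mean(token_surprisals)
    raise ValueError(f"Unknown metric: {metric}")

# With regions of different length, the summed surprisal grows with length,
# while the mean stays comparable across conditions.
two_token_region = [3.1, 2.4]
three_token_region = [3.0, 2.2, 2.6]
print(region_metric(two_token_region, "mean"),
      region_metric(three_token_region, "mean"))
```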

hans commented 3 years ago

Hi, mean surprisal (or perplexity, I suppose?) is a good idea! We can implement this.
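For concreteness, the connection I have in mind: per-token perplexity is just the exponentiated mean surprisal. A toy sketch, with the log base left as an explicit assumption:

```python
import math

# Illustrative only: per-token surprisals for one region (made-up numbers).
surprisals = [3.0, 2.2, 2.6]

mean_surprisal = sum(surprisals) / len(surprisals)

# If surprisal is log2-based (bits), perplexity is 2 ** mean surprisal;
# if it is natural-log based (nats), use math.exp instead.
perplexity_bits = 2 ** mean_surprisal
perplexity_nats = math.exp(mean_surprisal)
print(mean_surprisal, perplexity_bits, perplexity_nats)
```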


The test suites we designed all have matched lengths in the critical measurement regions, I believe. Just curious -- are you willing to share an example test where it is not possible to match lengths?

> does not work for the * version (aggregate over all regions)

Are you currently using a test which makes a * reference? I see that there can be different implementations here, e.g.

If you have opinions on this, let me know!

AnneBeyer commented 3 years ago

Hi, we are currently studying entity chains, and we would like to test something like

region1: The woman went to the store because
region2: she/the woman
region3: was out of coffee

where it would be interesting to look at differences in region 2, 3, and/or the total perplexity of the two conditions.
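In suite terms, I'm imagining an item roughly like the following (sketched as a Python dict; the field names reflect my reading of the suite format and may not be exact):

```python
# One item in the two conditions; "pronoun" vs. "full_np" are my own labels.
item = {
    "item_number": 1,
    "conditions": [
        {
            "condition_name": "pronoun",
            "regions": [
                {"region_number": 1, "content": "The woman went to the store because"},
                {"region_number": 2, "content": "she"},
                {"region_number": 3, "content": "was out of coffee"},
            ],
        },
        {
            "condition_name": "full_np",
            "regions": [
                {"region_number": 1, "content": "The woman went to the store because"},
                {"region_number": 2, "content": "the woman"},
                {"region_number": 3, "content": "was out of coffee"},
            ],
        },
    ],
}
```

Region 2 is the one whose length differs across conditions (one token vs. two), which is why a length-normalised metric would help.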

As we always have the same number of regions, taking the sum or mean of region-level surprisal means should not make a difference (if I'm not missing anything here?). But it could give different results than the summed surprisals aggregated over all regions normalized by the summed number of tokens in each region, right?
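A toy example of the two aggregations I mean, with made-up surprisal values:

```python
from statistics import mean

# Made-up per-token surprisals for the three regions of one sentence.
regions = [
    [4.0, 2.0, 3.0, 1.0, 2.0, 3.0, 2.0],  # region 1
    [5.0, 3.0],                            # region 2
    [2.0, 4.0, 3.0, 2.0],                  # region 3
]

# Option A: aggregate the region-level means (macro average over regions).
macro = mean(mean(r) for r in regions)

# Option B: pool all tokens, i.e. total surprisal / total token count
# (micro average over tokens).
micro = sum(sum(r) for r in regions) / sum(len(r) for r in regions)

print(macro, micro)  # these generally differ when regions have unequal lengths
```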

Do you have any thoughts on what would make more sense in this context?

hans commented 3 years ago

Hi @AnneBeyer, after taking a look at the code, we actually already support a mean metric, if only unofficially! This should be as simple as changing meta.metric to mean in your test suite spec. With this metric (as with any metric), * in prediction expressions will refer to the sum across the region-level metric values for that sentence.
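Concretely, the only change should be in the suite's meta block, e.g. (fragment only, other fields omitted):

```python
suite = {
    "meta": {
        # ... other meta fields ...
        "metric": "mean",   # "sum" in the existing suites
    },
    # "region_meta", "predictions", "items", ...
}
```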


> As we always have the same number of regions, taking the sum or mean of region-level surprisal means should not make a difference (if I'm not missing anything here?). But it could give different results than the summed surprisals aggregated over all regions normalized by the summed number of tokens in each region, right?

You're right, I didn't list the options correctly in my previous message. The critical difference is between macro-averaging among regions and micro-averaging among all tokens within the regions, as you described.
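In symbols, for R regions where region r has n_r tokens with surprisals s_{r,i}:

```latex
\[
\text{macro} \;=\; \frac{1}{R}\sum_{r=1}^{R}\frac{1}{n_r}\sum_{i=1}^{n_r} s_{r,i}
\qquad
\text{micro} \;=\; \frac{\sum_{r=1}^{R}\sum_{i=1}^{n_r} s_{r,i}}{\sum_{r=1}^{R} n_r}
\]
```

The two coincide only when every region has the same number of tokens.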

> where it would be interesting to look at differences in region 2, 3, and/or the total perplexity of the two conditions.

I don't think I fully understand the test. I can understand why you'd be interested in region 2 differences, since this could test a sort of coreference capacity of a model. But what is critical to measure in region 3? And what is being tested when comparing all region means across conditions?

Please don't feel obligated to explain everything at the moment -- I can also read your paper when it comes out :) But if you think it might have some bearing on the ideal behavior with predictions involving *, I'll have to know a little more about your test to provide input.

AnneBeyer commented 3 years ago

> This should be as simple as changing meta.metric to mean in your test suite spec.

Almost: in predictions.py (line 171), only sum is accepted. I've created a pull request with a suggested fix; please check whether it makes sense (@hans).
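The idea behind the PR is simply to dispatch on the metric name instead of special-casing sum; roughly (simplified, not the literal patch):

```python
from statistics import mean

# Look the metric up instead of accepting only "sum".
METRIC_FNS = {"sum": sum, "mean": mean}

def aggregate(values, metric):
    if metric not in METRIC_FNS:
        raise ValueError(f"Unsupported metric: {metric}")
    return METRIC_FNS[metric](values)
```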

And sure, I'll be happy to share more of our ideas once the paper is ready. :)

hans commented 3 years ago

I see. Thanks for the PR and good luck! I'll push a new release including your fix right now.