forrestdavis / NLPScholar

Tools for training an NLP Scholar
GNU General Public License v3.0

Adding information about how many sub-word tokens make up a word #10

Open grushaprasad opened 1 week ago

grushaprasad commented 1 week ago

Issue

The MinimalPair analysis currently produces no interim output that records how many subword tokens each word was aggregated over.

Motivation

Having this information would let us filter out instances where predictability was estimated over multiple tokens. For example, Newman et al. write: "to enable a fairer comparison between LMs and masked LMs, we only consider lemma where both inflections are in the wordpiece vocabulary of the models."

To do

Create a column in byROI that indicates how many subword tokens were summed over for each word. This will require editing src/analysis/MinimalPair.py.
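
A rough sketch of the idea, not the actual code in src/analysis/MinimalPair.py: assuming the per-token results live in a pandas DataFrame with one row per subword token, the count can be computed in the same groupby that sums the token scores. The column names (`sentid`, `wordpos`, `token`, `surp`, `num_tokens`) and the function name are hypothetical and would need to be matched to whatever the analysis really uses.

```python
import pandas as pd


def aggregate_by_word(token_df: pd.DataFrame) -> pd.DataFrame:
    """Sum token-level surprisal per word and record how many tokens were summed.

    Assumes (hypothetically) one row per subword token, with `sentid` and
    `wordpos` identifying the word that each token belongs to.
    """
    return token_df.groupby(["sentid", "wordpos"], as_index=False).agg(
        surp=("surp", "sum"),           # word-level score: sum over subword tokens
        num_tokens=("token", "count"),  # number of subword tokens in that sum
    )


# Toy input: "unbelievable" is split into three wordpiece tokens.
tokens = pd.DataFrame(
    {
        "sentid": [0, 0, 0, 0, 0],
        "wordpos": [0, 1, 1, 1, 2],
        "token": ["the", "un", "##believ", "##able", "cat"],
        "surp": [2.0, 5.0, 3.0, 1.5, 4.0],
    }
)

by_word = aggregate_by_word(tokens)

# Keep only words whose score came from a single token.
single_token_only = by_word[by_word["num_tokens"] == 1]
```

With a column like this in place, the filtering described under Motivation becomes a one-line selection on `num_tokens`, as in the last line above.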