Adding information about how many subword tokens make up a word

Issue

There is no interim output from MinimalPair analysis that contains information about how many subword tokens each word was aggregated over.

Motivation

Having this information would be helpful if we want to filter out instances where predictability was estimated over multiple tokens (e.g., from Newman et al.: "to enable a fairer comparison between LMs and masked LMs, we only consider lemma where both inflections are in the wordpiece vocabulary of the models").

To do

Create a column in byROI that indicates the number of subword tokens that were summed over for each word. This will require editing src/analysis/MinimalPair.py.
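
A rough sketch of the idea, not the actual code in src/analysis/MinimalPair.py: while summing token-level scores into a word-level score, also count how many tokens went into each sum. The row layout and every name here (`aggregate_by_word`, `sentid`, `word_pos`, `score`, `num_subwords`) are hypothetical and chosen only for illustration; the real byROI columns may differ.

```python
def aggregate_by_word(token_rows):
    """Sum token-level scores per word and count the subword tokens summed over.

    Assumes (hypothetically) that each row is a dict with a sentence id, a
    word position, the word itself, and a per-token score.
    """
    words = {}
    for row in token_rows:
        key = (row["sentid"], row["word_pos"])
        if key not in words:
            words[key] = {
                "sentid": row["sentid"],
                "word_pos": row["word_pos"],
                "word": row["word"],
                "score": 0.0,
                "num_subwords": 0,
            }
        words[key]["score"] += row["score"]
        words[key]["num_subwords"] += 1  # the proposed new column
    # Dicts keep insertion order, so words come back in first-seen order.
    return list(words.values())


# Toy input: "authors" is split into two wordpiece tokens.
token_rows = [
    {"sentid": 0, "word_pos": 0, "word": "the", "token": "the", "score": 2.0},
    {"sentid": 0, "word_pos": 1, "word": "authors", "token": "author", "score": 5.0},
    {"sentid": 0, "word_pos": 1, "word": "authors", "token": "##s", "score": 1.5},
]

by_word = aggregate_by_word(token_rows)

# Keep only words whose score comes from a single token.
single_token_only = [w for w in by_word if w["num_subwords"] == 1]
```

With a column like this in place, the filtering described under Motivation becomes a one-line condition on `num_subwords`.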

forrestdavis / NLPScholar