LSYS / LexicalRichness

:smile_cat: :speech_balloon: A module to compute textual lexical richness (aka lexical diversity).
http://lexicalrichness.readthedocs.io/
MIT License

Possible error with Yule's I depending on length #79

Closed mreygal closed 1 year ago

mreygal commented 1 year ago

DESCRIPTION: I encountered what may be a bug when using yulei with a .txt file. Yule's I score appears to depend on the length of the text, which it should not. Yule's K is probably affected as well.
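For context, here is my understanding of the standard definitions (hedged; N is the total number of tokens and V(i) the number of word types occurring exactly i times):

```
M1 = N = sum over i of i * V(i)      (total token count)
M2 = sum over i of i^2 * V(i)        (sum of squared frequencies)
Yule's K = 10^4 * (M2 - M1) / (M1 * M1)
Yule's I = (M1 * M1) / (M2 - M1)     (inversely related to K)
```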

MY MACHINE AND SETUP:

TEXT USED: As an example, I used Don Quijote in plain text (found here).

EXPECTED RESULTS: A Yule's I score somewhere between 40 and 70, in accordance with the lexical richness of the text.

ACTUAL RESULTS: I ran a few tests, and these are the results I got:

It seems the longer the text, the lower its Yule I score will be.

STEPS TO REPRODUCE:

1. Use the terminal to go to a directory on your machine: "cd name_of_directory"
2. Create a Python virtual environment inside that directory: "virtualenv venv"
3. Activate the virtual environment: "source venv/bin/activate"
4. Install LexicalRichness in it: "pip3 install lexicalrichness"
5. Unzip the attached unzip_this.zip file; inside you will find two files. Place lexical.py and example.txt (which is the Don Quijote file from here) in the same directory where the virtual environment is activated.
6. From that directory, run the following in your terminal: "python3 lexical.py"
7. Check the results printed in the terminal, and edit example.txt to test different sections of the text so that they match my tests found above.

SUGGESTION: Please check the lexical.py file inside unzip_this.zip for errors, since the way I implemented the text-reading code could be what is causing the problem.

unzip_this.zip

LSYS commented 1 year ago

Try a completely different implementation of Yule's I, like the one here (https://swizec.com/blog/measuring-vocabulary-richness-with-python/), and you will get similar output (see below; the differences arise from tokenization). Also, the fact that certain measures should be "robust" to differences in text length does not imply that the measure literally remains unchanged as you vary the size of the text.

from nltk.stem.porter import PorterStemmer
from itertools import groupby

def words(entry):
    # Split on whitespace, strip surrounding digits/punctuation, drop empties
    return filter(lambda w: len(w) > 0,
                  [w.strip("0123456789!:,.?(){}[]") for w in entry.split()])

def yule(entry):
    # Yule's I measure (the inverse of Yule's K measure):
    # a higher number means greater diversity, i.e. a richer vocabulary
    d = {}
    stemmer = PorterStemmer()
    for w in words(entry):
        w = stemmer.stem(w).lower()
        try:
            d[w] += 1
        except KeyError:
            d[w] = 1

    # M1: number of distinct (stemmed) word types
    M1 = float(len(d))
    # M2: sum of squared type frequencies (grouped by frequency class)
    M2 = sum([len(list(g)) * (freq ** 2) for freq, g in groupby(sorted(d.values()))])

    try:
        return (M1*M1)/(M2-M1)
    except ZeroDivisionError:
        return 0

(Source: https://swizec.com/blog/measuring-vocabulary-richness-with-python/)
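To see concretely that Yule's I shifts with sample size even when the underlying vocabulary distribution never changes, here is a small stdlib-only check. It mirrors the implementation above (M1 taken as the number of distinct types, M2 as the sum of squared type frequencies, no stemming); the vocabulary and sample sizes are made up purely for illustration:

```python
import random
from collections import Counter

def yules_i(tokens):
    # M1: number of distinct word types; M2: sum of squared type frequencies
    freqs = Counter(tokens)
    m1 = len(freqs)
    m2 = sum(f * f for f in freqs.values())
    try:
        return (m1 * m1) / (m2 - m1)
    except ZeroDivisionError:
        return 0.0

# Deterministic synthetic "text": 20,000 tokens drawn uniformly from a
# 500-word vocabulary (sizes chosen only for illustration)
rng = random.Random(42)
vocab = ["word%d" % i for i in range(500)]
tokens = [rng.choice(vocab) for _ in range(20000)]

# Yule's I falls as the sample grows: longer samples repeat types more,
# so M2 grows much faster than M1
for n in (1000, 5000, 20000):
    print(n, round(yules_i(tokens[:n]), 2))
```

The same qualitative behavior shows up with any fixed vocabulary: each longer prefix yields a lower score, which matches what you observed with the Don Quijote excerpts.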
