LSYS / LexicalRichness

:smile_cat: :speech_balloon: A module to compute textual lexical richness (aka lexical diversity).
http://lexicalrichness.readthedocs.io/
MIT License

Possible error with Yule's I depending on length #79

Closed mreygal closed 1 year ago

mreygal commented 1 year ago

DESCRIPTION: I encountered what may be a bug when using yulei with a .txt file. Yule's I score appears to depend on the length of the text, which it should not. Yule's K is probably affected as well.
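For context, here is my understanding of the standard definitions (hedged; N is the total number of tokens and V(i) the number of word types occurring exactly i times):

```
M1 = N = sum over i of i * V(i)      (total token count)
M2 = sum over i of i^2 * V(i)        (sum of squared frequencies)
Yule's K = 10^4 * (M2 - M1) / (M1 * M1)
Yule's I = (M1 * M1) / (M2 - M1)     (inversely related to K)
```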

MY MACHINE AND SETUP:

TEXT USED: As an example, I used Don Quijote in plain text (found here).

EXPECTED RESULTS: A Yule's I score somewhere between 40 and 70, in accordance with the lexical richness of the text.

ACTUAL RESULTS: I ran a few tests, and these are the results I got:

It seems the longer the text, the lower its Yule I score will be.

STEPS TO REPRODUCE:

1. Use the terminal to go to a directory on your machine: "cd name_of_directory"
2. Create a Python virtual environment inside that directory: "virtualenv venv"
3. Activate the virtual environment: "source venv/bin/activate"
4. Install LexicalRichness in it: "pip3 install lexicalrichness"
5. Unzip the attached unzip_this.zip file; inside you will find two files. Place lexical.py and example.txt (which is the Don Quijote file from here) in the same directory where the virtual environment is activated.
6. From that directory, run the following in your terminal: "python3 lexical.py"
7. Check the results printed in the terminal, and edit example.txt to test different sections of the text so that they match my tests found above.

SUGGESTION: Please check the lexical.py file inside unzip_this.zip for errors, since the way I implemented the text-reading code could be what is causing the problem.

unzip_this.zip

LSYS commented 1 year ago

Try a completely different implementation of Yule's I, like the one here (https://swizec.com/blog/measuring-vocabulary-richness-with-python/), and you will get similar output (see below; the differences arise from tokenization). Also, the fact that certain measures should be "robust" to differences in text length does not imply that the measure literally remains unchanged as you vary the size of the text.

from nltk.stem.porter import PorterStemmer
from itertools import groupby

def words(entry):
    # Split on whitespace, strip surrounding digits/punctuation, drop empties
    return filter(lambda w: len(w) > 0,
                  [w.strip("0123456789!:,.?(){}[]") for w in entry.split()])

def yule(entry):
    # Yule's I measure (the inverse of Yule's K measure):
    # a higher number means greater diversity, i.e. a richer vocabulary
    d = {}
    stemmer = PorterStemmer()
    for w in words(entry):
        w = stemmer.stem(w).lower()
        try:
            d[w] += 1
        except KeyError:
            d[w] = 1

    # M1: number of distinct (stemmed) word types
    M1 = float(len(d))
    # M2: sum of squared type frequencies (grouped by frequency class)
    M2 = sum([len(list(g)) * (freq ** 2) for freq, g in groupby(sorted(d.values()))])

    try:
        return (M1*M1)/(M2-M1)
    except ZeroDivisionError:
        return 0

(Source: https://swizec.com/blog/measuring-vocabulary-richness-with-python/)
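To see concretely that Yule's I shifts with sample size even when the underlying vocabulary distribution never changes, here is a small stdlib-only check. It mirrors the implementation above (M1 taken as the number of distinct types, M2 as the sum of squared type frequencies, no stemming); the vocabulary and sample sizes are made up purely for illustration:

```python
import random
from collections import Counter

def yules_i(tokens):
    # M1: number of distinct word types; M2: sum of squared type frequencies
    freqs = Counter(tokens)
    m1 = len(freqs)
    m2 = sum(f * f for f in freqs.values())
    try:
        return (m1 * m1) / (m2 - m1)
    except ZeroDivisionError:
        return 0.0

# Deterministic synthetic "text": 20,000 tokens drawn uniformly from a
# 500-word vocabulary (sizes chosen only for illustration)
rng = random.Random(42)
vocab = ["word%d" % i for i in range(500)]
tokens = [rng.choice(vocab) for _ in range(20000)]

# Yule's I falls as the sample grows: longer samples repeat types more,
# so M2 grows much faster than M1
for n in (1000, 5000, 20000):
    print(n, round(yules_i(tokens[:n]), 2))
```

The same qualitative behavior shows up with any fixed vocabulary: each longer prefix yields a lower score, which matches what you observed with the Don Quijote excerpts.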
