HLasse / TextDescriptives

A Python library for calculating a large variety of metrics from text
https://hlasse.github.io/TextDescriptives/
Apache License 2.0

quality_test/contains doesn't function #346

Closed dvirnimrod closed 4 months ago

dvirnimrod commented 4 months ago

How to reproduce the behaviour

I am trying to set new quality thresholds, following the documentation (using "set_quality_thresholds"). When I run a quality test, I see that all the fields I tried to change did change, except for the "contains" field, which remains {"lorem_ipsum": False}. I tried several workarounds but couldn't manage to change this specific test, no matter how simple the dictionary I tried to replace it with. Moreover, I ran the default test on a text containing "lorem_ipsum" and it passed, so nothing works (for me) with this test... Am I missing something?

Your Environment

HLasse commented 4 months ago

Can you add a code snippet that reproduces the behaviour?

KennethEnevoldsen commented 4 months ago

@dvirnimrod when I try to reproduce this, I get the following behaviour (Python 3.8):

import textdescriptives as td

td.__version__
# 2.8.0
import spacy
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
quality_pipe = nlp.add_pipe("textdescriptives/quality")
docs = nlp.pipe(["lorem ipsum"])
doc = next(docs)
doc._.passed_quality_check
# False
doc._.quality
# QualityOutput(
#   passed=False, ...
#   contains={'lorem ipsum': ThresholdsOutput(value=1.0, passed=False, threshold=False)}, ...

dvirnimrod commented 4 months ago

Hi, thanks for the quick response!

Here's a code snippet for example:

import textdescriptives as td
import spacy
from spacy.cli import download

QUALITY_THRESHOLDS = td.QualityThresholds(
    n_stop_words=(None, None),
    alpha_ratio=(0.6, None),
    mean_word_length=(3, 10),
    doc_length=(1, 1000),
    symbol_to_word_ratio={"@": (None, 0.3)},
    proportion_ellipsis=(None, None),
    proportion_bullet_points=(None, 0.7),
    contains={"fake": False},
    duplicate_line_chr_fraction=(None, 0.2),
    duplicate_paragraph_chr_fraction=(None, 0.2),
    duplicate_ngram_chr_fraction={
        "5": (None, 0.15),
        "6": (None, 0.14),
        "7": (None, 0.13),
        "8": (None, 0.12),
        "9": (None, 0.11),
        "10": (None, 0.1),
    },
    top_ngram_chr_fraction={"2": (None, 0.2), "3": (None, 0.18), "4": (None, 0.16)},
    oov_ratio=(None, 0.3)
)

download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")
quality_pipe = nlp.add_pipe("textdescriptives/quality")
quality_pipe.set_quality_thresholds(QUALITY_THRESHOLDS)

text = "This is fake @@@@@"
doc = nlp(text)
print(doc._.quality)

And here's the output:

passed=True 
    n_stop_words=ThresholdsOutput(value=2.0, passed=True, threshold=(None, None)) 
    alpha_ratio=ThresholdsOutput(value=0.75, passed=True, threshold=(0.6, None)) 
    mean_word_length=ThresholdsOutput(value=3.75, passed=True, threshold=(3.0, 10.0)) 
    doc_length=ThresholdsOutput(value=4.0, passed=True, threshold=(1.0, 1000.0)) 
    symbol_to_word_ratio={'#': ThresholdsOutput(value=0.0, passed=True, threshold=None)} 
    proportion_ellipsis=ThresholdsOutput(value=0.0, passed=True, threshold=(None, None)) 
    proportion_bullet_points=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.7)) 
    contains={'lorem ipsum': ThresholdsOutput(value=0.0, passed=True, threshold=None)} 
    duplicate_line_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)) 
    duplicate_paragraph_chr_fraction=ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)) 
    duplicate_ngram_chr_fraction={'5': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.15)), '6': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.14)), '7': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.13)), '8': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.12)), '9': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.11)), '10': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.1))} 
    top_ngram_chr_fraction={'2': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.2)), '3': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.18)), '4': ThresholdsOutput(value=0.0, passed=True, threshold=(None, 0.16))} 
    oov_ratio=ThresholdsOutput(value=0.25, passed=True, threshold=(None, 0.3))

As you can see, other attributes that I've set are updated to a new value (like "alpha_ratio" and "doc_length"), but the attributes "contains" and "symbol_to_word_ratio" haven't...

KennethEnevoldsen commented 4 months ago

Hi @dvirnimrod. td.QualityThresholds has defaults for these. You can disable them, e.g. by setting:

    ...
    contains = {} # nothing should be checked
    symbol_to_word_ratio = {} 
    ...

Edit: Ah, sorry, it seems I misread the code. @HLasse caught it though.

HLasse commented 4 months ago

Ah, I see. It seems that .set_quality_thresholds updates the thresholds correctly, but does not update self.contains and self.symbols (which it should). I'll take a look.

EDIT: Fixed in #353
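
For readers who hit the same symptom, the failure mode described above can be sketched with a toy class. This is not the textdescriptives source: ToyThresholds, ToyQuality, set_thresholds_buggy and set_thresholds_fixed are hypothetical names, and the pass rule (found == allowed) is an assumption made for illustration. The sketch only shows how a setter that replaces the thresholds but leaves a cached list of strings untouched ends up checking the old keys against the new thresholds.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyThresholds:
    """Hypothetical stand-in for td.QualityThresholds (two fields only)."""

    doc_length: Tuple[Optional[float], Optional[float]] = (10, 100_000)
    contains: Dict[str, bool] = field(
        default_factory=lambda: {"lorem ipsum": False}
    )


class ToyQuality:
    """Hypothetical component; not the real textdescriptives class."""

    def __init__(self, thresholds: ToyThresholds) -> None:
        self.thresholds = thresholds
        # Derived state cached once, at construction time.
        self.contains: List[str] = list(thresholds.contains)

    def set_thresholds_buggy(self, thresholds: ToyThresholds) -> None:
        # Thresholds are replaced, but the cached self.contains goes stale.
        self.thresholds = thresholds

    def set_thresholds_fixed(self, thresholds: ToyThresholds) -> None:
        # Refresh the cached list together with the thresholds.
        self.thresholds = thresholds
        self.contains = list(thresholds.contains)

    def check(self, text: str) -> Dict[str, bool]:
        """Return {string: passed} for every string the component looks for."""
        results = {}
        for s in self.contains:
            found = s in text
            allowed = self.thresholds.contains.get(s)
            # Assumed rule: no threshold for this key means the check passes.
            results[s] = allowed is None or found == allowed
        return results


new = ToyThresholds(doc_length=(1, 1000), contains={"fake": False})
text = "This is fake @@@@@"

buggy = ToyQuality(ToyThresholds())
buggy.set_thresholds_buggy(new)
buggy_result = buggy.check(text)  # {'lorem ipsum': True}

fixed = ToyQuality(ToyThresholds())
fixed.set_thresholds_fixed(new)
fixed_result = fixed.check(text)  # {'fake': False}
```

In the buggy variant the old key is still looked up but has no matching threshold, which mirrors the threshold=None, passed=True entry for 'lorem ipsum' in the output posted earlier in this thread.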

dvirnimrod commented 4 months ago

Great! Thank you guys :)