code-kern-ai / bricks

Open-source natural language enrichments at your fingertips.
Apache License 2.0

[MODULE] - Sentence complexity #2

Open jhoetter opened 2 years ago

jhoetter commented 2 years ago

Please describe the module you would like to add to the content library
I know that some of the texts in my dataset are rather difficult to understand, and sentence complexity generally varies across my projects. I want to detect that.

Do you already have an implementation? If so, please share it here. For instance:

from typing import Dict, Any
import textstat

def setall(d, keys, value):
    # Assign the same value to every key in the iterable.
    for k in keys:
        d[k] = value

# Flesch reading ease scores roughly range from 0 to just above 121.
MAX_SCORE = 122
MIN_SCORE = 0

# Map each integer score to a human-readable complexity label.
OUTCOMES = {}
setall(OUTCOMES, range(90, MAX_SCORE), "very easy")
setall(OUTCOMES, range(80, 90), "easy")
setall(OUTCOMES, range(70, 80), "fairly easy")
setall(OUTCOMES, range(60, 70), "standard")
setall(OUTCOMES, range(50, 60), "fairly difficult")
setall(OUTCOMES, range(30, 50), "difficult")
setall(OUTCOMES, range(MIN_SCORE, 30), "very difficult")

def get_mapping_complexity(score):
    # Clamp out-of-range scores so the dictionary lookup never raises a KeyError.
    if score < MIN_SCORE:
        return OUTCOMES[MIN_SCORE]
    if score >= MAX_SCORE:
        return OUTCOMES[MAX_SCORE - 1]
    return OUTCOMES[int(score)]

def fn_sentence_complexity(record: Dict[str, Any]) -> str:
    text = record["text"]

    # Switch textstat to the record's language if one is provided.
    language = record.get("language")
    if language is not None:
        textstat.set_lang(language)

    sentence_complexity_score = textstat.flesch_reading_ease(text)
    sentence_complexity = get_mapping_complexity(sentence_complexity_score)
    return sentence_complexity
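
For reference, a minimal usage sketch, assuming textstat is installed (e.g. pip install textstat) and that the record carries a "text" and an optional "language" field as in the snippet above. The underlying Flesch reading ease score is higher for shorter sentences with fewer syllables per word, so a short, simple sentence should land in the "very easy" bucket:

record = {
    "text": "The quick brown fox jumps over the lazy dog.",
    "language": "en",
}

# Likely prints "very easy" for this short, simple sentence.
print(fn_sentence_complexity(record))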

Additional context -

SvenjaKern commented 1 year ago

I am wondering what is meant by complexity? Does it refer to the vocabulary, the morphology, the semantics, or the syntax? Is it a mix of all of them? Or is it measured against the language levels used for language learners? Maybe we can find out and add it to the README. If I wonder, maybe the clients will, too.