HLasse / TextDescriptives

A Python library for calculating a large variety of metrics from text
https://hlasse.github.io/TextDescriptives/
Apache License 2.0
308 stars 23 forks source link

quality.duplicate_ngram_fraction(...) expected values ? #343

Open sondalex opened 5 months ago

sondalex commented 5 months ago

I have added unit test for duplicate_ngram_fraction on my fork's branch add_quality_unittest . The test fails which makes wonder whether my test expected values are wrong or whether the current behaviour is not intended.

The report of my test pytest tests/test_quality.py -k test_duplicate_ngram_fraction:

============================= test session starts ==============================
platform linux -- Python 3.10.12, pytest-8.1.1, pluggy-1.4.0
rootdir: /home/sondalex/alex/Source/TextDescriptives
configfile: pyproject.toml
collected 50 items / 49 deselected / 1 selected

tests/test_quality.py F                                                  [100%]

=================================== FAILURES ===================================
________________________ test_duplicate_ngram_fraction _________________________

nlp = <spacy.lang.en.English object at 0x7efec116b730>

    def test_duplicate_ngram_fraction(nlp):
        text = "This is a test. This is another test."
        doc = nlp(text)
        span = doc[0:]
        result = duplicate_ngram_fraction(span, (2, 3))

        expected = {2: (len("This is") * 2 + len("test.") * 2) / len(span.text), 3: 0}
>       assert result == expected
E       assert {2: 0.6756756...56757, 3: 0.0} == {2: 0.6486486486486487, 3: 0}
E         
E         Omitting 1 identical items, use -vv to show
E         Differing items:
E         {2: 0.6756756756756757} != {2: 0.6486486486486487}
E         Use -v to get more diff

tests/test_quality.py:161: AssertionError
=========================== short test summary info ============================
FAILED tests/test_quality.py::test_duplicate_ngram_fraction - assert {2: 0.67...
======================= 1 failed, 49 deselected in 2.18s =======================

With a bit more debug information:

------------------------------------------------------------------------------------ Captured stdout call -------------------------------------------------------------------------------------
Spans of ngrams with count higher than 1:
This is
This is
test.
test.

-------------------

Spans tagged as duplicates:
range=(0, 2), span=This is
range=(3, 7), span=test. This is
range=(8, 10), span=test.

...

(result for n=2)

Should the spans be equivalent across the two for loop blocks ?

Respectively, 1) https://github.com/HLasse/TextDescriptives/blob/93b1d59278dc86cd4e8df80247e8a1a5afda2f59/src/textdescriptives/components/quality.py#L262-L270 2) https://github.com/HLasse/TextDescriptives/blob/93b1d59278dc86cd4e8df80247e8a1a5afda2f59/src/textdescriptives/components/quality.py#L274-L276

In this case, the function could be modified as such:

def duplicate_ngram_fraction(
    span: Union[Span, Doc],
    ngram_range: Tuple[int, int],
) -> Dict[int, float]:
    """
    ...
    """
    chr_len = len(span.text)
    if chr_len == 0:
        return {n: 0.0 for n in range(ngram_range[0], ngram_range[1] + 1)}
    shingles_count = span_ngrams(span, ngram_range)
    duplicate_chr_fraction = {}
    for ngram_size, ngrams in shingles_count.items():
        duplicate_chars = 0
        for ngram, count in ngrams.items():
            if count["count"] > 1:
                for ngram_span in count["span"]:
                    duplicate_chars += ngram_span.end_char - ngram_span.start_char

        duplicate_chr_fraction[ngram_size] = duplicate_chars / chr_len
    return duplicate_chr_fraction
HLasse commented 5 months ago

Thanks for the report! @KennethEnevoldsen, this is something you and Dan worked on - can you take a look?

KennethEnevoldsen commented 5 months ago

Hi @sondalex, thanks for the report. The reason why we use the boolean array is to avoid counting duplicates twice. E.g. imagine the sentence: "a sentence. a sentence."

It is clearly all duplicates so n_duplicate_characters = n_characters and the fraction 1/1. However, you can get in a situation where you count some characters twice duplicate 2-gram ("a sentence", "sentence.") to get a n/1 fraction where n>1. A solution to this would simply be to create a boolean array of characters (instead of the current array of tokens) and set the respective characters to True (I believe this would also be faster).

However with your specific example: range=(3, 7), span=test. This is, does not seem to be a duplicate, which seems to a bug, though I can't see a clear point where that error should happen.

HLasse commented 5 months ago

Is this bug something you'd have time to take a closer look at, @KennethEnevoldsen? Otherwise, a PR is more than welcome, @sondalex!

HLasse commented 4 months ago

Kenneth is busy with MTEB sprint and I'm wrapping up my dissertation, so neither of us have the time to look into this right now. You're more than welcome to submit a PR if you find a solution for this, @sondalex, otherwise it might have to wait a while before we get the chance to take a deeper look.

sondalex commented 4 months ago

Hi @KennethEnevoldsen, thank you for your explanation and thank you @HLasse for following up. I am planning to look more in depth into this when I have some time. I will definitely PR once I find a solution.