Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License

Additional filter suggestion: remove lines with repeated content #8

Closed yvesscherrer closed 2 years ago

yvesscherrer commented 3 years ago

Not sure how useful this is, but it's an idea that came to mind when filtering backtranslations.

Sentences like the following are probably low-quality and should be removed:

Ahora bien, el que quiera ser el primero entre ustedes deberá ser su servidor, diferentes plantas para ser un buen pescador y un buen pescador para ser un buen pescador y un buen pescador para ser un buen pescador y un buen pescador para ser un buen pescador

Parameters would be:

radinplaid commented 2 years ago

I have made a prototype that seems to work well in practice:

def find_repeats(x, lengths_to_check=[1, 2, 3, 4], min_repeat_length=3):
    """
    Identify repeated phrases, for use in detecting stuttering in NMT output.

    Arguments:
    x (str or list(str), required) -- input to search for repeats in
    lengths_to_check (list(int) or int, default [1, 2, 3, 4]) -- lengths of the
        token sequences to search for
    min_repeat_length (int, default 3) -- minimum number of times the repeat
        must occur
    """
    # Input validation
    if isinstance(x, str):
        # x is not tokenized; tokenize by whitespace
        x_split = x.split()
    elif isinstance(x, list):
        # x is already tokenized
        x_split = x
    else:
        raise TypeError("Input must be str or list(str)")

    # If lengths_to_check is an int, make it a list so it can be iterated over
    if isinstance(lengths_to_check, int):
        lengths_to_check = [lengths_to_check]

    # Loop over each token in the string, from left to right
    for ind in range(len(x_split)):
        # Check for phrase repeats of length i for i in lengths_to_check
        for phrase_len in lengths_to_check:
            if ind + phrase_len < len(x_split):
                if x_split[ind:ind + phrase_len] == x_split[ind + phrase_len:ind + 2 * phrase_len]:
                    # We have a match; count how many times the phrase repeats
                    num_repeats = 1
                    found_match = True
                    match_idx = ind + phrase_len
                    while found_match:
                        if x_split[ind:ind + phrase_len] == x_split[match_idx:match_idx + phrase_len]:
                            num_repeats += 1
                            match_idx += phrase_len
                        else:
                            found_match = False
                    # Return as soon as one sufficiently long repeat is found;
                    # do not look for all matches
                    if num_repeats >= min_repeat_length:
                        return {
                            "match": ' '.join(x_split[ind:ind + phrase_len]),
                            "num_repeats": num_repeats,
                            "repeat_length": phrase_len,
                        }

    # No matches were found
    return None

>>> find_repeats("Ahora bien, el que quiera ser el primero entre ustedes deberá ser su servidor, diferentes plantas para ser un buen pescador y un buen pescador para ser un buen pescador y un buen pescador para ser un buen pescador y un buen pescador para ser un buen pescador", lengths_to_check=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
{'match': 'para ser un buen pescador y un buen pescador',
 'num_repeats': 3,
 'repeat_length': 9}

Should I modify it to be an OpusFilter filter and submit a pull request?
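
In case it helps with a PR, here is a rough sketch of how the function above might be wrapped as a filter, assuming the usual OpusFilter pattern of subclassing FilterABC and implementing score and accept; the class name and the scoring convention are placeholders, not an actual OpusFilter API:

from opusfilter import FilterABC

class RepeatedPhraseFilter(FilterABC):
    """Sketch: filter out segment pairs containing stuttering-style repeats"""

    def __init__(self, lengths_to_check=(1, 2, 3, 4), min_repeat_length=3, **kwargs):
        self.lengths_to_check = list(lengths_to_check)
        self.min_repeat_length = min_repeat_length
        super().__init__(**kwargs)

    def score(self, pairs):
        # One score per segment pair: the highest repeat count found in any
        # segment, using the find_repeats function defined above
        for pair in pairs:
            counts = [0]
            for segment in pair:
                match = find_repeats(segment, self.lengths_to_check, self.min_repeat_length)
                if match:
                    counts.append(match["num_repeats"])
            yield max(counts)

    def accept(self, score):
        # Accept the pair only if no segment repeats a phrase
        # min_repeat_length times or more
        return score < self.min_repeat_length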

svirpioj commented 2 years ago

Thanks! I had some extra backtranslated data from Yves to test this, and indeed it seems to be working nicely (the precision at least looks good; recall is of course more difficult to estimate). It found 4778 matches in 164725 segments.

So sure, go on and create a PR! (Some instructions here.) I can also help, but it's better if you make at least the first commit so you get the credit 🙂
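
(For anyone who wants to reproduce a count like this with the prototype above, a plain loop over a file with one segment per line is enough; the file name below is just a placeholder.)

# Count segments flagged by find_repeats; 'backtranslations.txt' is a placeholder path
total = 0
flagged = 0
with open("backtranslations.txt", encoding="utf-8") as infile:
    for line in infile:
        total += 1
        if find_repeats(line.strip()) is not None:
            flagged += 1
print(f"Found {flagged} matches from {total} segments")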

svirpioj commented 2 years ago

I made a slightly different solution in #35. Dunno if it's better than the one above, but at least it required less code and should work directly with languages without marked word boundaries.
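
For comparison, the idea of a character-level check that needs no tokenization can be illustrated with a short regex sketch; this is only in the spirit of #35, not the code from that PR, and the length and repetition thresholds are arbitrary:

import re

# A backreference finds any substring of 3 to 100 characters that is immediately
# repeated at least twice more (three consecutive occurrences in total), so no
# word boundaries are needed.
REPEAT_RE = re.compile(r'(.{3,100}?)\1{2,}')

def has_character_repeats(text):
    return REPEAT_RE.search(text) is not None

print(has_character_repeats("abc abc abc abc"))            # True
print(has_character_repeats("a perfectly normal sentence"))  # False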