common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences
52 stars 52 forks source link

Add rule to remove content between brackets #199

Closed HarikalarKutusu closed 1 year ago

HarikalarKutusu commented 1 year ago

This PR adds a new rule to optionally remove brackets from sentences before they are checked for other rules.

It will handle the issues mentioned in #198 ...

It will work like "replaces", before replaces and before any rule-check, so you can also run it with --no_check. This way you get all sentences but with content in parentheses removed.

You specify opening and closing brackets as a list, and they are executed from top to bottom.

[
  ["(", ")"],
  ["[", "]"],
]

Tested on Turkish Wikipedia, with the same rules turning this feature off and on (no blacklist & no export limitation - i.e. 3 sentence/article limit removed): 5.8% increase in the number of total possible sentences (947,526 => 1,002,522)

EDITS:

MichaelKohler commented 1 year ago

Sorry for the delay here and thanks for submitting this PR. I hope I can get some time in the next few days to review.

HarikalarKutusu commented 1 year ago

As specified above, there are two problems with this implementation:

  1. It does not take care of nested parentheses, it only handles the inner one. One could add a second regex call, but it might be an unnecessarily costly one.
  2. This might result in what you call "broken whitespace" cases if the parentheses are not surrounded by spaces.

Any solution you can find during your examination is very much welcome.

HarikalarKutusu commented 1 year ago

Let me try to comment on these first:

What's the reason for preventing multi-line matches? While generally we probably shouldn't have new-lines anyway, if we do I probably would expect it to work nevertheless.

OK, I missed the following line in the code. I thought we were working on the whole "text" field. Actually the segmenter works on paragraphs. My mistake. So it should not be needed.

https://github.com/common-voice/cv-sentence-extractor/blob/fef72ab03a5e514a8e6d931fdb7f3bc5e6c1164d/src/loaders/wikipedia.rs#L38

But if there were new-lines, I wouldn't like to have too many paragraphs removed because of typos.

In my opinion these "turn-on-turn-off" style rules should cover the full use case and should not need to be combined with other rules. Therefore I would argue that it should take care of multi paranethesis with eagerly matching them, resulting in a sentence without parenthesis.

One solution would be to run it in a loop and check if the result is the same as the input to break the loop.

Additionally I would also say that it should take care of whitespaces correctly so it's not needed to combine it with other rules.

We can replace (anything) with space and remove any double spaces with regex afterward. But there might be missed cases.

maybe rename it to something more generic

A more generic name would be "bracket". I quickly examined the Wikipedia article. There seem to be a huge number of possibilities (see the Unicode list in the article). The normal parentheses are abundant in Latin script, but I cannot check if the code works in non-Latin languages. Francis also said he did not work with Chinese or others. But as far as I can see, he recently updated commonvoice-utils repo to handle some like Hangul.

This part really needs teamwork - which is why I did not go that way.

A second thing: I'm not sure it can be handled with a simple regex as you mentioned. Probably we would need to feed pairs.

HarikalarKutusu commented 1 year ago

I suppose you have also tested this with your other efforts for Turkish? Did you notice anything off by now?

I tested it with () and [] combined, without any other replacements/rules, and compared the result with original sentences. All matching brackets were removed, and everything else was intact. There were a few sentences out of a million with a single "(" left because the original had non-matching ones, e.g. "This has ((non-matching) brackets".

One point on this: I do not allow these brackets along with my alphabet as I want them removed, so these out-of-control ones get eliminated.

HarikalarKutusu commented 1 year ago

@MichaelKohler, all is OK for now. I hope this might help the lower-resourced Wikis as sentence count and the text-corpus quality in CV for all.