common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences

Replacements not considering whitespace #186

Closed: HarikalarKutusu closed this issue 1 year ago

HarikalarKutusu commented 1 year ago

I have several abbreviation entries in the replace array, such as:

  [" sf.", " sayfa"],
  [" dk.", " dakika"],
  [" ör.", " örnek"],

I put a space in front of them to prevent substring replacement, but somehow it did not prevent that. Looking at the code, which does a simple global string replace (and the rules are not pre-processed anywhere), I could not pinpoint the reason.

Here is a sample original text part:

..30 mermi kapasiteli yeni bir şarjör. 30 mermilik bu şarjörü...

And this is what it becomes:

..30 mermi kapasiteli yeni bir şarjörnek 30 mermilik bu şarjörü...

There are many samples of similar replacements; I recognized them while checking non-dictionary words... Very weird...
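As a side note, a minimal sketch of the behavior in question, assuming a plain str::replace over the raw rule strings (the sample text is the one above): the mid-word match only happens if the pattern's leading space is missing.

  fn main() {
      let text = "..30 mermi kapasiteli yeni bir şarjör. 30 mermilik bu şarjörü...";

      // With the leading space, "ör." inside "şarjör." is preceded by "j",
      // not a space, so nothing is replaced:
      assert_eq!(text.replace(" ör.", " örnek"), text);

      // Without the leading space, str::replace matches the raw substring
      // mid-word and produces exactly the corruption reported above:
      println!("{}", text.replace("ör.", "örnek"));
      // ..30 mermi kapasiteli yeni bir şarjörnek 30 mermilik bu şarjörü...
  }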

MichaelKohler commented 1 year ago

Simplified test case: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a7c7bf52aa926a63c0bbbc3d87eeecfe

HarikalarKutusu commented 1 year ago

I'm sure the Rust people have it covered :) Probably, in my case, other replacement items interfere (not that I could find one)... It shouldn't be the nltk code though; it did not run at that phase.

I'm at the final stage of blacklisting; after that, I'll dive deeper into this.

MichaelKohler commented 1 year ago

Yeah, as this is standard Rust behavior (probably a nice exercise on the side to figure out why that is the behavior), I think we'd need to handle this on our side. I can definitely see why we'd want to allow whitespace in the pattern so it does not interfere with occurrences where it should not be replaced. Potential approaches to fix this:

- is there a way to replace only on word boundaries? Is this what we'd eventually want?
- should replacements additionally support Regex and therefore we could handle these cases with Regex?
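As an illustration of the first approach, a minimal sketch using the regex crate (not the current code; the Turkish sample is constructed from the examples above). Unicode \b only matches at a word/non-word transition, so an anchored pattern leaves mid-word occurrences alone:

  use regex::Regex;

  fn main() {
      let text = "30 mermilik şarjör. Ayrıntı için bkz. ör. sayfa 30.";
      // No boundary exists between "j" and "ö", so "ör." inside "şarjör."
      // is left alone, while the standalone abbreviation is replaced:
      let re = Regex::new(r"\bör\.").unwrap();
      println!("{}", re.replace_all(text, "örnek"));
      // 30 mermilik şarjör. Ayrıntı için bkz. örnek sayfa 30.
  }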

I'll leave this open as an improvement; I think this would come in useful.

HarikalarKutusu commented 1 year ago

> is there a way to replace only on word boundaries? Is this what we'd eventually want?

Not necessarily. In my scripts for Language Models, I had pre-, mid-, and post-replacement needs:

pre & post: Mostly cleanup & correction & normalization - no words, full strings. mid: One or multi-word replacements. E.g.:

These, with all apostrophe/suffix possibilities!

A second problem here: what is a word boundary that is valid for all languages? Think of apostrophes or abbreviations, for example.
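A quick sketch of that concern, assuming the regex crate's Unicode \b (the Turkish sample is my own): the apostrophe counts as a non-word character, so a word-boundary rule fires inside a suffixed proper noun.

  use regex::Regex;

  fn main() {
      // "İstanbul'da" ("in Istanbul") is one word in Turkish, but \b treats
      // the apostrophe as a boundary, so a rule for the standalone word "da"
      // ("also") fires inside it as well:
      let re = Regex::new(r"\bda\b").unwrap();
      println!("{}", re.replace_all("İstanbul'da da yağmur var.", "DA"));
      // İstanbul'DA DA yağmur var.
  }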

And the related rules (e.g. for apostrophes, or for "a" vs. "â" usage changing over time) and the source quality are also questionable (i.e. are the authors and editors of those Wiki articles good enough?). Half of my blacklist consists of spelling errors.

One could use word replacements for old texts, for example (an author from 75 years ago might use "aceba", which is "acaba" in its modern form - that can be handled), but Wiki is definitely chaotic.

> should replacements additionally support Regex and therefore we could handle these cases with Regex?

I think this would be a good addition. On the other hand, regexes are risky to run on a full-scale resource.
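For the sake of discussion, a hedged sketch of what regex support could look like (the Rule enum and apply function here are hypothetical, not the project's actual API): each rule is either a plain substring pair, as today, or a compiled regex, and rules are applied in order.

  use regex::Regex;

  // Hypothetical rule type: a plain substring pair or a regex with its
  // replacement string.
  enum Rule {
      Plain(String, String),
      Pattern(Regex, String),
  }

  fn apply(rules: &[Rule], text: &str) -> String {
      // Rules run sequentially, each one working on the previous output.
      rules.iter().fold(text.to_string(), |acc, rule| match rule {
          Rule::Plain(from, to) => acc.replace(from, to),
          Rule::Pattern(re, to) => re.replace_all(&acc, to.as_str()).into_owned(),
      })
  }

  fn main() {
      let rules = vec![
          Rule::Plain(" sf.".into(), " sayfa".into()),
          Rule::Pattern(Regex::new(r"\bör\.").unwrap(), "örnek".into()),
      ];
      println!("{}", apply(&rules, "Ayrıntı için bkz. ör. sf. 30."));
      // Ayrıntı için bkz. örnek sayfa 30.
  }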

The whole replace/regex business is error-prone, and every change should be checked and tested extensively. E.g. I encountered the problem mentioned above while checking words for blacklisting: some weird ones ending in "...örnek" or "...sayfa".

HarikalarKutusu commented 1 year ago

This did not happen in further tests; it was probably a mistake in my regex formats (they changed a lot over time). You may want to close this - unless you'd like to keep it open for the regex option, which would be a great addition if someone tests it carefully.

MichaelKohler commented 1 year ago

I'm closing this for now, as it seems that the playground example now also shows the correct behavior.