common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences

What is the limit for "replacements"? #184

Closed (HarikalarKutusu closed this issue 1 month ago)

HarikalarKutusu commented 1 year ago

So, we can pre-process Wiki sentences with replacements. One common use is expanding abbreviations and acronyms, which is necessary for those sentences to be useful in Common Voice.
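To make the idea concrete, here is a minimal sketch of replacement-based pre-processing. It is an illustration only, not the extractor's actual implementation: the `REPLACEMENTS` table and the `apply_replacements` function are hypothetical names, and the English abbreviations are just sample data.

```python
import re

# Hypothetical replacement table: (abbreviation, expansion) pairs.
# A real rule set would be language-specific.
REPLACEMENTS = [
    ("e.g.", "for example"),
    ("approx.", "approximately"),
    ("km", "kilometres"),
]


def apply_replacements(sentence, replacements=REPLACEMENTS):
    """Apply each (source, target) pair in order and return the new sentence."""
    for source, target in replacements:
        # Match only when the source is not glued to other word characters,
        # so "km" does not rewrite the middle of a longer word.
        pattern = r"(?<!\w)" + re.escape(source) + r"(?!\w)"
        # A function replacement keeps the target literal (no backslash escapes).
        sentence = re.sub(pattern, lambda _match: target, sentence)
    return sentence


print(apply_replacements("The lake is approx. 5 km wide."))
```

Every pair that matches moves the sentence one step further away from the Wikipedia original, which is exactly what the question below is about.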

In my own pre-processing scripts (for language models, for example), I convert numbers to text, run spellchecking, and replace common misspellings.
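A rough sketch of what such a script might look like, assuming English text and only standalone numbers from 0 to 99. The names `number_to_words`, `MISSPELLINGS`, and `normalize` are made up for illustration, and the misspelling list is sample data, not taken from any real script.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# Hypothetical list of common misspellings (lowercase only in this sketch).
MISSPELLINGS = {"recieve": "receive", "teh": "the"}


def number_to_words(n):
    """Spell out an integer in the range 0..99 in English."""
    if not 0 <= n < 100:
        raise ValueError("this sketch only handles 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(sentence):
    """Fix known misspellings, then spell out standalone one- or two-digit numbers."""
    sentence = re.sub(
        r"\w+", lambda m: MISSPELLINGS.get(m.group(), m.group()), sentence
    )
    # Skip digits that belong to longer numbers, decimals, or thousands groups.
    return re.sub(
        r"(?<![\d.,])\b\d{1,2}\b(?![.,]?\d)",
        lambda m: number_to_words(int(m.group())),
        sentence,
    )


print(normalize("I will recieve 42 books in teh mail."))
```

Unlike abbreviation expansion, these steps rewrite words and numbers that the original author actually wrote, so the result differs from the source text more substantially.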

On the other hand, these replacements change the original text. Where should we stop - legally?

MichaelKohler commented 1 year ago

@jessicarose Hi Jessica, is this something you could figure out with Legal?

HarikalarKutusu commented 1 month ago

This has been implemented as a rule in PR #199.