common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences

Adding regex replacement feature #202

Open raivisdejus opened 1 year ago

raivisdejus commented 1 year ago

This adds another replacement option that can process regular expressions. It can be used to split longer sentences into smaller chunks.

In my tests for Latvian, this can yield ~25% more sentences in the final output.
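
For illustration only, here is a minimal Python sketch of the general idea, not the code in this PR. The rule list `REGEX_REPLACEMENTS`, the function `split_with_regex`, and the choice to split on semicolons are all hypothetical and do not reflect the extractor's actual rule format.

```python
import re

# Hypothetical rules: (compiled pattern, replacement) pairs.
# This one rewrites "; " as a sentence end followed by a line break.
REGEX_REPLACEMENTS = [
    (re.compile(r";\s+"), ".\n"),
]


def split_with_regex(sentence, rules=REGEX_REPLACEMENTS):
    """Apply each regex replacement in order, then split on line breaks."""
    for pattern, replacement in rules:
        sentence = pattern.sub(replacement, sentence)
    # Each non-empty line is treated as its own shorter sentence.
    chunks = [chunk.strip() for chunk in sentence.split("\n")]
    return [chunk for chunk in chunks if chunk]


if __name__ == "__main__":
    for chunk in split_with_regex("First clause; second clause."):
        print(chunk)
```

A sentence that is too long for the extractor's length limit may pass once it is broken into chunks like this, which is one plausible reason the final output could grow.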

MichaelKohler commented 1 year ago

Thanks for submitting this! I will have a closer look at this PR tomorrow.