Caucasus-Rosetta / Lingua-Corpus

Multilingual and monolingual corpora focused on Caucasus languages, for Natural Language Processing (NLP)
Apache License 2.0

Prepare sentences from CC0 text #77

Closed danielinux7 closed 3 years ago

danielinux7 commented 3 years ago

Description

This text is to be used in Common Voice for recording.

Challenges

Sentences shouldn't be longer than ~7 words, and the text should be cleaned of extra symbols.

Solution

  1. Clean up extra symbols.
  2. Split text into smaller parts if possible.
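A rough sketch of the two steps above in Python, not the script actually used in the linked commit. Which symbols count as "extra", the hard cap of 7 words, the choice to drop parts that are still too long after splitting, and all function names (`clean`, `split_sentences`, `split_long`, `prepare`) are assumptions made for illustration. The result is sorted by length so short and long lines are easy to review.

```python
import re

MAX_WORDS = 7  # the "~7 words" limit from the issue, treated here as a hard cap (assumption)


def clean(text):
    """Remove symbols other than letters, digits, whitespace and basic punctuation."""
    # Assumption: brackets, quotes, asterisks and similar symbols are the "extra" ones.
    text = re.sub(r"[^\w\s.,!?;:\-]", " ", text)
    # Re-attach punctuation that was separated from its word by a removed symbol.
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Split text after sentence-final punctuation (., ! or ?)."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def split_long(sentence, max_words=MAX_WORDS):
    """Split an over-long sentence at commas, semicolons or colons, if possible."""
    if len(sentence.split()) <= max_words:
        return [sentence]
    return [c.strip() for c in re.split(r"[,;:]\s*", sentence) if c.strip()]


def prepare(text, max_words=MAX_WORDS):
    """Clean the text, split it, and keep only parts within the word limit."""
    result = set()
    for sentence in split_sentences(clean(text)):
        for part in split_long(sentence, max_words):
            # Assumption: parts that are still too long are dropped, not truncated.
            if len(part.split()) <= max_words:
                result.add(part)
    # Sort by length, then alphabetically, for easier manual review.
    return sorted(result, key=lambda s: (len(s), s))


if __name__ == "__main__":
    for line in prepare("Hello, world! «This is (a) test»."):
        print(line)
```

Splitting on punctuation alone can produce fragments that are not natural to read aloud, so the output would still need a manual pass before it goes into Common Voice.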

https://github.com/danielinux7/Public-Domain-Abkhaz/commit/5aa32e5d6ed32b24ee5b2d43eb2aaf0891f6ced4

References:

1. https://www.systutorials.com/how-to-sort-lines-by-length-in-linux/