Caucasus-Rosetta / Lingua-Corpus

Multilingual and monolingual corpora focused on Caucasus languages for Natural Language Processing (NLP)
Apache License 2.0
31 stars 5 forks

Corpus alignment #106

Closed Bachstelze closed 7 months ago

Bachstelze commented 1 year ago

There are possible multilingual alignments that pivot on the ab-ru parallel corpus for the Bible, the human rights declaration and the Quran. "The Johns Hopkins University Bible Corpus: 1600+ Tongues for Typological Exploration" and "The eBible Corpus: Data and Model Benchmarks for Bible Translation for Low-Resource Languages" probably aligned the New Testament with other parallel Bibles, though I couldn't find any public confirmation that Abkhaz is actually included.

Bachstelze commented 1 year ago

I got a response that asks for a file in USFM: "As another alternative – if you have the original source files for the Abkhaz and Russian NT translations, and they are in USFM format, our tools could run over them using that format. The output would be files that are lined up to match each of the other translations in the eBible corpus."
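
For reference, USFM marks books with `\id`, chapters with `\c` and verses with `\v`, so lining two translations up by verse reference could be sketched roughly as below. This is only a minimal sketch and not the eBible tooling: the names `parse_usfm` and `align` are hypothetical, and it assumes fairly clean files (footnotes and cross-references are dropped, other markers are stripped, and section headings that follow a verse would end up in that verse's text).

```python
import re

# Split the file at book (\id), chapter (\c) and verse (\v) markers.
SPLIT = re.compile(r"\\(id|c|v)\s+")
# Footnotes (\f ... \f*) and cross-references (\x ... \x*) are removed entirely.
NOTE = re.compile(r"\\(f|x)\s.*?\\\1\*", re.DOTALL)
# Any remaining marker such as \p, \q1, \wj or \wj* is replaced by a space.
MARKER = re.compile(r"\\\+?[a-z]+\d*\*?")


def parse_usfm(text):
    """Return {(book, chapter, verse): verse_text} for one USFM string."""
    parts = SPLIT.split(NOTE.sub("", text))
    verses = {}
    book = chapter = None
    # parts looks like [preamble, tag, rest, tag, rest, ...]
    for tag, rest in zip(parts[1::2], parts[2::2]):
        fields = rest.split(maxsplit=1)
        if not fields:
            continue
        if tag == "id":
            book = fields[0]
        elif tag == "c":
            chapter = int(fields[0])
        else:
            # Verse numbers stay strings because ranges like "1-2" occur.
            body = fields[1] if len(fields) > 1 else ""
            verses[(book, chapter, fields[0])] = " ".join(MARKER.sub(" ", body).split())
    return verses


def align(src, tgt):
    """Pair the verses present in both translations, in source order."""
    return [(src[key], tgt[key]) for key in src if key in tgt]
```

Writing the two sides of `align(parse_usfm(ab_text), parse_usfm(ru_text))` to two files, one verse per line, would give a verse-aligned ab-ru pair; verses missing on either side are simply skipped.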

danielinux7 commented 10 months ago

Hello Kalle, the links to the original sources in Abkhazian can be found here: https://github.com/danielinux7/Abkhaz-NLP-Data-Pipeline/blob/master/data/ab-ru/references.md (entries 2, 4 and 14). I would assume they also include a Russian version in USFM format.

Bachstelze commented 8 months ago

I couldn't find a tool to convert the USFM format.

danielinux7 commented 7 months ago

If there is no progress in this direction, I am going to close the issue.