Caucasus-Rosetta / Lingua-Corpus

Caucasus languages focused multilingual and monolingual corpuses for Natural Language Processing (NLP)
Apache License 2.0

Extract all text from Adiga Psatha #114

Open danielinux7 opened 1 month ago

danielinux7 commented 1 month ago

Description

There are 3274 PDF files with East Circassian text. We need to extract the text from these PDF files for further processing.

Difficulties

The large number of PDF files makes it infeasible to do this by hand. The structure of the PDF content is also not straightforward, which causes the following issues:

  1. The text integrity is compromised during extraction, but the text must stay intact. For example, Ярин Андрей абы ЭТущэшхуэр should be Ярин Андрей абы теухуа зэӀушӀэшхуэр. If the corrupted sentence is passed to the processing step, we won't be able to recover the original text.

  2. Another PDF artifact is a hyphen at the end of a line, e.g. зы- гъэхьэзыра should be зыгъэхьэзыра.

  3. The text flow should be preserved. Text whose flow cannot be preserved should not be included.
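For the line-end hyphen issue in item 2, a minimal Python sketch could look like the following. The function name `join_hyphenated` is hypothetical, and the sketch assumes the extracted text still contains the original line breaks. It also joins genuine hyphenated compounds that happen to break at a line end, so the result may need a later check.

```python
import re

# A word character, then a hyphen (or soft hyphen) at the end of a line,
# then the rest of the word at the start of the next line.
# In Python 3, \w matches Cyrillic letters, including the palochka (Ӏ).
_LINE_END_HYPHEN = re.compile(r"(\w)[-\u00ad]\s*\n\s*(\w)")


def join_hyphenated(text: str) -> str:
    """Rejoin words that were split by a hyphen at a line break."""
    return _LINE_END_HYPHEN.sub(r"\1\2", text)
```

Hyphens in the middle of a line are left untouched, since the pattern requires a line break after the hyphen.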

Solution

A script in Python or shell should automatically extract the text from all 3274 PDF files and put it into a single file. The text should be semi-processed to remove the common issues that come from extracting text from PDF files.

The script should be saved in src/extraction/kbd/ap

The semi-processed file should be saved in data/interim/kbd/kbd/1.txt

data/interim/kbd/kbd/references.md should be updated with the entry:

- 1.txt - Adiga Psatha newspaper, cutoff date 19-10-2024 (http://www.smikbr.ru/arhivap)
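One possible shape for the extraction script, as a rough sketch and not a finished solution: it assumes the PDFs are already downloaded into a local directory and that poppler's `pdftotext` command is installed and on PATH. The names `extract_with_pdftotext` and `build_corpus` are hypothetical. The extractor is passed in as a parameter, so a different PDF library can be swapped in if `pdftotext` does not handle the text integrity or text flow issues well.

```python
import subprocess
from pathlib import Path
from typing import Callable


def extract_with_pdftotext(pdf: Path) -> str:
    """Assumption: poppler's pdftotext is on PATH; '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", str(pdf), "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def build_corpus(
    pdf_dir: Path,
    out_file: Path,
    extract: Callable[[Path], str] = extract_with_pdftotext,
) -> int:
    """Extract every *.pdf under pdf_dir into out_file.

    Returns the number of PDFs that yielded non-empty text.
    """
    out_file.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with out_file.open("w", encoding="utf-8") as out:
        # Sorted order keeps the output reproducible between runs.
        for pdf in sorted(pdf_dir.rglob("*.pdf")):
            try:
                text = extract(pdf)
            except (subprocess.CalledProcessError, OSError) as err:
                # Report the failing file and continue with the rest.
                print(f"skipped {pdf}: {err}")
                continue
            text = text.strip()
            if text:
                out.write(text + "\n")
                written += 1
    return written
```

A call such as `build_corpus(Path("pdfs"), Path("data/interim/kbd/kbd/1.txt"))` would write the combined file, where `pdfs` is a placeholder for wherever the downloaded archive lives. Cleanup steps such as rejoining hyphenated words would run on each `text` before it is written.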