Description
There are 3274 PDF files containing East Circassian text. The text needs to be extracted from these files for further processing.
Difficulties
The number of PDF files is too large to process by hand. The structure of the PDF content is also not straightforward, which causes the following issues:
Text integrity is compromised during extraction, although the text should stay intact. For example, "Ярин Андрей абы ЭТущэшхуэр" should be "Ярин Андрей абы теухуа зэӀушӀэшхуэр". If a sentence like this is passed to the processing step, the original text cannot be recovered.
Another PDF artifact is a hyphen at the end of a line. For example, "зы- гъэхьэзыра" should be "зыгъэхьэзыра".
The text flow should be preserved. Where that is not possible, the affected text should not be included.
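The line-end hyphen issue could be handled along the following lines. This is a minimal sketch, not a definitive implementation: the function name join_hyphenated_lines is hypothetical, and it assumes the extracted text still contains the original line breaks. It also assumes that every hyphen directly before a line break is a layout hyphen, so a genuine hyphenated compound that happens to break at the end of a line would be joined as well.

```python
import re

# A hyphen (ASCII hyphen, soft hyphen or U+2010) that directly follows a word
# character, sits at the end of a line, and is followed by a word character on
# the next line. Lookarounds keep the surrounding letters out of the match.
_LINE_END_HYPHEN = re.compile(r"(?<=\w)[-\u00ad\u2010][ \t]*\r?\n[ \t]*(?=\w)")


def join_hyphenated_lines(text: str) -> str:
    """Rejoin words that were split by a hyphen at a line break.

    Assumption: any hyphen immediately before a line break is a layout
    artifact. Hyphens in the middle of a line are left untouched.
    """
    return _LINE_END_HYPHEN.sub("", text)
```

With the example above, join_hyphenated_lines("зы-\nгъэхьэзыра") gives "зыгъэхьэзыра". In Python 3, \w matches Cyrillic letters, including the palochka (Ӏ), so no custom character class is needed.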
Solution
A script in Python or shell should automatically extract the text from all 3274 PDF files and put it into a single file. The text should be semi-processed to remove common PDF extraction issues.
The script should be saved in src/extraction/kbd/ap
The semi-processed file should be saved in data/interim/kbd/kbd/1.txt
The following entry should be added to data/interim/kbd/kbd/references.md:
- 1.txt - Adiga Psatha newspaper, cutoff date 19-10-2024 (http://www.smikbr.ru/arhivap)
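One possible shape for the extraction script is sketched below. It is an illustration under stated assumptions, not the required implementation: the choice of the third-party pypdf library is an assumption (the task only asks for Python or shell), and the names extract_pdf_text and build_corpus are hypothetical. The extractor is passed in as a parameter so that a different backend can be swapped in if pypdf does not preserve the text well enough.

```python
from pathlib import Path
from typing import Callable


def extract_pdf_text(pdf_path: Path) -> str:
    """Extract text page by page with pypdf (an assumed choice of library)."""
    from pypdf import PdfReader  # third-party, imported only when needed

    reader = PdfReader(str(pdf_path))
    # Guard against pages that yield no text (for example scanned pages).
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def build_corpus(
    pdf_dir: Path,
    out_path: Path,
    extract: Callable[[Path], str] = extract_pdf_text,
) -> int:
    """Write the text of every PDF under pdf_dir to out_path.

    Returns the number of files that were extracted successfully.
    """
    out_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        # Sorted order keeps the output reproducible between runs.
        for pdf_path in sorted(pdf_dir.rglob("*.pdf")):
            try:
                text = extract(pdf_path)
            except Exception as exc:
                # One unreadable file should not stop the whole run.
                print(f"skipped {pdf_path}: {exc}")
                continue
            out.write(text.strip() + "\n")
            count += 1
    return count
```

A run would then call build_corpus with the directory holding the downloaded PDF files (its location is not specified here) and Path("data/interim/kbd/kbd/1.txt") as the output path. Any semi-processing of the text would be applied to the result of the extractor before it is written.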