Scrapers and Parsers for Indian Budget Speech Documents
Arch Linux
sudo pacman -S tesseract
Ubuntu
sudo apt-get install tesseract-ocr
Arch Linux (language data)
sudo pacman -S tesseract-data
Ubuntu (language data)
sudo apt-get install tesseract-ocr-all
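Before running OCR, it may help to confirm that the English and Hindi models used later (`-l eng+hin`) are actually installed. The sketch below parses the output of `tesseract --list-langs`. The helper names `parse_langs` and `missing_langs` are illustrative and not part of this repository.

```python
import subprocess


def parse_langs(listing: str) -> set:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = set()
    for line in listing.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.add(line)
    return langs


def missing_langs(required=("eng", "hin")) -> list:
    """Return the required language codes that tesseract does not report."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some tesseract versions print the list to stderr, so read both streams.
    installed = parse_langs(result.stdout + "\n" + result.stderr)
    return [lang for lang in required if lang not in installed]
```

Calling `missing_langs()` returns an empty list when both models are present; any code it returns points to a language pack that still needs installing.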
python speech_scraper.py [--path]
It is recommended to convert PDFs to text files for better text extraction; HTML markup is messy to parse.
pdf2txt.py -o output.txt <pdf-file>
Convert PDF to image
python pdf2jpg.py --filename <input-file path> --path <output-file path>
For help:
python pdf2jpg.py --help
Convert the image to a PDF with a text layer
tesseract <img-filename> <pdf-basename> -l eng+hin pdf
Example:
tesseract page.jpg test -l eng+hin pdf
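A budget speech usually spans many pages, so the same tesseract call has to be repeated per page image. The following is a minimal sketch of that loop, assuming the page images are `.jpg` files in a single directory (the actual output naming of `pdf2jpg.py` is not documented here). `build_tesseract_cmd` and `ocr_images` are hypothetical helpers, not scripts from this repository.

```python
import subprocess
from pathlib import Path


def build_tesseract_cmd(image, output_base, langs=("eng", "hin")):
    """Build the argument list for one tesseract run that emits a PDF."""
    # tesseract appends ".pdf" itself, so output_base carries no extension.
    return ["tesseract", str(image), str(output_base), "-l", "+".join(langs), "pdf"]


def ocr_images(image_dir, out_dir):
    """Run tesseract on every .jpg in image_dir; return the PDF paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pdfs = []
    for image in sorted(Path(image_dir).glob("*.jpg")):
        base = out_dir / image.stem
        # check=True raises CalledProcessError if tesseract exits with an error.
        subprocess.run(build_tesseract_cmd(image, base), check=True)
        pdfs.append(Path(str(base) + ".pdf"))
    return pdfs
```

With the defaults, `build_tesseract_cmd("page.jpg", "test")` reproduces the example command above.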
Convert the above PDF to CSV
tabula <pdf-file> -o <output-csv>
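When there is one OCR'd PDF per page, the tabula step can be batched in the same way. This sketch assumes a `tabula` executable on the PATH that accepts the invocation shown above, and a directory of `.pdf` files. `csv_path_for` and `pdfs_to_csv` are illustrative names only.

```python
import subprocess
from pathlib import Path


def csv_path_for(pdf, out_dir):
    """Map a PDF path such as 'ocr/page.pdf' to '<out_dir>/page.csv'."""
    return Path(out_dir) / (Path(pdf).stem + ".csv")


def pdfs_to_csv(pdf_dir, out_dir):
    """Run `tabula <pdf-file> -o <output-csv>` for every PDF in pdf_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        target = csv_path_for(pdf, out_dir)
        # Same arguments as the manual command; check=True surfaces failures.
        subprocess.run(["tabula", str(pdf), "-o", str(target)], check=True)
        written.append(target)
    return written
```

Keeping one CSV per page makes it easier to spot pages where table detection failed before merging the results.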