samantar_parsers

Scrapers and Parsers for Indian Budget Speech Documents

Installation

Arch Linux

sudo pacman -S tesseract

Ubuntu

sudo apt-get install tesseract-ocr

Tesseract language data (needed for the -l eng+hin option used below)

Arch Linux

sudo pacman -S tesseract-data

Ubuntu

sudo apt-get install tesseract-ocr-all
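
To confirm the install, a small Python check along these lines may help. It is only a sketch: the function names are made up for this example and are not part of this repository, and it assumes that `tesseract --list-langs` prints one language code per line after a "List of available languages" header.

```python
import shutil
import subprocess


def parse_language_list(output):
    """Return the language codes found in `tesseract --list-langs` output."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def missing_languages(required, output):
    """Return the required codes that do not appear in the listing."""
    available = set(parse_language_list(output))
    return [code for code in required if code not in available]


def check_tesseract(required=("eng", "hin")):
    """Ask the local tesseract binary which languages it has (hypothetical helper)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=False,
    )
    # Some older releases print the listing on stderr, so read both streams.
    return missing_languages(required, result.stdout + "\n" + result.stderr)
```

Calling `check_tesseract()` returns an empty list when both `eng` and `hin` are available, and otherwise the codes that still need installing.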

python speech_scraper.py [--path]

Converting the PDFs to text files is recommended for cleaner text extraction; HTML markup is messy to parse.

pdf2txt.py -o output.txt <pdf-file>
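
For a whole directory of speeches, the same command can be driven from Python. This is a rough sketch rather than part of the repository: `pdf2txt_command` and `convert_all` are hypothetical names, and it assumes `pdf2txt.py` is on the PATH and that each PDF should become a `.txt` file with the same base name.

```python
import subprocess
from pathlib import Path


def pdf2txt_command(pdf_path, out_dir):
    """Build the pdf2txt.py call that writes <out_dir>/<pdf stem>.txt."""
    pdf_path = Path(pdf_path)
    target = Path(out_dir) / (pdf_path.stem + ".txt")
    return ["pdf2txt.py", "-o", str(target), str(pdf_path)]


def convert_all(pdf_dir, out_dir):
    """Convert every *.pdf in pdf_dir and return the text files written."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        command = pdf2txt_command(pdf_path, out_dir)
        subprocess.run(command, check=True)  # raises if pdf2txt.py fails
        written.append(command[2])  # the -o target path
    return written
```

For example, `convert_all("speeches", "text")` would write `text/<name>.txt` for every PDF in a (hypothetical) `speeches` directory.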

Convert PDF to image

python pdf2jpg.py --filename <input-file path> --path <output-file path>

For help:

python pdf2jpg.py --help

Convert the image to a searchable PDF (page image plus a text layer)

tesseract <img-filename> <output-base> -l eng+hin pdf

Example (tesseract appends .pdf to the output base, so this writes test.pdf):

tesseract page.jpg test -l eng+hin pdf
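
When a speech has been split into many page images, the command above can be repeated for each page. The sketch below shows one way to do that; the helper names are invented for this example, and it assumes the page images end in `.jpg` and that each PDF should sit next to its image.

```python
import subprocess
from pathlib import Path


def tesseract_pdf_command(image_path, languages=("eng", "hin")):
    """Build the tesseract call for one page image.

    The output base is the image path without its extension;
    tesseract appends ".pdf" to it.
    """
    image_path = Path(image_path)
    output_base = image_path.with_suffix("")
    return ["tesseract", str(image_path), str(output_base),
            "-l", "+".join(languages), "pdf"]


def ocr_pages(image_dir, pattern="*.jpg"):
    """Run tesseract on every matching image, in sorted name order."""
    outputs = []
    for image_path in sorted(Path(image_dir).glob(pattern)):
        subprocess.run(tesseract_pdf_command(image_path), check=True)
        outputs.append(image_path.with_suffix(".pdf"))
    return outputs
```

`tesseract_pdf_command("page.jpg")` builds the same kind of call as the example above, with `page` as the output base instead of `test`.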

Convert the above PDF to csv

tabula <pdf-file> -o <output-csv>
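
Tables recovered from OCR output often come out with rows of uneven width, so it can be worth checking the CSV before using it. The following is a minimal sketch of such a check; `ragged_rows` is a hypothetical helper, not something this repository provides, and reading the file as UTF-8 is an assumption.

```python
import csv
from collections import Counter


def ragged_rows(lines):
    """Return 1-based numbers of rows whose width differs from the most common width.

    `lines` is any iterable of CSV text lines, for example an open file.
    """
    rows = list(csv.reader(lines))
    if not rows:
        return []
    # Treat the most frequent column count as the expected table width.
    expected = Counter(len(row) for row in rows).most_common(1)[0][0]
    return [n for n, row in enumerate(rows, start=1) if len(row) != expected]
```

With a file on disk this would be called as `ragged_rows(open("output.csv", newline="", encoding="utf-8"))`; the row numbers it returns are the ones to inspect by hand.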