This Python project scrapes raw PDF data containing MHT CET college and branch cutoffs, extracts the relevant information, and creates a JSON file. It also generates a "skipped" folder with pageNo.txt files holding the lines that could not be parsed and were therefore excluded from the JSON data. The final output is an Excel file (output.xlsx) containing organized cutoff data.
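The flow described above can be illustrated with a minimal sketch. This is not the project's actual code: the line format, the regular expression, the data.json file name, and the function names parse_line, process_pages, and read_pdf_pages are all assumptions made for illustration, and the real PDF layout is likely more complex.

```python
import json
import re
from pathlib import Path

# Hypothetical line format: "<branch code> <branch name> <cutoff>".
# The real cutoff PDF almost certainly needs more elaborate rules.
LINE_RE = re.compile(
    r"^(?P<code>\d+)\s+(?P<branch>.+?)\s+(?P<cutoff>\d+(?:\.\d+)?)$"
)


def parse_line(line):
    """Return a record dict, or None if the line is not understood."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "code": match["code"],
        "branch": match["branch"],
        "cutoff": float(match["cutoff"]),
    }


def process_pages(page_texts, out_dir="."):
    """Parse every page; write unparsed lines to skipped/<pageNo>.txt."""
    out_dir = Path(out_dir)
    skipped_dir = out_dir / "skipped"
    records = []
    for page_no, text in enumerate(page_texts, start=1):
        skipped = []
        for line in text.splitlines():
            if not line.strip():
                continue  # ignore blank lines
            record = parse_line(line)
            if record is None:
                skipped.append(line)
            else:
                records.append({"page": page_no, **record})
        if skipped:
            skipped_dir.mkdir(parents=True, exist_ok=True)
            skipped_file = skipped_dir / f"{page_no}.txt"
            skipped_file.write_text("\n".join(skipped) + "\n", encoding="utf-8")
    # "data.json" is an assumed file name, not taken from the project.
    json_file = out_dir / "data.json"
    json_file.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return records


def read_pdf_pages(pdf_path):
    """Extract the text of each page with pypdf (imported lazily)."""
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

A typical call would be process_pages(read_pdf_pages("cutoffs.pdf")), where the PDF path is a placeholder. Keeping the parsing separate from the PDF reading makes the parser testable on plain strings.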
Run main.py, then run DataMigrater.py to create the final Excel file (output.xlsx).
sudo apt-get update
sudo apt-get install python3-pip
pip3 install pypdf openpyxl
pip install pypdf openpyxl
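To confirm that the installation worked, a small check such as the following can be run. It is only a sketch: the helper name check_dependencies is hypothetical, and it only tests whether the two packages can be found, not whether their versions are suitable.

```python
import importlib.util


def check_dependencies(modules=("pypdf", "openpyxl")):
    """Map each module name to True if it can be found, else False."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in modules
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```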
Feel free to contribute or report issues on GitHub!
The out folder in this repository contains the following files:
Sample PDF (2023 CET CAP Round 1 Cut-off): The raw PDF file containing MHT CET college and branch cutoffs for 2023 CAP Round 1, which is the input file that the Python program processes.
Final Output (output.xlsx): After running the main.py script and executing the data extraction process, the program generates an Excel file named output.xlsx. This file contains organized and structured cutoff data for colleges and branches.
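As a rough illustration of how JSON records might end up in output.xlsx, here is a minimal sketch using openpyxl. It assumes the JSON file holds a list of flat records. The column names, the sheet title, and the functions records_to_rows, write_xlsx, and migrate are hypothetical and not taken from DataMigrater.py.

```python
import json
from pathlib import Path


def records_to_rows(records, columns):
    """Build a header row followed by one row per record."""
    rows = [list(columns)]
    for record in records:
        # Missing keys become empty cells instead of raising KeyError.
        rows.append([record.get(column, "") for column in columns])
    return rows


def write_xlsx(rows, xlsx_path="output.xlsx"):
    """Write the rows to a single worksheet (openpyxl imported lazily)."""
    from openpyxl import Workbook

    workbook = Workbook()
    sheet = workbook.active
    sheet.title = "Cutoffs"  # assumed sheet name
    for row in rows:
        sheet.append(row)
    workbook.save(xlsx_path)


def migrate(json_path, xlsx_path="output.xlsx",
            columns=("code", "branch", "cutoff")):
    """Load JSON records and save them as an Excel file."""
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))
    write_xlsx(records_to_rows(records, columns), xlsx_path)
```

Building the rows in a separate function keeps the column order explicit and lets the layout be checked without opening a spreadsheet.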