eehab-saadat / PlagPatrol

The ultimate responsive, web-based, OCR-embedded plagiarism checker, which generates PDF plagiarism reports and supports handwritten documents. Created as an end-of-semester project.

File Parser & Tokenizer Code #1

Closed eehab-saadat closed 1 year ago

eehab-saadat commented 1 year ago

File Parser & Tokenizer Code

Create the file parser utility, which reads data from the allowed file formats (.DOCX and .TXT) and converts it into a readable string. That string is then tokenized into an iterable list of phrases, split on sentence-ending punctuation, to be passed as input to the web scraper & plagiarism checker module. Check the possibility of multilingual support.
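A rough sketch of what this sub-module could look like, using only the standard library. Everything named here (`parse_file`, `tokenize`, `ALLOWED_EXTENSIONS`, the punctuation set) is a placeholder for illustration, not the final design. Reading `word/document.xml` directly out of the .DOCX archive is one possible approach, assumed here to avoid extra dependencies; a dedicated DOCX library may turn out to be the better choice. The non-Latin punctuation marks are only a first guess at what multilingual support might need.

```python
import re
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path

# Hypothetical names; the issue only fixes the two allowed formats.
ALLOWED_EXTENSIONS = {".txt", ".docx"}

# XML namespace used by WordprocessingML elements inside a .docx archive.
_W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Sentence-ending punctuation. The non-Latin marks are an assumption,
# added only to illustrate where multilingual support could hook in.
_ENDINGS = ".!?。！？؟۔"
_PHRASE = re.compile(f"[^{_ENDINGS}]+[{_ENDINGS}]*")


def _read_docx(path):
    """Extract paragraph text from a .docx file without third-party packages."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{_W_NS}p"):
        runs = (node.text or "" for node in paragraph.iter(f"{_W_NS}t"))
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def parse_file(path):
    """Return the content of an allowed file as one readable string."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file format: {suffix or '(none)'}")
    if suffix == ".docx":
        return _read_docx(path)
    return path.read_text(encoding="utf-8")


def tokenize(text):
    """Split text into a list of phrases on sentence-ending punctuation."""
    normalized = " ".join(text.split())  # collapse newlines and repeated spaces
    return [m.strip() for m in _PHRASE.findall(normalized) if m.strip()]


if __name__ == "__main__":
    print(tokenize("Hello world. How are you?  Fine!"))
```

Note that this split is naive: abbreviations such as "e.g." and decimals such as "3.14" are cut at every period, so the tokenizer would likely need refinement before feeding the plagiarism checker.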

Here's a check-list to ensure correct implementation of features:

Perform tests and submit a stable build of the sub-module, as a Python module stored in the /utils/ folder for further use, before the sub-module completion phase deadline.