chatcontract / public-issues

This repository is to log all the issues reported by our users.

Error when there are empty pages in PDF or DOC #297

Open samyak112 opened 5 days ago

samyak112 commented 5 days ago

Description

Empty pages in a PDF or DOC file cause problems in several cases:

  1. When an empty .docx file (one with no text at all) is uploaded, it is marked as Corrupted. When an empty .pdf file is uploaded, it is not marked as Corrupted, and trying to analyze it causes an error.
  2. When a document with an empty page in the middle is uploaded, our OCR removes that page and merges the content of the remaining pages. The resulting PDF can differ from the one originally uploaded. (We need to discuss whether this needs to be solved.)
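
To make case 2 concrete, here is a rough sketch of how dropping blank pages shifts page positions, and how keeping a placeholder per page would avoid that. It is not the actual OCR code: the function names and the list-of-strings page representation are assumptions made up for illustration.

```python
from typing import List


def merge_dropping_empty(pages: List[str]) -> List[str]:
    """Mimic the behavior described in case 2: blank pages disappear."""
    return [text for text in pages if text.strip()]


def merge_keeping_empty(pages: List[str], placeholder: str = "") -> List[str]:
    """Keep one entry per original page so page numbers stay aligned."""
    return [text if text.strip() else placeholder for text in pages]


# Hypothetical three-page document whose middle page is blank.
pages = ["Page one text", "   ", "Page three text"]

dropped = merge_dropping_empty(pages)  # 2 entries; "Page three text" is now at index 1
kept = merge_keeping_empty(pages)      # 3 entries; "Page three text" stays at index 2
```

With the second approach the page count of the output matches the upload, at the cost of carrying empty entries through the pipeline.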

Solution

When an empty .docx or .pdf file is uploaded, the frontend should clearly mark it as Empty rather than Corrupted, and it should be blocked from further analysis, just like Corrupted PDFs.
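
A minimal sketch of the check this implies on the backend side, assuming text extraction has already run and returns one string per page, or `None` when the file could not be parsed. `UploadStatus`, `classify_upload`, and `can_analyze` are hypothetical names, not existing code in the repository.

```python
from enum import Enum
from typing import Optional, Sequence


class UploadStatus(Enum):
    OK = "ok"
    EMPTY = "empty"
    CORRUPTED = "corrupted"


def classify_upload(page_texts: Optional[Sequence[str]]) -> UploadStatus:
    """Classify an upload from its extracted page texts.

    Assumption: the extraction step returns None when the file cannot be
    parsed, and a (possibly empty) sequence of page strings otherwise.
    """
    if page_texts is None:
        return UploadStatus.CORRUPTED
    # No pages at all, or only whitespace on every page, counts as empty.
    if not any(text.strip() for text in page_texts):
        return UploadStatus.EMPTY
    return UploadStatus.OK


def can_analyze(status: UploadStatus) -> bool:
    """Only files with actual text are allowed through to analysis."""
    return status is UploadStatus.OK
```

The same check would apply to .docx and .pdf, so both file types get the Empty label and neither reaches the analysis step.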

samyak112 commented 3 days ago

Pull Request

https://github.com/chatcontract/django-ml-backend/pull/177