KaniyamFoundation / ProjectIdeas

A Place to write down the project ideas and to plan them

OCR CLEANING #177

Open IngersolNorway opened 2 years ago

IngersolNorway commented 2 years ago

Can anyone write a program to clean OCR texts?

PREFERRED: PYTHON, CLEANING THROUGH https://colab.research.google.com/

INPUT - TAMIL TEXT FILE, *.TXT

CLEANING BY FIND AND REPLACE METHODS

  1. REPLACE MULTIPLE LINE BREAKS WITH ONE LINE BREAK
  2. REMOVE SPACES AFTER ( [ {
  3. REMOVE SPACES IN FRONT OF , . : ; ) ] }
  4. INSERT A SPACE BEFORE AND AFTER / - =
  5. REPLACE MULTIPLE SPACES WITH ONE SPACE
  6. REMOVE SPACES IN FRONT OF ( க், ங், ச், ஞ், ட், ண், த், ந், ப், ம், ய், ர், ல், வ், ழ், ள், ற், ன், ஜ், ஷ், ஸ், ஹ், க்ஷ்)
  7. REMOVE SPACE AFTER க், ங், ஞ், ச், ட், த், ந், ப், ற், வ்
  8. REMOVE LINEBREAK AFTER =
  9. REPLACE REPEATED _ : - / = . , ; ) ( [ ] { } WITH A SINGLE ONE
  10. REPLACE ; WITH :
  11. REPLACE ,. WITH .
  12. REPLACE ., WITH .
  13. DELETE |
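
The rules above could be sketched in Python with the standard `re` module, which also runs in Colab. This is only a minimal sketch under some assumptions: the function names `clean_ocr_text` and `clean_file` are hypothetical, the input is assumed to be UTF-8 with `\n` line breaks, and the rules are applied in a reordered sequence (noted in the comments) so that one rule does not undo another. Rule 6 and rule 7 are read as "consonant followed by the pulli sign U+0BCD".

```python
import re
from pathlib import Path

PULLI = "\u0BCD"  # Tamil pulli (virama) that follows each consonant in rules 6 and 7
# Base consonants from rule 6; க்ஷ் begins with க் so it is already covered.
RULE6_CONSONANTS = "கஙசஞடணதநபமயரலவழளறனஜஷஸஹ"
# Base consonants from rule 7.
RULE7_CONSONANTS = "கஙஞசடதநபறவ"


def clean_ocr_text(text: str) -> str:
    """Apply the 13 find-and-replace rules (order is an assumption)."""
    text = text.replace("|", "")                              # rule 13
    text = re.sub(r"\n{2,}", "\n", text)                      # rule 1
    # rule 9: collapse a run of the same punctuation mark to one
    text = re.sub(r"([_:\-/=.,;()\[\]{}])\1+", r"\1", text)
    text = text.replace(",.", ".").replace(".,", ".")         # rules 11, 12
    text = text.replace(";", ":")                             # rule 10
    text = re.sub(r"([(\[{]) +", r"\1", text)                 # rule 2
    text = re.sub(r" +([,.:;)\]}])", r"\1", text)             # rule 3
    text = re.sub(r" *([/\-=]) *", r" \1 ", text)             # rule 4
    text = re.sub(r" {2,}", " ", text)                        # rule 5
    # rule 6: drop spaces that come right before consonant + pulli
    text = re.sub(rf" +(?=[{RULE6_CONSONANTS}]{PULLI})", "", text)
    # rule 7: drop spaces that come right after consonant + pulli
    text = re.sub(rf"(?<=[{RULE7_CONSONANTS}]{PULLI}) +", "", text)
    text = re.sub(r"= *\n", "= ", text)                       # rule 8
    return text


def clean_file(src: str, dst: str) -> None:
    """Read a UTF-8 text file, clean it, and write the result."""
    cleaned = clean_ocr_text(Path(src).read_text(encoding="utf-8"))
    Path(dst).write_text(cleaned, encoding="utf-8")
```

In Colab, something like `clean_file("input.txt", "cleaned.txt")` would process an uploaded file; both file names are placeholders. Note that rule 4 also puts spaces around hyphens inside words and slashes inside URLs, and rule 9 collapses "..." to ".", so the output may need a manual check.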