KaniyamFoundation / ProjectIdeas

A Place to write down the project ideas and to plan them

OCR Cleaning VBA script for Microsoft Word #195

Open IngersolNorway opened 4 months ago

IngersolNorway commented 4 months ago

Hi Friends,

I'm seeking assistance in crafting a VBA script for Microsoft Word, primarily focusing on find and replace functionality. Specifically, I aim to clean up OCR documents, for example by removing spaces after the character "க்".

For instance:

Sub FindAndReplace()
Dim myDoc As Document
Set myDoc = ActiveDocument

' Define the text to find and replace
Dim findText As String
Dim replaceText As String
findText = "old text"
replaceText = "new text"

' Perform the find and replace operation
myDoc.Content.Find.Execute findText:=findText, replaceWith:=replaceText, _
    Replace:=wdReplaceAll

End Sub

Thank you for your assistance.

Best regards, Ingersol

tshrinivasan commented 4 months ago

Please provide the list of all items to clean up. We can write a macro or a Python script for the cleanup.

Note: removing the space after the character "க்" is not always correct. For example, வாழைக் குலை and கலைக் கல்லூரி are valid with the space.
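
Since some of these spaces are legitimate, one cautious option is to list every "க்" followed by a space, with a little context, and let a person decide which ones to fix. The Python sketch below does only that. The function name `find_space_after_ka` and the `context` parameter are made up for illustration, and it assumes the text is already available as a plain string rather than a Word document.

```python
import re

# "க்" is the letter க followed by the pulli sign; match it when spaces follow.
SPACE_AFTER_KA = re.compile(r"க் +")

def find_space_after_ka(text, context=10):
    """Return (offset, snippet) pairs for each 'க்' followed by a space.

    Nothing is replaced; the snippets are meant for manual review.
    """
    hits = []
    for match in SPACE_AFTER_KA.finditer(text):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        hits.append((match.start(), text[start:end]))
    return hits

if __name__ == "__main__":
    for offset, snippet in find_space_after_ka("வாழைக் குலை"):
        print(offset, snippet)
```

A reviewed list of offsets could then drive the actual replacement, whether in a macro or a script.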

IngersolNorway commented 4 months ago

Instructions for formatting in Microsoft Word:

  1. Replace Line Breaks with a space.
  2. Replace Tabs with Spaces.
  3. Reduce multiple spaces to a single space.
  4. Remove the space before , . ' " : ; ? !.
  5. Add a space after , . - ' " : ; ? ! ) ] } >.
  6. Add a space before ( [ { <.
  7. Add a space on both sides of / \ & % $ # @ * + = < > | ~.
  8. Remove the space before (க், ங், ச், ஞ், ட், ண், த், ந், ப், ம், ய், ர், ல், வ், ழ், ள், ற், ன், ஜ், ஷ், ஸ், ஹ், க்ஷ்).
  9. Delete the space after specific Tamil characters such as க், ச், த், ப்.
  10. Highlight Characters.
  11. Highlight numbers.
  12. Highlight English alphabets.
  13. Highlight long words.
  14. Highlight sentences.
  15. Highlight symbols.
  16. Remove multiples: remove runs of consecutive characters or symbols such as ……… ,,,,,,,,,,,,, ----------.
  17. Highlight similar characters: identify and highlight easily confused characters, like ஏ, எ, ஐ, கௌ, ஓ, ஒ, கு, சூ, டி, டூ, ணு, னூ, து, நு, தூ, நூ, பு, யு, பூ, யூ, மு, ழு, வ, ல, க, ச, க, சு, த, ந, O, 0, l, 1, S, 5.
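
A few of the purely textual steps above (1 to 4 and 16) can be sketched with regular expressions. This is a rough Python illustration on a plain string, not a Word macro. The function names are hypothetical, and step 16 is read here as "shrink a run to one symbol", which is an assumption, since the list does not say whether the whole run should be deleted.

```python
import re

def normalize_whitespace(text):
    # Steps 1-3: line breaks and tabs become spaces, then runs of spaces collapse.
    text = re.sub(r"[\r\n]+", " ", text)
    text = text.replace("\t", " ")
    return re.sub(r" {2,}", " ", text)

def remove_space_before_punctuation(text):
    # Step 4: drop spaces that sit directly before , . ' " : ; ? !
    return re.sub(r" +(?=[,.'\":;?!])", "", text)

def collapse_repeats(text):
    # Step 16 (assumed reading): shrink runs of the same symbol to one.
    # Only the three symbols shown in the list are handled: … , -
    return re.sub(r"([…,\-])\1+", r"\1", text)

if __name__ == "__main__":
    sample = "one\ttwo\n\nthree   four , five !"
    sample = normalize_whitespace(sample)
    sample = remove_space_before_punctuation(sample)
    print(sample)
```

Step 4 as written also removes the space before an opening quote mark, so quoted passages may need a manual check afterwards.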
IngersolNorway commented 4 months ago

Each of these steps should be optional to run.
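
One way to keep the steps optional is to register each one under its number and run only the numbers the user selects. The sketch below is in Python and covers just three steps (3, 6 and 8) to show the idea. The names `STEPS` and `clean` are hypothetical, and the step 8 pattern is an assumption that treats any listed consonant followed by the pulli sign as a pure consonant. It would therefore also join loanwords that begin with such a cluster (for example ஸ்ரீ) to the previous word, so those cases need review.

```python
import re

# A listed Tamil consonant followed by the pulli sign (U+0BCD).
TAMIL_PURE_CONSONANT = "[கஙசஞடணதநபமயரலவழளறனஜஷஸஹ]\u0bcd"

# Step number -> cleanup function. Only three steps are sketched here.
STEPS = {
    # Step 3: reduce multiple spaces to a single space.
    3: lambda t: re.sub(r" {2,}", " ", t),
    # Step 6: add a space before ( [ { < unless a space or bracket precedes it.
    6: lambda t: re.sub(r"(?<=[^\s(\[{<])([(\[{<])", r" \1", t),
    # Step 8: remove the space before a pure consonant.
    8: lambda t: re.sub(" +(?=" + TAMIL_PURE_CONSONANT + ")", "", t),
}

def clean(text, enabled):
    """Apply only the steps whose numbers are in `enabled`, in numeric order."""
    for number in sorted(STEPS):
        if number in enabled:
            text = STEPS[number](text)
    return text

if __name__ == "__main__":
    print(clean("f(x)  வண க்கம்", {3, 6, 8}))
```

In a Word macro the same idea could map to one checkbox or one separate Sub per step.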
