Samagra-Development / ai-tools

AI Tooling to bootstrap applications fast

Translating Odia words in English sentences #97

Closed Gautam-Rajeev closed 9 months ago

Gautam-Rajeev commented 1 year ago

Some of the queries we receive are largely in English but contain a few Odia words. We want to convert these Odia words to English so that the queries are entirely in English.

We need to develop a solution that takes in a mixed-language sentence (dominated by English) and returns a sentence completely in English.
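A minimal sketch of the requested behaviour, assuming the Odia words appear in Odia script (Unicode block U+0B00 to U+0B7F) and that a word-level Odia-to-English lexicon is available. The `ODIA_TO_ENGLISH` entries and the `to_english` name are hypothetical and only for illustration; romanized Odia words would first need the transliteration model mentioned in this issue.

```python
import re

# Matches runs of characters from the Odia Unicode block (U+0B00 to U+0B7F).
ODIA_RUN = re.compile(r"[\u0B00-\u0B7F]+")

# Hypothetical sample lexicon; a real one would come from the parsed dictionary.
ODIA_TO_ENGLISH = {
    "ପାଣି": "water",
    "ଧାନ": "paddy",
}


def to_english(sentence, lexicon=ODIA_TO_ENGLISH):
    """Replace each Odia-script word with its English entry, if one exists.

    Words missing from the lexicon are left unchanged so that they can be
    spotted and handled later.
    """
    def replace(match):
        word = match.group(0)
        return lexicon.get(word, word)

    return ODIA_RUN.sub(replace, sentence)


if __name__ == "__main__":
    print(to_english("How much ପାଣି does ଧାନ need?"))
```

A plain lookup like this ignores inflection and context, so it is only a starting point for testing against real queries.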

We use the following model for transliteration (link).

rishav-eulb commented 1 year ago

@GautamR-Samagra I would like to work on it

ChakshuGautam commented 1 year ago

@rishav-eulb Please go ahead!!!

rishav-eulb commented 1 year ago

Do we have any dataset? @ChakshuGautam @GautamR-Samagra

ChakshuGautam commented 1 year ago

Odia Dictionary.pdf

We want to create a parser that converts this into a dataset.

shrivastava95 commented 1 year ago

@GautamR-Samagra could you upload some examples of the kind of sentences that are causing problems? That would be helpful for testing. It is a bit unclear to me what the process is and how transliteration figures into it.

shrivastava95 commented 1 year ago

The dictionary has been parsed into an unclean CSV file. Anyone interested in taking on the laborious task of data cleaning, please refer to the progress here: https://github.com/Samagra-Development/ai-tools/issues/106#issuecomment-1565186426 and feel free to reach out to me on Discord - Virhs.#4402

EDIT: I have parsed a somewhat cleaner CSV file. This one should have fewer errors. Link to new CSV file - shrivastava95/odia-dictionary/mergegpt/parsed_dicts_merged/parsed_dict_merged_unclean.csv

Harsh-1309 commented 1 year ago

Hi, has this issue been solved? If anything else needs to be done, I would like to take it up! @shrivastava95 @ChakshuGautam

wetleaf commented 1 year ago

Working on the translation problem, from [en-or-lang-dictionary].pdf to text. See my repo: https://github.com/wetleaf/OdiaToEnglish.git

Method used:

  1. Convert the PDFs into images.
  2. In every image, identify the horizontal and vertical lines to get cropped images.
  3. Inside the cropped images, find contour boxes (coordinates around words) and then apply tesseract to these word images [word_images.png].
  4. Using the default format [english] [partofspeech] [translated lang], arrange the output words according to their language. The format can be changed in the write_text() method in get_text.py.
  5. To get the language, simply check whether the symbols fall in the ASCII range 32 to 127, excluding some common characters, and then classify the word as English or another language.

I used the English to Koya Odia dictionary [Odia.Dictionary.pdf]; the output for pages 6 to 9 is in output.txt. It yields good accuracy.

Current problems:

  1. Some words are mistranslated.
  2. It is not a generic approach.
  3. It relies on a hack: Koya characters do not lie in the ASCII range used for English.
  4. High processing time (approx. 100 sec per page on CPU).
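The language check in step 5 can be sketched roughly as below. This is not the code from the linked repo: the `classify_token` name and the `COMMON_CHARS` set are assumptions made for illustration, and the upper bound is 126 rather than 127 because code point 127 is the non-printable DEL character.

```python
# Characters that can appear next to words in either language and so say
# nothing about the script. This particular set is an assumption.
COMMON_CHARS = set(" .,;:()[]-'\"")


def classify_token(token, ignored=COMMON_CHARS):
    """Classify an OCR token as 'english', 'other' or 'unknown'.

    A token counts as English when every character that is not in `ignored`
    is printable ASCII (code points 32 to 126).
    """
    significant = [ch for ch in token if ch not in ignored]
    if not significant:
        # Only punctuation or whitespace: no evidence either way.
        return "unknown"
    if all(32 <= ord(ch) <= 126 for ch in significant):
        return "english"
    return "other"


if __name__ == "__main__":
    for token in ["water", "n.", "ପାଣି", "(.)"]:
        print(repr(token), classify_token(token))
```

As the comment itself notes, this only separates ASCII from non-ASCII text, so it cannot tell two non-Latin scripts apart.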

wetleaf commented 1 year ago

Got another idea:

  1. Convert the PDF to page images.
  2. Preprocess each image to get the words from the pages (this part is done).
  3. Tesseract works well when a single language is specified, so the following steps output the correct text irrespective of language:
    • Say L1, L2, L3 ... Ln are the possible languages and I is the word image.
    • Use tesseract to convert image I in each of the above languages, one by one. Say X1 = L1(I), X2 = L2(I) and so on.
    • Now go in reverse and render images of the same dimensions from the texts X1, X2 ... Xn, using PIL and downloaded fonts for the above languages (this part is done). Let's call these images I1, I2, I3 ... In.
    • Compare I with I1, I2, I3 ... In and output the text of the most similar image (to be done).
    • This way we get the language as well as the font.

  4. (Postprocessing) After getting all the text, rearrange it using the preprocessed information (position of the word in the image, size, horizontal lines, vertical lines, etc.) in whatever way is required (this can be done easily).
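The per-word language selection in step 3 can be sketched as a small loop. The three callables are placeholders, not existing code: `ocr` could wrap something like `pytesseract.image_to_string(image, lang=...)`, `render` would draw the text with PIL, and `similarity` is the comparison that is still to be done. The function name `pick_best_language` is hypothetical.

```python
def pick_best_language(word_image, languages, ocr, render, similarity):
    """Return (language, text, score) for the best-matching OCR language.

    ocr(image, lang) -> text             recognise the word in one language
    render(text, lang) -> image          draw the text back into an image
    similarity(image_a, image_b) -> float, higher means more alike
    """
    best = None
    for lang in languages:
        text = ocr(word_image, lang)        # X_k = L_k(I)
        candidate = render(text, lang)      # I_k
        score = similarity(word_image, candidate)
        if best is None or score > best[2]:
            best = (lang, text, score)
    if best is None:
        raise ValueError("at least one language is required")
    return best


if __name__ == "__main__":
    # Stand-ins so the sketch runs without tesseract: "images" are strings.
    def fake_ocr(image, lang):
        return {"eng": "qIfa", "ori": "ପାଣି"}[lang]

    def fake_render(text, lang):
        return text

    def fake_similarity(a, b):
        return 1.0 if a == b else 0.0

    print(pick_best_language("ପାଣି", ["eng", "ori"],
                             fake_ocr, fake_render, fake_similarity))
```

Passing the three steps in as functions keeps the loop independent of whichever image comparison ends up being used.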

Problem: The main issue is comparing two images. Since we cannot determine the font of image I, we cannot render the text in the same font. Therefore, techniques like MSE or SSIM will not work directly. We need an alternative that detects whether two images contain the same text. Any ideas for this comparison?
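To make the font problem concrete, here is a toy illustration of pixel-wise MSE. The two 5x3 bitmaps are made up for this example (the letter "I" drawn with and without serifs); they are not taken from the dictionary pages.

```python
def mse(a, b):
    """Mean squared error between two equally sized grayscale images.

    Images are given as lists of rows of pixel values.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = sum((pa - pb) ** 2
                for ra, rb in zip(a, b)
                for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


# Toy bitmaps, 1 = ink: the same letter in two different "fonts".
plain_i = [[0, 1, 0] for _ in range(5)]
serif_i = [[1, 1, 1],
           [0, 1, 0],
           [0, 1, 0],
           [0, 1, 0],
           [1, 1, 1]]

if __name__ == "__main__":
    print(mse(plain_i, plain_i))  # identical images
    print(mse(plain_i, serif_i))  # same letter, different font
```

The second value is non-zero even though both bitmaps show the same letter, which is why a purely pixel-based score cannot decide on its own whether two images contain the same text.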

Possible solution: The images can be compared using a Siamese neural network, as used for signature verification.

Example images: I1 = engimage, I2 = oriimage, I = img

Gautam-Rajeev commented 1 year ago

@shrivastava95 can you link the work here with the tickets created recently and close this?