Gautam-Rajeev closed this 9 months ago
@GautamR-Samagra I would like to work on it
@rishav-eulb Please go ahead!!!
Do we have any dataset @ChakshuGautam @GautamR-Samagra
Odia Dictionary.pdf: we want to create a parser that converts this PDF into a dataset.
@GautamR-Samagra could you upload some examples of the kind of sentences that are causing problems? would be helpful for testing. It is a bit unclear to me what the process is and how transliteration figures into it.
The parsing of the dictionary into an unclean CSV file has been finished. Anyone interested in proceeding with the laborious task of data cleaning, please refer to the progress here: https://github.com/Samagra-Development/ai-tools/issues/106#issuecomment-1565186426 and feel free to reach out to me on discord - Virhs.#4402
EDIT: I have parsed a somewhat cleaner CSV file. This one should have fewer errors. Link to new CSV file - shrivastava95/odia-dictionary/mergegpt/parsed_dicts_merged/parsed_dict_merged_unclean.csv
Hi, is this issue solved? If anything else needs to be done, I would like to take it up! @shrivastava95 @ChakshuGautam
I am working on the problem of converting [en-or-lang-dictionary].pdf to text; see my repo: https://github.com/wetleaf/OdiaToEnglish.git
Method used:
1) Convert the PDF pages into images.
2) In each image, identify the horizontal and vertical lines and use them to get the cropped images.
3) Inside each cropped image, find contour boxes (coordinates around words) and then apply Tesseract to these word images [word_images.png].
4) Using the default format [english] [partofspeech] [translated lang], arrange the output words according to their language. The format can be changed in the write_text() method in get_text.py.
5) To determine the language, check whether a word's symbols fall in the ASCII range 32 to 127 (excluding some common characters), and then classify the word as English or the other language.
I used the English to Koya Odia dictionary [Odia.Dictionary.pdf]; the output for pages 6 to 9 is in output.txt. It yields good accuracy.
Current problems:
1) Some words are mistranslated.
2) It is not a generic approach.
3) It relies on a hack: Koya characters do not lie in the ASCII range used for English.
4) High processing time (approx. 100 sec per page on CPU).
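The ASCII-range check from step 5 could look roughly like the sketch below. This is not the code from the repo; the function name `guess_language` and the choice to treat punctuation, digits, and whitespace as the "common characters" to ignore are assumptions made for illustration.

```python
import string

# Assumption: punctuation, digits, and whitespace are the "common characters"
# that say nothing about the language, so they are ignored.
NEUTRAL = set(string.punctuation + string.digits + string.whitespace)


def guess_language(token: str) -> str:
    """Classify an OCR'd token as 'english', 'other', or 'unknown'."""
    letters = [ch for ch in token if ch not in NEUTRAL]
    if not letters:
        # Nothing but neutral characters, e.g. "(1)" or "--".
        return "unknown"
    # Range 32..127 as described in the comment above (127 itself is DEL).
    if all(32 <= ord(ch) <= 127 for ch in letters):
        return "english"
    return "other"


print(guess_language("water"))  # ASCII letters only
print(guess_language("ପାଣି"))    # Odia-script characters
```

As noted under the current problems, this stays a script-level heuristic: it separates Latin script from everything else and would not distinguish two languages written in the same script.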
Got another idea:
4) (Postprocessing) After getting all the text, rearrange it using the preprocessed information (position of the word in the image, size, horizontal lines, vertical lines, etc.) in whatever way is required (this can be done easily).
Problem: The main issue is comparing two images. Since we cannot determine the font of image I, we cannot render the text in the same font, so techniques like MSE or SSIM will not work directly. We need an alternative that detects whether two images contain the same text. Any ideas for this comparison?
Possible solution: The images can be compared using a Siamese neural network, as used for signature verification.
Example images: I1 = engimage, I2 = oriimage, I = img
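To illustrate why a pixel-wise metric such as MSE is fragile here, below is a small sketch on toy bitmaps. The 5x5 "images" and the `mse` helper are made up for illustration; a different font changes stroke positions and shapes in a similar way to the one-pixel shift used here.

```python
def mse(a, b):
    """Mean squared error between two equally sized grayscale images,
    each given as a list of rows of pixel values."""
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += (pa - pb) ** 2
            count += 1
    return total / count


# Toy 5x5 binary images: a vertical stroke, the same stroke shifted
# one pixel to the right, and an empty image.
stroke = [[0, 0, 1, 0, 0] for _ in range(5)]
shifted = [[0, 0, 0, 1, 0] for _ in range(5)]
blank = [[0, 0, 0, 0, 0] for _ in range(5)]

print(mse(stroke, stroke))   # identical images
print(mse(stroke, shifted))  # same shape, moved by one pixel
print(mse(stroke, blank))    # no stroke at all
```

In this toy case the shifted copy of the same stroke scores worse than an empty image, which is the kind of behaviour that makes a learned similarity measure (such as the Siamese network suggested above) worth considering.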
@shrivastava95 can you link the work here with the tickets created recently and close this.
Currently, some of the queries being asked are largely English sentences that contain a few Odia words. We want to be able to convert these Odia words to English so that the queries are completely in English.
We need to develop a solution that takes in a mixed-language sentence (dominated by English) and returns a sentence completely in English.
We use the following model for transliteration (link).
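A minimal sketch of the dictionary-lookup part of such a solution is shown below. It assumes the Odia words appear in Odia script (Unicode block U+0B00 to U+0B7F); romanized Odia words would first need the transliteration step. The names `ODIA_TO_ENGLISH` and `to_english` are hypothetical, and the two lexicon entries are illustrative placeholders, not taken from the parsed dictionary.

```python
import re

# Illustrative entries only; a real lexicon would be loaded from the
# cleaned dictionary CSV.
ODIA_TO_ENGLISH = {
    "ପାଣି": "water",
    "ଘର": "house",
}

# Matches runs of characters from the Odia Unicode block.
ODIA_RUN = re.compile(r"[\u0B00-\u0B7F]+")


def to_english(sentence, lexicon=ODIA_TO_ENGLISH):
    """Replace each Odia-script word with its English entry, if known."""
    def swap(match):
        word = match.group(0)
        # Unknown words are left untouched rather than guessed.
        return lexicon.get(word, word)

    return ODIA_RUN.sub(swap, sentence)


print(to_english("How much ପାଣି does wheat need?"))
```

Leaving unknown words in place keeps the failure visible, so they can be logged and added to the lexicon later.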