Ezhil-Language-Foundation / open-tamil

Open Source Tamil NLP Tools - தமிழ் இயற்கை மொழி பகுப்பாய்வு நிரல்தொகுப்பு (Tamil natural language analysis toolkit)
http://tamilpesu.us
MIT License
264 stars 80 forks

OpenTamilWebApp - search engine for Project Madurai corpus #169

Open arcturusannamalai opened 6 years ago

arcturusannamalai commented 6 years ago

Update (2021): The blog post https://bart.degoe.de/building-a-full-text-search-engine-150-lines-of-code/ and its source show how to build a search index using a bag-of-words model with term-frequency/inverse-document-frequency (TF-IDF) ranking. We can use that technique to build a module based on https://github.com/bartdegoede/python-searchengine/
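To make the idea concrete, here is a minimal sketch of a bag-of-words inverted index with TF-IDF scoring. It is not code from the linked repository; the names `tokenize` and `SearchIndex` are hypothetical, and the tokenizer is an assumption (plain whitespace splitting, because Python's `\w` in `re` does not match Tamil vowel signs and would break words apart).

```python
import math
import string
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace-separated words with ASCII punctuation stripped.
    tokens = (w.strip(string.punctuation) for w in text.split())
    return [t for t in tokens if t]


class SearchIndex:
    """Hypothetical in-memory inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(Counter)
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in tokenize(text):
            self.postings[term][doc_id] += 1

    def idf(self, term):
        # Smoothed inverse document frequency; rarer terms score higher.
        df = len(self.postings.get(term, ()))
        return math.log((1 + len(self.docs)) / (1 + df)) + 1.0

    def search(self, query):
        # Sum tf * idf over the query terms and return doc ids, best first.
        scores = Counter()
        for term in tokenize(query):
            for doc_id, tf in self.postings.get(term, {}).items():
                scores[doc_id] += tf * self.idf(term)
        return [doc_id for doc_id, _ in scores.most_common()]


if __name__ == "__main__":
    index = SearchIndex()
    index.add("doc1", "கணிதம் ஒரு அறிவியல்")
    index.add("doc2", "கணிதம் கணிதம் பாடம்")
    print(index.search("கணிதம்"))
```

A real module would probably want stemming or morphological normalization for Tamil before indexing; that is left out here.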

Search engine for Project Madurai corpus:

Search by name, author, title, genre, etc. is already implemented in MinMadurai [மின் மதுரை]: https://github.com/Ezhil-Language-Foundation/MinMadurai

We need full-text search, built on an index mapping words/phrases to documents, and a concordance database for the Madurai corpus. Together, these two items will provide good cross-referencing and utility for research purposes.
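As a rough illustration of the concordance half, the sketch below lists every occurrence of a keyword with a few words of context on either side (a keyword-in-context view). The function name `concordance` and the `width` parameter are hypothetical, and whitespace tokenization is an assumption.

```python
import string


def concordance(text, keyword, width=3):
    """Return (left context, matched token, right context) for each hit."""
    tokens = text.split()
    hits = []
    for i, tok in enumerate(tokens):
        # Compare without surrounding ASCII punctuation such as commas.
        if tok.strip(string.punctuation) == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits
```

Storing these tuples per document would give the concordance database; this exact-match version does not handle inflected forms of a word.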

For example, we can show the number of words, characters, and paragraphs in a given work in the Project Madurai corpus; this is basic. A researcher or user may want to know: How many documents reference 'கணிதம்' (mathematics) or 'மோக்‌ஷம்' (liberation)? Which sentences reference 'மணப்பெண்' (bride) and 'மாமியார்' (mother-in-law)? Which documents were written by a specific author?
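The basic statistics and the sentence query could be sketched as follows. The function names are hypothetical, and the sentence and paragraph splitting rules (blank lines between paragraphs; `.`, `!`, `?` ending sentences) are assumptions about how the corpus text is laid out.

```python
import re
import string


def text_stats(text):
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return {
        "words": len(text.split()),
        # Counts Unicode code points, not Tamil letters (a letter such as
        # "கி" is two code points).
        "characters": len(text),
        "paragraphs": len(paragraphs),
    }


def sentences_with_all(text, words):
    """Return the sentences that contain every word in `words`."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    matches = []
    for sentence in sentences:
        tokens = {t.strip(string.punctuation) for t in sentence.split()}
        if all(w in tokens for w in words):
            matches.append(sentence)
    return matches
```

For a letter count rather than a code-point count, Open Tamil's letter-splitting utilities (for example `tamil.utf8.get_letters`) would likely be the better fit.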

This could form a framework that loads the index and corpus files [separately] for each corpus and analyzes them within Open Tamil.
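One possible shape for keeping the index separate from the corpus is one JSON index file per corpus, loaded on demand. The file naming (`<corpus>.index.json`) and the function names below are hypothetical; nothing like this exists in Open Tamil yet.

```python
import json
from pathlib import Path

INDEX_SUFFIX = ".index.json"  # hypothetical naming convention


def save_index(postings, path):
    # ensure_ascii=False keeps Tamil terms readable in the file.
    Path(path).write_text(json.dumps(postings, ensure_ascii=False), encoding="utf-8")


def load_index(path):
    return json.loads(Path(path).read_text(encoding="utf-8"))


def load_corpora(index_dir):
    """Map corpus name -> postings for every index file in a directory."""
    return {
        p.name[:-len(INDEX_SUFFIX)]: load_index(p)
        for p in sorted(Path(index_dir).glob("*" + INDEX_SUFFIX))
    }
```

The corpus text itself can then stay in its own files, and only the index needs to be in memory for queries.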

Further, if you add NLP parsing to the query, beautiful things will happen :-)

abuvanth commented 5 years ago

We can use Elasticsearch.

Reference : https://blog.patricktriest.com/text-search-docker-elasticsearch/

Example app - https://search.patricktriest.com/

Source code - https://github.com/triestpa/guttenberg-search

thanks

arcturusannamalai commented 4 years ago

Thanks for thinking about this project @abuvanth - currently it is in an idle state until someone chooses to run it. My projects for the rest of this year up to 2020 are https://ezhillang.blog/2019/09/21/ஆடுகளம்-2020/

ssurenr commented 4 years ago

Where can we find the WIP repo for this? Looks like something that I could jump-start without having to read a lot 😀

