codezonediitj / pydatastructs

A python package for data structures and algorithms
https://pydatastructs.readthedocs.io/en/stable/
Other
199 stars 269 forks source link

Inverted Index Algorithm #213

Open DaliaAymanAmeen opened 4 years ago

DaliaAymanAmeen commented 4 years ago

Description of the problem

Searching through text documents is a process that requires lots of computational powers. For a given search keyword, the naive searching algorithm would be to pass through the corpus of documents iteratively and for each document, check whether the word exists in the document. This algorithm will be very slow for a large corpus of documents. If the corpus had ​D​ documents each having ​W​ words each of length ​L​ then the complexity of the naive approach would be O(D W L)​ (The complexity of naive search in each document is ​O(W*L)​ ). Searching algorithm will be optimized . The straightforward way is to build an index. For an index, All the words that occurred in a document will be stored. An inverted index works in the reverse way. Instead of storing the words that occurred in each document, The ids of the documents in which each word has occurred will be stored.

References/Other comments

https://www.elastic.co/blog/found-indexing-for-beginners-part3/ http://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/ditp/ditp_ch4.pdf

czgdp1807 commented 4 years ago

Can you suggest a use case(example) for this algorithm through which we can decide what all is needed to get this algorithm implemented? The use case can include the inputs to a function(in which the algorithm you are suggesting is implemented) and the output of the function.

czgdp1807 commented 4 years ago

Is this document relevant to the algorithm?

DaliaAymanAmeen commented 4 years ago

Is this document relevant to the algorithm?

Yes it does. And in our design, the input to the function will be (the folder that contains document files, and the word to be found). As for the output it will be the number of the documents that contain that word and the names of the documents themselves. Let us know what do you think?

czgdp1807 commented 4 years ago

the input to the function will be (the folder that contains document files, and the word to be found).

So both string inputs, first one path to the folder and the second word to be found. Though exact meaning of document in the context of this algorithm need to be figured out. Wouldn't it be more natural and practical to have a list of strings and a word which is to be searched for in the list. Is this possible to do it? I will read the paper and will give you a more clear opinion on the feasibility of the algorithm.

czgdp1807 commented 4 years ago

So, this is what I understood of the requirements for the algorithm after going through first few pages of the document,

  1. The algorithm operates mostly operates on strings. The document can be anything a long string, a file, etc. The query is also a single term. That's just a higher level overview.
  2. Before implementing the above algorithm, we need to implement map-reduce. A mapper just maps the terms to positngs(pair of doc ids and payload). The reducer takes the results of all the mappers and combines them together.

So, we should go for a list of strings as a list of documents and the map reducer will generate the postings list for each term. What do you say on this?

Kiran-Venkatesh commented 3 years ago

i had studied Information retrieval and implementation and inverted index,So can I take up this issue to solve?

czgdp1807 commented 3 years ago

This issue isn't on priority right now. You may go through https://github.com/codezonediitj/pydatastructs/wiki/Planned-Features-for-v0.0.1 as it contains those features which we are interested to complete at the moment.