Open DaliaAymanAmeen opened 4 years ago
Can you suggest a use case(example) for this algorithm through which we can decide what all is needed to get this algorithm implemented? The use case can include the inputs to a function(in which the algorithm you are suggesting is implemented) and the output of the function.
Is this document relevant to the algorithm?
Yes it does. And in our design, the input to the function will be (the folder that contains document files, and the word to be found). As for the output it will be the number of the documents that contain that word and the names of the documents themselves. Let us know what do you think?
the input to the function will be (the folder that contains document files, and the word to be found).
So both string inputs, first one path to the folder and the second word to be found. Though exact meaning of document in the context of this algorithm need to be figured out. Wouldn't it be more natural and practical to have a list of strings and a word which is to be searched for in the list. Is this possible to do it? I will read the paper and will give you a more clear opinion on the feasibility of the algorithm.
So, this is what I understood of the requirements for the algorithm after going through first few pages of the document,
So, we should go for a list of strings as a list of documents and the map reducer will generate the postings list for each term. What do you say on this?
i had studied Information retrieval and implementation and inverted index,So can I take up this issue to solve?
This issue isn't on priority right now. You may go through https://github.com/codezonediitj/pydatastructs/wiki/Planned-Features-for-v0.0.1 as it contains those features which we are interested to complete at the moment.
Description of the problem
Searching through text documents is a process that requires lots of computational powers. For a given search keyword, the naive searching algorithm would be to pass through the corpus of documents iteratively and for each document, check whether the word exists in the document. This algorithm will be very slow for a large corpus of documents. If the corpus had D documents each having W words each of length L then the complexity of the naive approach would be O(D W L) (The complexity of naive search in each document is O(W*L) ). Searching algorithm will be optimized . The straightforward way is to build an index. For an index, All the words that occurred in a document will be stored. An inverted index works in the reverse way. Instead of storing the words that occurred in each document, The ids of the documents in which each word has occurred will be stored.
References/Other comments
https://www.elastic.co/blog/found-indexing-for-beginners-part3/ http://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/ditp/ditp_ch4.pdf