HIIT / dime-server

Your humble Digital Work Me (DiMe) server.

http://hiit.github.io/dime-server/

idf problem #48

Open msjoberg opened 8 years ago

msjoberg commented 8 years ago

When a document gets a lot of ReadingEvents, it (or parts of it) will be indexed many times, which raises the document frequency of its terms and thus reduces their inverse document frequency (idf). This should probably be fixed in DiMe's indexing.

One way is to somehow modify the idf function in Lucene: http://www.lucenetutorial.com/advanced-topics/scoring.html
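To make the effect concrete, here is a minimal sketch of how the idf of a term drops once ReadingEvents are indexed as separate documents. It assumes the classic Lucene TF-IDF form `1 + ln(numDocs / (docFreq + 1))`; the index size and the number of ReadingEvents are made-up numbers for illustration, not taken from DiMe.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """idf in the classic Lucene TF-IDF form: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical index: 100 documents, exactly one of which contains the term.
idf_before = classic_idf(doc_freq=1, num_docs=100)

# Assumption: the document is read 50 times, and every ReadingEvent that
# contains the term is indexed as a separate Lucene document.
reading_events = 50
idf_after = classic_idf(
    doc_freq=1 + reading_events,
    num_docs=100 + reading_events,
)

print(f"idf before: {idf_before:.3f}, idf after: {idf_after:.3f}")
```

The term is still unique to one real document, but its idf falls because the index counts every ReadingEvent as one more document containing it.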

agisbrec commented 8 years ago

I am not sure whether this is necessary. If a document is accessed often, then it is probably highly relevant and should be marked as such. It might be problematic if a document is accessed on a regular basis, e.g. every day; in that case an upper limit on the document frequency might be necessary.

An alternative would be to mark the document as frequently accessed in DiMe and not index it in Lucene every time.
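One possible reading of an upper limit on the document frequency is to let each underlying document contribute at most once per term, no matter how many ReadingEvents point to it. The sketch below shows only the counting; `doc_freq_per_document` and the `(document_id, terms)` input shape are hypothetical and not part of DiMe or Lucene.

```python
from collections import defaultdict


def doc_freq_per_document(indexed_items):
    """Count, per term, the number of distinct underlying documents.

    indexed_items: iterable of (document_id, terms) pairs, one pair per
    indexed item (the document itself or one of its ReadingEvents).
    """
    docs_by_term = defaultdict(set)
    for document_id, terms in indexed_items:
        for term in set(terms):
            docs_by_term[term].add(document_id)
    return {term: len(ids) for term, ids in docs_by_term.items()}


items = [
    ("doc-1", ["lucene", "scoring"]),  # the document itself
    ("doc-1", ["lucene"]),             # a ReadingEvent on doc-1
    ("doc-1", ["lucene", "scoring"]),  # another ReadingEvent on doc-1
    ("doc-2", ["scoring"]),            # a second, unrelated document
]
df = doc_freq_per_document(items)
print(df)
```

With this counting, repeated reads of `doc-1` no longer inflate the document frequency of `lucene`.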

jmakoske commented 8 years ago

Yes, but the effect is the opposite: the ReadingEvents reduce the idf of the terms appearing in them, because every ReadingEvent is indexed as a new document and idf gives high values only to terms that appear in a small number of documents.

agisbrec commented 8 years ago

You are right, thanks for the clarification. One could override idf() in TFIDFSimilarity to return a constant value for all terms. https://lucene.apache.org/core/5_3_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html Unfortunately, this would break the search for all documents, so I think this is not the right way to go.
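A rough sketch of why a constant idf hurts ranking in general: without idf, a very common term repeated a few times outweighs a single rare term. The scoring below is a deliberately simplified `tf * idf` sum, not Lucene's full formula, and the document frequencies are invented for the example.

```python
import math

NUM_DOCS = 1000
DOC_FREQ = {"the": 1000, "lucene": 10}  # hypothetical document frequencies


def classic_idf(term):
    """idf in the classic Lucene TF-IDF form."""
    return 1.0 + math.log(NUM_DOCS / (DOC_FREQ[term] + 1))


def constant_idf(term):
    """What a constant-returning idf() override would amount to."""
    return 1.0


def score(query, term_counts, idf):
    """Simplified TF-IDF score: sum of tf * idf over the query terms."""
    return sum(term_counts.get(term, 0) * idf(term) for term in query)


query = ["the", "lucene"]
doc_a = {"the": 5}     # only the very common term, five times
doc_b = {"lucene": 1}  # only the rare term, once

classic = (score(query, doc_a, classic_idf), score(query, doc_b, classic_idf))
constant = (score(query, doc_a, constant_idf), score(query, doc_b, constant_idf))
print("classic idf  (A, B):", classic)
print("constant idf (A, B):", constant)
```

With the classic idf, the document containing the rare term ranks first; with a constant idf, the order flips and raw term counts decide.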

Is it necessary to index the reading events? Would it make sense to enter a query and get multiple copies of the same document? I think the document should be indexed only once, and the reading events should be treated as feedback that the document is relevant.

mvsjober commented 8 years ago

As it is currently implemented, ReadingEvents are indexed, but you do not retrieve multiple copies of a document. Instead, we map the ReadingEvents to their corresponding documents and retain only the highest-ranking version of each document. This way, if the search query matches the text read in a ReadingEvent particularly well, it will push the corresponding document up.

Of course this effect could be achieved with some other mechanism, but I don't want to lose the connection to the ReadingEvent; for example, some application may want to highlight the read passage for matching documents.
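The mapping described above can be sketched roughly as follows: every hit, whether it is the document itself or one of its ReadingEvents, is folded into its document, and only the best-scoring hit per document is kept. The hit tuple layout and the name `collapse_hits` are hypothetical; this is not DiMe's actual code.

```python
def collapse_hits(hits):
    """Keep the highest-scoring hit per document.

    hits: list of (item_id, document_id, score) tuples, where item_id
    identifies either the document itself or one of its ReadingEvents.
    Returns one hit per document, sorted by descending score.
    """
    best = {}
    for item_id, document_id, score in hits:
        current = best.get(document_id)
        if current is None or score > current[2]:
            best[document_id] = (item_id, document_id, score)
    return sorted(best.values(), key=lambda hit: hit[2], reverse=True)


hits = [
    ("doc-1", "doc-1", 0.4),      # the document itself
    ("event-17", "doc-1", 0.9),   # a ReadingEvent that matches well
    ("doc-2", "doc-2", 0.6),
    ("event-23", "doc-2", 0.2),
]
results = collapse_hits(hits)
print(results)
```

Keeping `item_id` in the result preserves the link to the matching ReadingEvent, so an application could still highlight the passage that was read.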