[Discussion] Query Strategy (Recommender + C4)

mgrani commented 9 years ago

While search result quality has improved significantly since the alpha release, we still have issues to deal with.

Current Situation (Briefly Summarized) (see Profile Definition and Query Strategy Page for details)

Clients extract:

Noun Phrases
Entities (locations, persons)
Time
Keywords

The profile is filled based on the personal preferences of a user (e.g. personalized Noun Phrase, Entity, Time and Keyword selection) and sent to the recommender. The recommender creates a query per partner system following the query generation strategy.

Problem Setting: As discussed in Chrome Extension Issue 14, our queries are still too broad because (i) they cover multiple topics, (ii) the sources do not cover all topics well, resulting in very unspecific answers to specific queries, and (iii) the ranking and tuning of the partner systems are heterogeneous.

Proposed Solution: This issue should serve as a discussion point for additional measures we can take to improve queries. My proposal would be as follows: We need to extend the profile such that clients can define a main topic of the query. Consider the paragraph on the History of Liestal on Wikipedia. This paragraph covers topics like the Roman Empire, Napoleon, and the Rhine. All three are very broad topics, but in the context of Liestal they are narrowed down to only a few results. Simply compare the search Liestal AND Napoleon on Europeana vs. Liestal OR Napoleon on Europeana. The results are quite different. Given Liestal as a main topic, we could potentially generate very specific queries, i.e. Liestal AND (Napoleon OR (Roman AND Empire) OR Rhine).

As a consequence

The recommender would need to extend the queries by a main topic and AND this main topic with all other terms ORed, i.e. MAIN TOPIC AND (TOPIC1 OR TOPIC2 OR ...). In case no results are found across all partner systems, the query can be relaxed by a strategy that has yet to be defined. Queries without a main topic should be handled as in the current approach.
The clients can extract the main topic by taking the entity, noun phrase or keyword with the highest frequency and/or significance.
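
To make the two points above concrete, here is a minimal Python sketch, not the recommender's actual code. The function names are hypothetical, picking the main topic by raw frequency is only one reading of "highest frequency and/or significance", and ORing all terms when no main topic is given is an assumption about the current approach. The relaxation strategy is left out because it is not defined yet.

```python
from collections import Counter


def pick_main_topic(terms):
    """Return the most frequent term, or None if there is no unique winner."""
    counts = Counter(terms).most_common(2)
    if not counts:
        return None
    if len(counts) == 2 and counts[0][1] == counts[1][1]:
        return None  # tie: no clear main topic (assumption)
    return counts[0][0]


def group(term):
    """Turn a multi-word term into an AND group, e.g. (Roman AND Empire)."""
    words = term.split()
    if len(words) == 1:
        return words[0]
    return "(" + " AND ".join(words) + ")"


def build_query(terms, main_topic=None):
    """Build MAIN TOPIC AND (TOPIC1 OR TOPIC2 OR ...)."""
    unique = list(dict.fromkeys(terms))  # drop duplicates, keep order
    others = [group(t) for t in unique if t != main_topic]
    if main_topic is None:
        # Assumed fallback for queries without a main topic: OR everything.
        return " OR ".join(others)
    if not others:
        return group(main_topic)
    return f"{group(main_topic)} AND ({' OR '.join(others)})"


terms = ["Liestal", "Napoleon", "Liestal", "Roman Empire", "Rhine"]
query = build_query(terms, pick_main_topic(terms))
```

For the term list above, `query` comes out as Liestal AND (Napoleon OR (Roman AND Empire) OR Rhine).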

Feedback please!

hziak commented 9 years ago

I think one option to tackle this issue might be using the weight field in the secureuserprofile context keywords. For partners with a Lucene backend we can just pass the weight; for other partners we can write other query formulation strategies like the one proposed above.

mgrani commented 9 years ago

Not sure. Weights do not provide a solution for partner systems that do not support weights, leading to poor recommendation quality. Boolean retrieval is the common denominator of the partner systems and I think we have to work around that.

Defining a main topic could be done easily by adding another boolean property isMainTopic to ContextKeywords or ContextEntities. However, before doing that we should

Identify a full spectrum of possible solutions
Know which solution works in our settings.
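
For illustration only, an explicit flag could look roughly like the Python sketch below. The class is a hypothetical stand-in for the real profile format: `text` and `weight` are the field names used elsewhere in this thread, `is_main_topic` mirrors the proposed isMainTopic, and treating "more than one flag set" as "no main topic" is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContextKeyword:
    """Hypothetical stand-in for one entry of ContextKeywords."""
    text: str
    weight: float = 1.0
    is_main_topic: bool = False  # mirrors the proposed isMainTopic property


def main_topic(keywords: List[ContextKeyword]) -> Optional[ContextKeyword]:
    """Return the flagged keyword, or None if the client set no unique flag."""
    flagged = [k for k in keywords if k.is_main_topic]
    # Assumption: zero or several flags both mean "no main topic".
    return flagged[0] if len(flagged) == 1 else None


profile = [
    ContextKeyword("Liestal", is_main_topic=True),
    ContextKeyword("Napoleon"),
]
topic = main_topic(profile)
```

With a flag like this, a client that does not want a main topic simply leaves every entry at the default.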

hziak commented 9 years ago

True, but what I meant was that we can spare ourselves an extra field like isMainTopic by making use of the weight field in the query that we have already defined. That should work without changing the data format. As far as I know the weight field is not really used yet, please correct me if that is wrong.

e.g.

contextKeywords: [ { text: "Liestal", weight: 5 }, { text: "Napoleon", weight: 1 }, { text: "Roman AND Empire", weight: 1 }, ... ]

Lucene : Liestal^5 OR Napoleon^1 OR (Roman AND Empire)^1

Others: Liestal AND (Napoleon OR (Roman AND Empire) OR Rhine)

So the context keyword with the highest weight is handled as the main topic for partners that have no boosting functionality.
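
A rough Python sketch of the two formulations, under stated assumptions: the keyword dicts copy the text/weight fields from the example, the function names are made up, and when several keywords share the highest weight this sketch arbitrarily takes the first one. It is not the partner recommender code.

```python
def wrap(text):
    """Parenthesize multi-word expressions such as 'Roman AND Empire'."""
    return f"({text})" if " " in text else text


def to_lucene(keywords):
    """Boosted OR query for partners with a Lucene backend."""
    return " OR ".join(f"{wrap(k['text'])}^{k['weight']}" for k in keywords)


def to_boolean(keywords):
    """Plain boolean query: highest weight is treated as the main topic."""
    if not keywords:
        return ""
    main = max(keywords, key=lambda k: k["weight"])  # first one wins on ties
    rest = [wrap(k["text"]) for k in keywords if k is not main]
    if not rest:
        return wrap(main["text"])
    return f"{wrap(main['text'])} AND ({' OR '.join(rest)})"


keywords = [
    {"text": "Liestal", "weight": 5},
    {"text": "Napoleon", "weight": 1},
    {"text": "Roman AND Empire", "weight": 1},
]
lucene_query = to_lucene(keywords)
boolean_query = to_boolean(keywords)
```

For these three keywords, `lucene_query` is Liestal^5 OR Napoleon^1 OR (Roman AND Empire)^1 and `boolean_query` is Liestal AND (Napoleon OR (Roman AND Empire)).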

mgrani commented 9 years ago

Ok, that would be a solution, but an implicit one. It is hard to figure out for people writing clients, and there is no way to abstain from having a main topic. What do you do when all weights are equal? Are all contextKeywords main topics? Further, do weights count across categories (Keywords, Entities etc.)?

hziak commented 9 years ago

Good points, but on the other hand, from the view of a developer of a partner recommender, what does the secure user profile tell me? I would have weights and a mainTopic flag. Does the main topic overrule the weight? What happens if the weight for normal terms is higher than the weight of the main topic? What to do if the main topic is not set?

Actually I think all these things are in the hands of the developers working on the partnerRecommender or the frontend. Even if we have an extra flag for the main topic, we can't force a frontend developer to use it. So as a developer of the partner recommender, one always has to consider the fact that this field might be missing or not set.

(ok ... actually we could force them but that might not be a good idea)

philgooch commented 9 years ago

Is it worth considering keyword weighting based on position? E.g. keyword noun phrases occurring within the title and first few paragraphs being of greater importance. Also, for the specific use case of Wikipedia, we could ignore the first 'blurb' paragraph and then take the keyword NPs that the author has linked to other Wikipedia articles, which could be considered of greater importance.
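
As a rough illustration of position-based weighting, here is a small Python sketch. Everything in it is hypothetical: the input fields (`in_title`, `paragraph`, `linked`), the function name, and the weight values are placeholders picked for the example, not tuned numbers, and skipping the Wikipedia blurb paragraph is not modelled.

```python
def weight_phrases(phrases, title_weight=3.0, early_weight=2.0,
                   link_bonus=1.0, early_paragraphs=2):
    """Assign a weight to each noun phrase based on where it occurs."""
    weights = {}
    for p in phrases:
        weight = 1.0  # baseline for phrases deeper in the text
        if p["in_title"]:
            weight = title_weight
        elif p["paragraph"] < early_paragraphs:
            weight = early_weight
        if p["linked"]:
            weight += link_bonus  # author linked it to another article
        # Keep the best weight if a phrase occurs more than once.
        weights[p["text"]] = max(weights.get(p["text"], 0.0), weight)
    return weights


phrases = [
    {"text": "Liestal", "in_title": True, "paragraph": 0, "linked": False},
    {"text": "Napoleon", "in_title": False, "paragraph": 5, "linked": True},
    {"text": "town", "in_title": False, "paragraph": 6, "linked": False},
]
weights = weight_phrases(phrases)
```

The resulting weights could then be used to pick which phrases go into the query and which one is treated as the most important.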

For the History of Liestal section of the Liestal article, we'd get (where bracketed terms imply AND and non-bracketed OR)

(Liestal) (Roman Rhine (St Gotthard Pass) (Burgundian Wars) (Charles Bold) Rheinfelden Habsburgs Napoleon (July Revolution))

e.g. for Europeana

Liestal AND (Roman OR Rhine OR (St AND Gotthard AND Pass) OR (Burgundian AND Wars) OR (Charles AND Bold) OR Rheinfelden OR Habsburgs OR Napoleon OR (July AND Revolution))

which gives reasonable results

mgrani commented 9 years ago

@philgooch, yes, good idea. Would be interesting to do that. However, there is currently no way to tell the recommender the specifics of a query.

@hziak: the detailed implementation is of course recommender specific, but I honestly do not care about implementation specifics. It is a conceptual question. Having an additional flag allows clients to express more semantics for the recommender. Just taking the weights sent by the client and passing them to the partner recommender will not work imho. Just run through the example provided in the Chrome Extension Issue 14. Giving Liestal a weight of 10^6 would not change the outcome, because all items at KIM contain the word Liestal and only 1 item contains an additional term from that query (Napoleon). So weights will do nothing in that case, while a boolean AND will do.
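
A toy Python illustration of that argument. The three items and their terms are made up (they are not real KIM data), and the scoring (sum of the weights of matched terms) is a deliberately simplistic assumption, not how any partner system actually ranks.

```python
# Hypothetical collection: every item mentions Liestal, one mentions Napoleon.
DOCS = {
    "item1": {"Liestal", "church"},
    "item2": {"Liestal", "Napoleon"},
    "item3": {"Liestal", "station"},
}


def or_search(docs, weights):
    """Weighted OR: return all items matching any term, best score first."""
    scores = {
        name: sum(w for term, w in weights.items() if term in terms)
        for name, terms in docs.items()
    }
    hits = [name for name, score in scores.items() if score > 0]
    return sorted(hits, key=lambda name: (-scores[name], name))


def and_search(docs, main, others):
    """Boolean MAIN AND (T1 OR T2 OR ...): filter instead of re-rank."""
    return sorted(
        name for name, terms in docs.items()
        if main in terms and any(o in terms for o in others)
    )


flat = or_search(DOCS, {"Liestal": 1, "Napoleon": 1, "Rhine": 1})
boosted = or_search(DOCS, {"Liestal": 10**6, "Napoleon": 1, "Rhine": 1})
filtered = and_search(DOCS, "Liestal", ["Napoleon", "Rhine"])
```

In this toy setup `flat` and `boosted` are the same list of all three items in the same order, because every item gets the same Liestal boost, while `filtered` contains only item2.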

However, if the recommender is able to infer the best queries from weights alone, I am fine with it. So we agree that on the client we will simply set the weights and the recommender will return better results and resolve the issues mentioned in the Chrome Extension Issue 14?

hziak commented 9 years ago

OK, sorry, I think I missed something here. Adding the flag and changing the query formulation is not a problem. I will make the changes in the next few days once it's agreed on.

mgrani commented 9 years ago

Ok. We will discuss it in the Passau Meeting. Also Phil's approach.

mgrani commented 8 years ago

The following strategy was agreed on for recommender support:

Support a main topic (issue 18)
Support special purpose queries (issue 19)

EEXCESS / recommender

[Discussion] Query Strategy (Recommender + C4) #13