ericphanson / arxiv-search

Elasticsearch-backed rewrite of arxiv-sanity
MIT License

Recommendations perf #6

Open ericphanson opened 6 years ago

ericphanson commented 6 years ago

Right now, to recommend papers based on a user's library, we send a list of all the papers in the library to ES, which identifies keywords for those papers and searches against them. This is slow for ES, because it ends up having to weight and search against a very large number of keywords.
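For reference, the behaviour described above matches what Elasticsearch's `more_like_this` query does. The sketch below assumes that is the query in use; the index name, field names and paper IDs are made up for illustration.

```python
def more_like_this_query(library_ids, index="papers",
                         fields=("title", "abstract"), max_query_terms=25):
    """Build a more_like_this body from every paper in a user's library.

    `index` and `fields` are hypothetical; the real mapping may differ.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                # One "like" entry per library paper: ES extracts terms
                # from each referenced document.
                "like": [{"_index": index, "_id": pid} for pid in library_ids],
                "max_query_terms": max_query_terms,
            }
        }
    }


# Made-up IDs, only to show the shape of the request.
body = more_like_this_query(["paper-1", "paper-2", "paper-3"])
```

The `like` list grows by one entry for every paper in the library, which is the part of the request that the topic-based idea would replace.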

A better way seems to be to do some kind of topic modelling. Sam was telling me about it on the board, and I took a picture: img_5748

I forget how it all works, but on the technical side, we would process the papers one by one, building sparse vectors that capture which words are associated with which topics, and for each paper produce a sparse vector holding that paper's weight for each topic.

So a paper could be 20% topic 1 and 30% topic 2, etc.

We would run all the papers through this algo, identify say the top 10 topics for each paper, and upload those as a field on ES, keeping track of the weights.
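A minimal sketch of the "top 10 topics per paper" step, assuming the topic model hands back one weight per topic for each paper. The `topics` field name and the `t<id>` keys are hypothetical, not an existing mapping.

```python
import heapq


def top_topics(distribution, k=10):
    """Keep the k largest topic weights as a sparse {topic_id: weight} dict."""
    largest = heapq.nlargest(k, enumerate(distribution), key=lambda pair: pair[1])
    # Drop zero weights so the stored vector stays sparse.
    return {topic_id: weight for topic_id, weight in largest if weight > 0}


# Toy distribution over 5 topics: 20% topic 0, 30% topic 1, and so on.
paper_topics = top_topics([0.2, 0.3, 0.0, 0.05, 0.45], k=3)

# Body for an ES partial document update; the "topics" field is an assumption.
es_update = {"doc": {"topics": {f"t{t}": w for t, w in paper_topics.items()}}}
```

Only the kept topics and their weights go to ES, so the stored field stays small no matter how many topics the model has.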

Then when a user adds a paper to the library, we would grab the topics/weights, and keep an average interest vector for the user (or something like that). Then when we query Elasticsearch, instead of passing a list of papers to find similar papers to, we would pass a list of top topics (with weights) to score the papers by.
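One way the "average interest vector" could look, as a rough sketch: a running mean over the sparse topic weights of the papers in the library, then a query built from the user's top topics. The query part assumes an Elasticsearch version with the `rank_features` field type and a hypothetical `topics` field with keys like `t3`; it uses the `rank_feature` query's default scoring function, so it is not an exact dot product.

```python
def add_to_interest(interest, n_papers, paper_topics):
    """Fold one paper's sparse topic weights into a running mean.

    Returns the updated interest vector and the new paper count.
    """
    n = n_papers + 1
    updated = {}
    for topic in set(interest) | set(paper_topics):
        old = interest.get(topic, 0.0)
        updated[topic] = old + (paper_topics.get(topic, 0.0) - old) / n
    return updated, n


def score(interest, paper_topics):
    """Sparse dot product between a user's interests and a paper's topics."""
    return sum(w * paper_topics.get(t, 0.0) for t, w in interest.items())


def topic_query(interest, k=10):
    """Build a query from the user's k heaviest topics (field names assumed)."""
    top = sorted(interest.items(), key=lambda kv: kv[1], reverse=True)[:k]
    should = [{"rank_feature": {"field": f"topics.t{t}", "boost": w}}
              for t, w in top]
    return {"query": {"bool": {"should": should}}}


# Two toy papers added to an empty library.
interest, n = {}, 0
for paper in ({0: 0.2, 1: 0.8}, {1: 0.4, 2: 0.6}):
    interest, n = add_to_interest(interest, n, paper)

query = topic_query(interest, k=2)
```

The request size depends on `k`, not on how many papers the user has saved, which is where the expected speed-up would come from.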

This seems like it would be much more performant, would scale well with the size of the user's library, and might give better recs too. We could try different topic modelling algorithms as well.

This is a different type of processing workflow, because (1) there is internal state (the model of which words correspond to what topics), and (2) we don't want to parallelize when learning the model from the corpus.

I propose that an EC2 instance with its own storage be dedicated to the task, along with a dedicated queue. Papers get thrown on the queue and processed one at a time by the EC2 instance, which then updates ES.
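The single-consumer shape of that proposal can be sketched as follows. Everything here is a stand-in: `ToyTopicModel` is a placeholder for a real online topic model, the in-process `queue.Queue` stands in for whatever hosted queue is chosen, and `write_to_es` is a hypothetical callback.

```python
import queue
import threading
from collections import Counter


class ToyTopicModel:
    """Placeholder model; the only point is that it carries internal state."""

    def __init__(self):
        self.word_counts = Counter()
        self.papers_seen = 0

    def update_and_infer(self, text):
        words = text.lower().split()
        self.word_counts.update(words)  # state changes with every paper
        self.papers_seen += 1
        # A real model would return topic weights for the paper here.
        return {"papers_seen": self.papers_seen, "n_words": len(words)}


def worker(paper_queue, model, write_to_es):
    """Single consumer: papers are processed strictly one at a time."""
    while True:
        item = paper_queue.get()
        if item is None:  # sentinel value: stop the worker
            paper_queue.task_done()
            break
        paper_id, text = item
        write_to_es(paper_id, model.update_and_infer(text))
        paper_queue.task_done()


paper_queue = queue.Queue()
model = ToyTopicModel()
written = {}  # stands in for the ES index
thread = threading.Thread(target=worker,
                          args=(paper_queue, model, written.__setitem__))
thread.start()
for paper in [("a", "quantum channel capacity"), ("b", "topic models for text")]:
    paper_queue.put(paper)
paper_queue.put(None)
thread.join()
```

Because only one worker ever touches the model, its state never needs locking, at the cost of throughput being bounded by that one worker.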

larsmennen commented 6 years ago

Fyi, I'm looking into the literature around topic modelling at the moment. Latent Dirichlet allocation seems promising, but I'm also looking at extensions so that we could potentially use the labels we already have (arxiv category) in the algorithm.

ericphanson commented 6 years ago

Very cool! 👍🏼👍🏼 By the way-- Ed and I were looking at the arxiv's 2018 roadmap (https://confluence.cornell.edu/plugins/servlet/mobile?contentId=352650405#content/view/352650405). They are doing a lot, but no recommendations!

larsmennen commented 6 years ago

:+1: There's room for this then! :)

larsmennen commented 6 years ago

Some relevant papers:

I'll read the maths in the latter in more detail, that looks interesting. It combines a traditional matrix-factorisation-based recommender system with LDA to make paper recommendations. They claim that their method automatically balances between recommending papers based on similar libraries when a lot of such data is available (e.g. many similar libraries containing a particular paper) and content-based (e.g. topic-based) recommendations for papers where that is not the case. This would help in recommending papers that were just published, which I think will be important.

That paper also has quite some citations; not sure how much that says, but it at least means it has some track record.
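A minimal sketch of that balancing idea, assuming the model represents each paper as its topic proportions plus an offset learned from the libraries that contain it. The names and numbers below are illustrative only and are not taken from the paper.

```python
def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def predict(user, theta, offset=None):
    """Predicted preference of a user for a paper.

    user   : the user's latent vector
    theta  : the paper's topic proportions (content only)
    offset : correction learned from library data; None for a paper
             that no library contains yet
    """
    if offset is None:
        # No library data: fall back to content alone.
        return dot(user, theta)
    item = [t + e for t, e in zip(theta, offset)]
    return dot(user, item)


user = [1.0, 0.0, 2.0]
theta = [0.5, 0.5, 0.0]

new_paper = predict(user, theta)                       # content only
known_paper = predict(user, theta, [0.0, 0.0, 0.25])   # content + library signal
```

With no offset the score comes from content alone, which is the case that matters for papers that were just published.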

ericphanson commented 6 years ago

"Collaborative Topic Modeling for Recommending Scientific Articles" sounds very much like what we want-- however, I'm not entirely sold on basing the recs on other users' libraries. I get that it's a potentially useful dataset, but (1) we have no users, and want the recs to be useful off the bat, and (2) it's kind of nice having recs based on word similarity alone. In particular, I'm worried about getting weird recs if there are only a few users with small libraries, whose preferences get undue influence since they aren't averaged out over a large userbase.

To elaborate on (2)-- I'm worried about bias and bubbles. In quant-ph, a common tool is https://scirate.com/ which lets people upvote and downvote papers. But what seems to happen (or at least I speculate happens..) is a few people who browse the arxiv via scirate every day upvote the papers they like (and those authored by their friends/collaborators), and then other people browse that day's papers sorted by rank, and only see the top few, which then get more upvotes.

Here, adding to your library is a bit like upvoting-- you're saying the paper is relevant to your interests. But you probably seed your library at the start by adding all your own papers and papers you're interested in (those of collaborators etc). So you're clustering those papers off the bat, which the algo will pick up and propagate. On the other hand, a text-only approach isn't unbiased either, because you probably use similar words/style to your collaborators, which the algo then picks up and uses.

Anyway, I just think we should be a bit careful before using user data over corpus data.

What do you think?