MrTyton / Fanfiction-Recommendation

Evaluation #4

Open MrTyton opened 8 years ago

MrTyton commented 8 years ago

How exactly do we want to evaluate this? I have the data split, but what's our measuring metric? Remove Y stories from the profile, recommend based on the rest, generate X recommendations, and then score something like the % of the Y stories that we hit, scaled by X maybe? So if we're trying to predict 4 stories and we get 2 of them out of 10 possible recommendations, then we have 2/4 * 4/10? Or we could do MRR if the recommendations have an order to them. Or we could just do recall, the fraction that we actually got correct, 2/4, but I think we'd want some way of scaling with the number of recommendations that we make.
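
A minimal sketch of the two set-based options being floated here, assuming `recommended` and `held_out` are collections of story IDs (the names are illustrative, not from the repo):

```python
def recall_at_k(recommended, held_out):
    """Fraction of the held-out favorites that show up in the recommendations."""
    hits = len(set(recommended) & set(held_out))
    return hits / len(held_out)

def scaled_score(recommended, held_out):
    """Recall scaled by list size, per the 2/4 * 4/10 example above."""
    hits = len(set(recommended) & set(held_out))
    return (hits / len(held_out)) * (len(held_out) / len(recommended))
```

Worth noting: the scaled version algebraically cancels to hits / len(recommended), i.e. plain precision over the recommendation list, which may or may not be the intent.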

depthfirst commented 8 years ago

What do you mean by having the data split? I talked to Chetan about this yesterday, and his suggestion was to take 5% of a user's favorites away from training, then rank the entire corpus for that user (minus the favorites in training, I presume) and take the average of where the held-out stories fell in the ranked list. I was thinking MRR as well; it's just going to be really small. I've been using perplexity to evaluate topic modeling, but no one really knows what perplexity means.
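
For concreteness, a sketch of MRR over held-out ranks, assuming each user's held-out favorites have already been located (1-based positions) in that user's full corpus ranking; hypothetical names throughout:

```python
def mean_reciprocal_rank(held_out_ranks_per_user):
    """held_out_ranks_per_user: one list per user, containing the 1-based
    positions of that user's held-out favorites in the full corpus ranking."""
    reciprocals = [1.0 / min(ranks)  # standard MRR uses the first (best) hit
                   for ranks in held_out_ranks_per_user if ranks]
    return sum(reciprocals) / len(reciprocals)
```

Against a corpus of ~100K stories, even a decent ranker will usually place the first hit deep in the list, which is why the resulting values come out tiny.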

MrTyton commented 8 years ago

I split the data into 60/20/20 by authors. Then in the dev/test sets, for each of the authors, I removed 1/3 of their favorites and stuck them in a separate list. And yeah, the ranking thing is something I was thinking as well, but depending on how we do stuff we might not get a 'ranking' per se; k-means clustering isn't likely to provide us a ranking of stories, for instance.
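
A sketch of that split, assuming the favorites live in a dict mapping author -> list of story IDs (illustrative structure, not the repo's actual data format):

```python
import random

def split_data(favorites_by_author, seed=0):
    """60/20/20 author split; in dev/test, hold out 1/3 of each author's
    favorites into a separate dict. Mutates favorites_by_author in place."""
    rng = random.Random(seed)
    authors = list(favorites_by_author)
    rng.shuffle(authors)
    n = len(authors)
    train = authors[: int(0.6 * n)]
    dev = authors[int(0.6 * n): int(0.8 * n)]
    test = authors[int(0.8 * n):]

    held_out = {}
    for author in dev + test:
        favs = list(favorites_by_author[author])
        rng.shuffle(favs)
        k = len(favs) // 3
        held_out[author] = favs[:k]
        favorites_by_author[author] = favs[k:]
    return train, dev, test, held_out
```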

depthfirst commented 8 years ago

That's easy; there are many metrics to use for ranking: distance to cluster centroid, cosine similarity, etc.

MrTyton commented 8 years ago

OK, sure, I'll have that finished and written up in a few hours.

depthfirst commented 8 years ago

I wrote an evaluation function, sort of, but not in a generic way. First attempt with 100 readers, 5% of their favorites held out, evaluated against 100K stories. We need a hyperfast scoring function to do the entire corpus for each reader. Overall Results (MRR): 1.7724e-03. That's just comparing new stories to favorite stories by their topic proportions, using 1 − cosine similarity.
I don't think your data split is suitable for unsupervised learning. You could have just removed favorites and skipped splitting the data into sets.
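
On the "hyperfast scoring function" point: a sketch of a vectorized scoring pass, assuming the topic proportions are stored as NumPy arrays (this is not the repo's actual evaluation code):

```python
import numpy as np

def rank_corpus_for_reader(corpus_topics, favorite_topics):
    """Score every story by its mean cosine similarity to the reader's
    favorites, then return story indices ordered best-first.

    corpus_topics:   (n_stories, n_topics) topic proportions
    favorite_topics: (n_favs, n_topics) topic proportions of kept favorites
    """
    # Row-normalize so plain dot products become cosine similarities.
    corpus = corpus_topics / np.linalg.norm(corpus_topics, axis=1, keepdims=True)
    favs = favorite_topics / np.linalg.norm(favorite_topics, axis=1, keepdims=True)
    sims = corpus @ favs.T        # (n_stories, n_favs) similarity matrix
    scores = sims.mean(axis=1)    # average similarity to the favorites
    return np.argsort(-scores)    # highest-scoring stories first
```

A single matrix multiply scores all 100K stories against a reader's favorites at once, which is roughly the shape a full-corpus-per-reader evaluation needs.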

MrTyton commented 8 years ago

Is that 5% rounded up? Because there are a lot of people who only have, like, 2-3 favorites.

I can do that removal too; do you want me to?

depthfirst commented 8 years ago

I wrote the query to fetch only the authors with 5 or more favorites, and if 5% rounds to 0, I set it to 1. It seems like you already removed 1/3 of favorites in dev and test? That should be more than enough. Just combine all the sets and use those authors with removed favorites for testing... maybe add some favorites back.
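
A sketch of that eligibility and holdout-sizing logic (hypothetical names; the actual filtering happens in the database query):

```python
def eligible_authors(favorites_by_author, min_favorites=5):
    """Keep only authors with at least min_favorites favorites."""
    return {a: f for a, f in favorites_by_author.items()
            if len(f) >= min_favorites}

def holdout_size(n_favorites, fraction=0.05):
    """Hold out 5% of an author's favorites, rounded down, but never zero."""
    return max(1, int(n_favorites * fraction))
```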

MrTyton commented 8 years ago

Do you want me to do the removal for the train set as well?
