Refefer / fastxml

FastXML / PFastXML / PFastreXML - Implementation of Extreme Multi-label Classification

How to perform performance evaluation? #7

Closed hncheung23 closed 6 years ago

hncheung23 commented 6 years ago

Sorry, I am new to multi-label classification. I want to know how I can evaluate performance on the testing dataset in terms of accuracy. Also, can someone explain how to read the result of the prediction? What I see is "Label{...................} Predict {................}".

Any help would be greatly appreciated.

Refefer commented 6 years ago

@hncheung23 If you're using fxml.py, you can pass in the --score flag when using 'inference'. It should add the scores to the resulting output.

Can you post a snippet of the results you're finding confusing?

hncheung23 commented 6 years ago

@Refefer Thanks for your reply. You are really helpful. Actually, I am doing my final year project with extreme multi-labeling. Therefore, I have a lot of things to ask. Sorry about that.

{"labels": [31, 33, 67], "predict": [["67", -0.07443475723266602], ["33", -0.14996011555194855], ["66", -0.677146315574646], ["31", -0.8062863349914551], ["65", -1.2875503301620483], ["34", -1.4505908489227295], ["24", -1.5509711503982544], ["93", -1.5873050689697266], ["51", -1.842068076133728], ["84", -1.842068076133728]], "ndcg": [1.0, 0.7653606369886217, 0.9674679834891693], "precision": [1.0, 0.6666666666666666, 0.6], "pSndcg": [1.05152, 0.8064944879943291, 0.745928174727603]}

How can I interpret this kind of information?

P@1: 0.7777777777777778 P@3: 0.5925925925925926 P@5: 0.4666666666666667
NDCG@1: 0.7777777777777778 NDCG@3: 0.6474225761047898 NDCG@5: 0.6614064326473282

"labels": [31, 33, 67] <- means actual labels?

predict": [["67", -0.07443475723266602], ["33", -0.14996011555194855], ["66", -0.677146315574646],.............. <- means predicted label with log loss?

Refefer commented 6 years ago

It's easier to look at it in a pretty printed format:

{ "labels": [ 31, 33, 67
],
"predict": [ [ "67", -0.07443475723266602 ], [ "33", -0.14996011555194855 ], [ "66", -0.677146315574646 ], [ "31", -0.8062863349914551 ], [ "65", -1.2875503301620483 ], [ "34", -1.4505908489227295 ], [ "24", -1.5509711503982544 ], [ "93", -1.5873050689697266 ], [ "51", -1.842068076133728 ], [ "84", -1.842068076133728 ] ],
"ndcg": [1, 0.7653606369886217, 0.9674679834891693],
"precision": [1, 0.6666666666666666, 0.6 ],
"pSndcg": [1.05152, 0.8064944879943291, 0.745928174727603 ] }

The JSON bundle has a couple of different fields:

- "labels": the ground-truth labels for the document.
- "predict": the predicted labels, ranked by score, highest first (the --score flag is what adds these scores to the output).
- "ndcg": NDCG at 1, 3, and 5.
- "precision": precision at 1, 3, and 5.
- "pSndcg": propensity-scored NDCG at 1, 3, and 5.
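If you want a single aggregate number over the whole test set, like the P@k / NDCG@k summary shown earlier in the thread, you can also average the per-document fields yourself. A minimal sketch, assuming the inference output is one JSON object per line with the fields listed above ("results.json" is just a placeholder file name):

```python
import json

def aggregate_metrics(path, ks=(1, 3, 5)):
    """Average the per-document "precision" and "ndcg" arrays over an output file.

    Assumes one JSON object per line, with each array entry corresponding
    to the cutoffs in `ks` (e.g. P@1, P@3, P@5).
    """
    sums = {"precision": [0.0] * len(ks), "ndcg": [0.0] * len(ks)}
    n = 0
    with open(path) as f:
        for line in f:
            doc = json.loads(line)
            for metric, acc in sums.items():
                for i, value in enumerate(doc[metric]):
                    acc[i] += value
            n += 1
    return {metric: {f"@{k}": s / n for k, s in zip(ks, acc)}
            for metric, acc in sums.items()}

# Usage ("results.json" is a placeholder for the inference output):
# print(aggregate_metrics("results.json"))
```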

hncheung23 commented 6 years ago

In classification, we usually use an accuracy metric to compute the accuracy of the prediction. However, for multi-label problems, how can we evaluate the performance of the prediction? For example, in this case, we have "labels": [ 31, 33, 67 ], "predict": [ [ "67", -0.07443475723266602 ], [ "33", -0.14996011555194855 ], [ "66", -0.677146315574646 ], [ "31", -0.8062863349914551 ], [ "65", -1.2875503301620483 ], [ "34", -1.4505908489227295 ], [ "24", -1.5509711503982544 ], [ "93", -1.5873050689697266 ], [ "51", -1.842068076133728 ], [ "84", -1.842068076133728 ] ]. How can we determine how "good" the prediction is in this situation? Other than passing in the --score flag, is there any other flag or method in fxml.py for performing performance evaluation?

Thank you very much!

Refefer commented 6 years ago

When dealing with extreme multi-label classification, you really have two choices: look at precision, or treat it as a ranking problem. Most authors treat it as a ranking problem, since accuracy is too myopic due to the propensity issue: there are typically many labels that should be on a document but aren't. You can easily imagine a number of cases where the full set of correct labels for a document has never been enumerated.

NDCG is a good metric for determining how "good" a prediction actually is: predictions that rank more of the correct labels higher score better than predictions that rank them lower. Propensity weighting of NDCG is even better, since vanilla NDCG is heavily biased toward head labels, that is, the most frequent labels. I'd recommend reading the paper "Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking & Other Missing Label Applications", which I list on the main page, for a discussion of this.
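To make the ranking metrics concrete, here is a minimal Python sketch of precision@k and NDCG@k for a single document, using the standard binary-relevance definitions. This is illustrative rather than fxml.py's own code, but it should reproduce the per-document values from the example output above:

```python
import math

def precision_at_k(true_labels, ranked_predictions, k):
    """Fraction of the top-k predicted labels that are actually correct."""
    return sum(1 for label in ranked_predictions[:k] if label in true_labels) / k

def ndcg_at_k(true_labels, ranked_predictions, k):
    """NDCG@k with binary relevance: a hit at 1-based rank r contributes 1 / log2(r + 1)."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, label in enumerate(ranked_predictions[:k])
              if label in true_labels)
    ideal_hits = min(len(true_labels), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# The example document above: true labels {31, 33, 67}, predictions ranked by score.
truth = {"31", "33", "67"}
ranked = ["67", "33", "66", "31", "65"]
print(precision_at_k(truth, ranked, 3))  # 0.666..., matching "precision"[1] above
print(ndcg_at_k(truth, ranked, 3))       # 0.7653..., matching "ndcg"[1] above
```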

hncheung23 commented 6 years ago

I do not understand how the model can be trained if some labels never appear in the training dataset.

Refefer commented 6 years ago

@hncheung23 let me put it another way. Let's assume we're trying to learn tags for Wikipedia articles: there are millions of potential tags across millions of articles. Knowing that Wikipedia is human curated, how likely do you think it is that all articles have all of their relevant tags enumerated? Those tags could happily exist on other documents, but do you believe that every relevant tag has been applied to every relevant article across the entire Wikipedia dataset?

If you agree that it is highly likely there are articles missing perfectly reasonable tags, the question naturally becomes: how does this impact our normal evaluation metrics? If our model learns correctly, it's going to suggest perfectly relevant tags for our Wikipedia articles that aren't in our "gold" set. But if we use accuracy, we're going to claim the model does worse than it actually does: all those correct tags that were simply omitted from the articles will be heavily penalized by the metric. That doesn't seem right.

That's where treating it as a ranking problem, and in particular propensity-weighted ranking, comes into play.
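For concreteness, here is a hedged sketch of propensity-scored precision@k in the spirit of the Jain et al. paper cited above. The propensity model and the A and B constants are my reading of that paper, not fxml's exact implementation; A and B are dataset-dependent, and the values below are only commonly used defaults:

```python
import math
from collections import Counter

def label_propensities(train_label_sets, A=0.55, B=1.5):
    """Empirical label propensities in the spirit of Jain et al. (2016):
    p_l = 1 / (1 + C * exp(-A * log(N_l + B))), with C = (log N - 1) * (B + 1)**A,
    where N is the number of training points and N_l the count of label l.
    A and B are dataset-dependent constants; these are common defaults."""
    N = len(train_label_sets)
    counts = Counter(label for labels in train_label_sets for label in labels)
    C = (math.log(N) - 1.0) * (B + 1.0) ** A
    return {label: 1.0 / (1.0 + C * math.exp(-A * math.log(n + B)))
            for label, n in counts.items()}

def psp_at_k(true_labels, ranked_predictions, propensities, k):
    """Propensity-scored precision@k: a hit on a rare (low-propensity) label
    counts for more than a hit on a frequent one. Results are usually reported
    relative to the best achievable value on the same document."""
    gain = sum(1.0 / propensities[label]
               for label in ranked_predictions[:k]
               if label in true_labels and label in propensities)
    return gain / k
```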

hncheung23 commented 6 years ago

Thanks for your comment! This topic is really interesting.

hncheung23 commented 6 years ago

When I took a look at the Mediamill dataset, I saw some rows of data missing some of the dimensions. How can we address the problem of missing dimensions?

Refefer commented 6 years ago

They're sparse vectors. Omitted data only means that the feature is zero.
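To make that concrete, here is a hedged sketch of how one of those sparse rows can be expanded, assuming the usual extreme-classification repository format ("label,label idx:val idx:val ...") that Mediamill is distributed in; the example row and the parse_sparse_line helper are purely illustrative:

```python
import numpy as np

def parse_sparse_line(line, num_features):
    """Parse a 'label,label idx:val idx:val ...' row into (labels, dense vector).
    Any feature index that is not listed is simply zero, not missing."""
    parts = line.strip().split()
    has_labels = bool(parts) and ":" not in parts[0]
    labels = [int(l) for l in parts[0].split(",")] if has_labels else []
    features = np.zeros(num_features, dtype=np.float32)
    for token in (parts[1:] if has_labels else parts):
        idx, val = token.split(":")
        features[int(idx)] = float(val)
    return labels, features

# Illustrative row: labels 31 and 33, only three of 120 features are non-zero.
labels, x = parse_sparse_line("31,33 0:0.25 7:1.5 119:0.03", num_features=120)
```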