Refefer / fastxml

FastXML / PFastXML / PFastreXML - Implementation of Extreme Multi-label Classification

How to calculate ndcg from the clf predictions #32

Closed ventouris closed 4 years ago

ventouris commented 4 years ago

I am a little bit lost. I saw in bin/fxml.py that you compute ndcg along with other performance metrics. However, the variable names there are hard for me to follow, so I am struggling to reproduce it.

X = csr_matrix(X_train.values)
X = [X[i].astype('float32') for i in range(X.shape[0])]
y = [[int(k) for k in list(np.where(i==1)[0])] for i in y_train.values]

w = weights.propensity(y)    
trainer = Trainer(n_trees=32, n_jobs=-1, leaf_classifiers=True)
trainer.fit(X,y, w)
trainer.save("multilabel_default_fastreXML.h5")

X = csr_matrix(X_test.values)
X = [X[i].astype('float32') for i in range(X.shape[0])]
clf = Inferencer(helpers.getModelsPath() + "multilabel_default_fastreXML.h5")
y_pred = clf.predict(X)

This is what I do, with X_train, y_train, X_test and y_test as dataframes. The prediction works, but I am not sure how to proceed and use your functions to get the ndcg. Any idea?

Refefer commented 4 years ago

Hey there, is the question how to compute the rank metrics directly from the predictions, in a script outside of fxml.py?

ventouris commented 4 years ago

Exactly. In the example above, where I run FastXML manually (not through fxml.py), how can I compute the ndcg metric?

Refefer commented 4 years ago

Got it :)

The ndcg function in fastxml takes a list of relevancy scores, ordered by predicted rank, and computes the ndcg from it.

For example:

scores = [1,2,0,3,2]
ndcg(scores, 3) 

This computes the ndcg@3 for those relevancy scores, where the first score (1) belongs to the highest-ranked item.
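
For intuition, here is a minimal standalone sketch of what an NDCG@k over an ordered relevancy list typically looks like. It assumes a linear gain and a log2 rank discount, and it builds the ideal ordering from the same list of scores; fastxml's own ndcg may differ in those details. The names dcg_at_k and ndcg_at_k are made up for this illustration and are not part of fastxml.

```python
import math


def dcg_at_k(scores, k):
    """Discounted cumulative gain over the first k relevancy scores.

    Assumption: linear gain (the score itself) and a log2(rank + 1) discount,
    with ranks starting at 1.
    """
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(scores[:k]))


def ndcg_at_k(scores, k):
    """DCG@k divided by the DCG@k of the same scores sorted best-first."""
    ideal = dcg_at_k(sorted(scores, reverse=True), k)
    # No relevant items at all: define the metric as 0 to avoid dividing by zero.
    return dcg_at_k(scores, k) / ideal if ideal > 0 else 0.0


scores = [1, 2, 0, 3, 2]  # position 0 is the top-ranked prediction
print(ndcg_at_k(scores, 3))
```

A list that is already sorted best-first scores 1.0, and the score drops as relevant items slide further down the ranking.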

To adapt your y_pred pretty easily, let's assume the following example:

Y = {117, 31, 12}
Y_pred = clf.predict(X, 'dict') # You can use sparse as well, but you need to do an argmax which is a bit more code
# Assuming you only have one example you're predicting
relevancy_scores = [1 if cls_idx in Y else 0 for cls_idx in Y_pred[0].keys()]
ndcg(relevancy_scores, 5)
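
If you have more than one test example, the same idea can be sketched as a loop that averages the per-example values. The block below is only an illustration under some assumptions: predictions are plain dicts mapping a class index to a score, they are sorted by score explicitly instead of relying on dict order, and a small stand-in ndcg_at_k (linear gain, log2 discount) is used so the snippet runs on its own. The helper names and the sample data are hypothetical; in your script you would pass in the output of clf.predict(X, 'dict') and fastxml's ndcg instead.

```python
import math


def ndcg_at_k(scores, k):
    """Stand-in NDCG@k (linear gain, log2 discount); not fastxml's function."""
    def dcg(values):
        return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(values[:k]))

    ideal = dcg(sorted(scores, reverse=True))
    return dcg(scores) / ideal if ideal > 0 else 0.0


def relevancy_scores(true_labels, pred_dict):
    """Binary relevancy of each predicted class, ordered best score first."""
    ranked = sorted(pred_dict, key=pred_dict.get, reverse=True)
    return [1 if cls_idx in true_labels else 0 for cls_idx in ranked]


def mean_ndcg(y_true, y_pred, k):
    """Average NDCG@k over all examples (0.0 for an empty input)."""
    values = [ndcg_at_k(relevancy_scores(t, p), k) for t, p in zip(y_true, y_pred)]
    return sum(values) / len(values) if values else 0.0


# Hypothetical data: true label sets and per-example {class index: score} dicts.
y_true = [{117, 31, 12}, {5}]
y_pred = [{31: 0.9, 40: 0.5, 117: 0.3}, {7: 0.8, 5: 0.6}]
print(mean_ndcg(y_true, y_pred, 5))
```

Note that the ideal ordering here is built only from the classes that appear in the prediction, so a true label the model never returned does not lower the score in this sketch.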

ventouris commented 4 years ago

Thank you. It's working