dstein64 / pyfms

A Theano-based Python implementation of Factorization Machines (Rendle 2010).
https://pypi.org/project/pyfms/
MIT License

low performance for known test case #7

Closed: Sandy4321 closed this issue 6 years ago

Sandy4321 commented 6 years ago

Performance on real sparse data (please find the code below), taken from the typical test case at http://srome.github.io/Leveraging-Factorization-Machines-for-Sparse-Data-and-Supervised-Visualization/,

shows that for unbalanced data, preset weights for the targets are needed. It seems your code already has this option, thanks :) Could you share an example of pre-weighted targets, i.e. how to use the weights keyword in the classifier call?
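Roughly what I have in mind is something like the sketch below. The `sample_weight` keyword and the 'balanced' weighting are my guesses, not the confirmed pyfms interface, and `X_train`, `y_train`, and `fm_classifier` refer to the script further down:

```python
# Sketch only: weight each training sample inversely to its class frequency so
# the rare positive class is not swamped by the majority class.
# NOTE: sample_weight is an ASSUMED keyword for pyfms' Classifier.fit(); the
# actual argument name should be taken from the library's example.py.
from sklearn.utils.class_weight import compute_sample_weight

weights = compute_sample_weight(class_weight='balanced', y=y_train)
fm_classifier.fit(X_train, y_train.values, sample_weight=weights,
                  verbosity=5, nb_epoch=50)
```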

Performance example:

```
Epoch 45/50 loss: 0.0018767118101623477, min_loss: 0.0018767118101623477
current_loss = 0.00177938226446583
current_loss = 0.0016889192391447638
current_loss = 0.0016045080355204782
current_loss = 0.0015269844995869277
current_loss = 0.001451900011931204
Epoch 50/50 loss: 0.001451900011931204, min_loss: 0.001451900011931204
```

Factorization Machine Error: 0.18130690948044992

```
             precision    recall  f1-score   support

          0       0.81      1.00      0.89      2786
          1       0.97      0.30      0.45       948

avg / total       0.85      0.82      0.78      3734
```

Confusion Matrix:

```
Predicted  False  True   all
Actual
False       2776    10  2786
True         667   281   948
all         3443   291  3734
```

Simple confusion matrix:

```
Predicted     0    1
Actual
0          2776   10
1           667  281
```

Learned parameter shapes:

```
len v = 2
len v[0] = 1048576
len w1 = 1048576
len w0 = 1
```
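For context, these shapes match the standard second-order factorization machine, ŷ(x) = w0 + Σᵢ w1ᵢ xᵢ + Σ_{i<j} ⟨vᵢ, vⱼ⟩ xᵢ xⱼ: w0 is a scalar bias (hence len w0 = 1), w1 holds one linear weight per feature, and v stores a length-k factor vector for every feature, laid out here as a k × n_features matrix. With k = 2 and FeatureHasher's default of 2^20 = 1048576 features, that gives len v = 2 and len v[0] = 1048576.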

CODE:

S_May3_changed_hashing_size_Theano_FM_sparse.py

C:\Windows\system32>conda install theano <- good

conda install m2w64-toolchain <- destroys theano

S_May2_changed_hashing_size_Theano_FM_sparse.py

S_May2_hashing_FM_sparse.py

S_apr25_FM_sparse

https://github.com/dstein64/PyFactorizationMachines

http://srome.github.io/Leveraging-Factorization-Machines-for-Sparse-Data-and-Supervised-Visualization/

```python
import pickle
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# original: from PyFactorizationMachines.src.pyfm import FactorizationMachineClassifier
from pyfms import Classifier
from pyfms import regularizers
from sklearn.model_selection import train_test_split
from pandas_ml import ConfusionMatrix

# parameters
flag_0_use_all_hash_features_or_1_set_number_of_features = 0
number_of_features_for_Hasher = 12345  # 1234 # 123 # 12345

if 1:
    from sklearn.datasets import fetch_20newsgroups
    twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
    filehandler = open(b"twenty_data.pkl", "wb")
    pickle.dump(twenty_train, filehandler)
else:
    file_name = open("twenty_data.pkl", 'rb')
    twenty_train = pickle.load(file_name)

twenty_train.target_names

# Binary label: 1 if the newsgroup name contains 'comp', else 0
def define_label(x, target_names):
    name = target_names[x]
    return 1 if 'comp' in name else 0

target = pd.Series(twenty_train.target).apply(lambda x: define_label(x, twenty_train.target_names))
target.mean()  # Prevalence of the label

q = 3

def clean_text(text):
    # Basic cleaning
    text = text.replace('\n', '').replace('\t', '').replace('<', '').replace('>', '').replace('|', '')
    return [x for x in text.split(' ') if len(x) > 3]

X = [clean_text(x) for x in twenty_train.data]

q = 1

# Hash away!
if flag_0_use_all_hash_features_or_1_set_number_of_features:
    fh = FeatureHasher(input_type='string', n_features=number_of_features_for_Hasher, non_negative=True)
else:
    fh = FeatureHasher(input_type='string', non_negative=True)  # full number of features
X_t = fh.transform(X)

# Bin the inputs so that the "interaction" terms are more interpretable
X_bin = X_t.copy()
X_bin[X_bin >= 1] = 1

if 1:
    fm_classifier = Classifier(X_t.shape[1], k=2, X_format="csr")  # original
else:
    # not working
    reg = regularizers.L2(0, 0, .01)
    fm_classifier = Classifier(X_t.shape[1], k=2, X_format="csr", regularizer=reg)

X_train, X_test, y_train, y_test = train_test_split(X_bin, target, test_size=0.33, random_state=42)

# original as in example.py
type(y_train)
# <class 'numpy.ndarray'>
# original: f.fit(X_train, y_train, verbosity=50, nb_epoch=200)
# original: f.fit(X_train, y_train, verbosity=5, nb_epoch=20)
fm_classifier.fit(X_train, y_train.values, verbosity=5, nb_epoch=50)

q = 4
from sklearn.metrics import accuracy_score

def error_score(y_true, y_pred):
    return 1.0 - accuracy_score(y_true, y_pred)

print()
print('Factorization Machine Error: {}'.format(
    error_score(y_test, fm_classifier.predict(X_test))))
q = 6

from sklearn.metrics import classification_report
print(classification_report(y_test, fm_classifier.predict(X_test)))

q = 6

cm = ConfusionMatrix(y_test.values, fm_classifier.predict(X_test))
print('Confusion Matrix')
print(cm)

y_actu = pd.Series(y_test.values, name='Actual')
y_pred = pd.Series(fm_classifier.predict(X_test), name='Predicted')
df_confusion = pd.crosstab(y_actu, y_pred)
print(' \n \n simple confusion matrix')
print(df_confusion)

cm.print_stats()

# Inspect the learned parameter shapes
q = 8
v = fm_classifier.v.eval()
print('len v =', len(v))
print('len v[0] =', len(v[0]))
w1 = fm_classifier.w1.eval()
print('len w1 = ', len(w1))
w0 = fm_classifier.w0.eval()
print('len w0 = ', len(w0))
q = 7

'''
y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]
cm = ConfusionMatrix(y_actu, y_pred)
cm.print_stats()

q = 8

import pandas as pd
y_actu = pd.Series([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2], name='Actual')
y_pred = pd.Series([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2], name='Predicted')
df_confusion = pd.crosstab(y_actu, y_pred)

q = 7
'''
```

Sandy4321 commented 6 years ago

The formatting above came out wrong, so here are the matrices again:

```
Confusion Matrix
Predicted  False  True   all
Actual
False       2776    10  2786
True         667   281   948
all         3443   291  3734

simple confusion matrix
Predicted     0    1
Actual
0          2776   10
1           667  281
```

Sandy4321 commented 6 years ago

Sorry, the confusion matrix is messed up by this web page's formatting. The main problem is that the recall for class 1 is only 30%: the model predicts 1 when the actual label is 1 for only 281 observations, but predicts 0 when the actual label is 1 for 667 observations.
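To spell out the arithmetic: from the confusion matrix, recall for class 1 is 281 / (281 + 667) = 281 / 948 ≈ 0.30, which matches the classification report. For comparison, always predicting class 0 would already give 2786 / 3734 ≈ 74.6% accuracy on this split, so the reported error of 0.181 (about 81.9% accuracy) largely reflects the class imbalance rather than strong detection of the positive class.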

dstein64 commented 6 years ago

"May you share example for pre weighed targets: how to use weights key in classifier call"

@Sandy4321, example.py now includes an example showing how to use sample weighting.

I'm going to close this ticket. If there are any pending issues in this ticket that remain unaddressed, please open new tickets with concise explanations and short examples that highlight the problems.