Closed Sandy4321 closed 6 years ago
The formatting in my original post was wrong, so here it is again:
Confusion Matrix

Predicted  False  True   all
Actual
False       2776    10  2786
True         667   281   948
all         3443   291  3734
simple confusion matrix

Predicted     0    1
Actual
0          2776   10
1           667  281
Sorry if the confusion matrix is mangled by this web page's formatting. The main problem is the recall for class 1, which is only about 30%: of the 948 observations that are actually 1, only 281 are predicted as 1, while 667 are predicted as 0.
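As a quick sanity check, the figures quoted in this thread can be recomputed directly from the four confusion matrix counts. This is only a sketch in plain Python; the variable names are mine, not from the script.

```python
# Counts copied from the confusion matrix in this thread
# (rows = actual class, columns = predicted class).
tn, fp = 2776, 10   # actual 0: predicted 0, predicted 1
fn, tp = 667, 281   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
error = 1.0 - accuracy            # same definition as error_score() in the script
precision_1 = tp / (tp + fp)      # of everything predicted 1, how much is really 1
recall_1 = tp / (tp + fn)         # of everything really 1, how much was found

print("total       =", total)
print("error       = {:.4f}".format(error))
print("precision_1 = {:.4f}".format(precision_1))
print("recall_1    = {:.4f}".format(recall_1))
```

The error reproduces the reported "Factorization Machine Error", and the class 1 recall comes out just under 0.30, which is the imbalance symptom described here.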
"May you share example for pre weighed targets: how to use weights key in classifier call"
@Sandy4321, example.py now includes an example showing how to use sample weighting.
I'm going to close this ticket. If there are any pending issues in this ticket that remain unaddressed, please open new tickets with concise explanations and short examples that highlight the problems.
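For readers landing here later, below is a minimal sketch of one common way to build per-sample weights for unbalanced targets: each sample gets the inverse of its class count, so every class ends up with the same total weight. The helper name and the toy labels are hypothetical, and the commented `fit` call with a `sample_weight` keyword is an assumption on my part; check example.py for the actual usage.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Return one weight per sample so that each class has the same total weight."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


# Toy labels with a class imbalance (hypothetical data, not from the issue).
y_train = [0, 0, 0, 0, 0, 0, 1, 1]
sample_weight = balanced_sample_weights(y_train)

for label, weight in zip(y_train, sample_weight):
    print(label, round(weight, 4))

# Assumption: the weights would then be handed to the classifier, e.g.
# fm_classifier.fit(X_train, y_train, sample_weight=sample_weight, nb_epoch=50)
```

With this weighting, each of the two minority samples counts three times as much as a majority sample, which pushes the model away from predicting 0 everywhere.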
Performance on real sparse data (code below), taken from a typical test case, http://srome.github.io/Leveraging-Factorization-Machines-for-Sparse-Data-and-Supervised-Visualization/, shows that preset weights for the targets are needed for unbalanced data. It seems your code has this option, thanks :) Could you share an example for pre-weighted targets, i.e. how to use the weights key in the classifier call?
Performance example

Epoch 45/50 loss: 0.0018767118101623477, min_loss: 0.0018767118101623477
current_loss = 0.00177938226446583
current_loss = 0.0016889192391447638
current_loss = 0.0016045080355204782
current_loss = 0.0015269844995869277
current_loss = 0.001451900011931204
Epoch 50/50 loss: 0.001451900011931204, min_loss: 0.001451900011931204
Factorization Machine Error: 0.18130690948044992

             precision    recall  f1-score   support

avg / total       0.85      0.82      0.78      3734
Confusion Matrix

Predicted  False  True   all
Actual
False       2776    10  2786
True         667   281   948
all         3443   291  3734

simple confusion matrix

Predicted     0    1
Actual
0          2776   10
1           667  281

len v = 2
len v[0] = 1048576
len w1 = 1048576
len w0 = 1
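The printed lengths match the usual factorization machine parameters: a bias w0, one linear weight per feature in w1, and k = 2 rows of latent factors in v. The 1048576 columns equal 2**20, the default `n_features` of `FeatureHasher`. Below is a rough sketch of how such parameters combine into a raw score for one dense input vector. It is plain Python for illustration only, not the library's Theano code, and `fm_score` is a name I made up.

```python
def fm_score(x, w0, w1, v):
    """Raw factorization machine score for one dense feature vector x.

    v holds k rows of n latent factors, matching len(v) == k and
    len(v[0]) == n in the output above.
    """
    linear = sum(w_i * x_i for w_i, x_i in zip(w1, x))
    pairwise = 0.0
    for v_f in v:  # one pass per latent factor
        s = sum(v_fi * x_i for v_fi, x_i in zip(v_f, x))
        s_sq = sum((v_fi * x_i) ** 2 for v_fi, x_i in zip(v_f, x))
        # 0.5 * ((sum)^2 - sum of squares) equals the sum over all pairs i < j
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Tiny hypothetical example: n = 3 features, k = 2 latent factors.
x = [1, 0, 1]
w0 = 0.5
w1 = [0.1, 0.2, 0.3]
v = [[1.0, 2.0, 3.0],
     [0.5, 0.0, -1.0]]
score = fm_score(x, w0, w1, v)
print(score)

# Parameter count for the model in this issue: bias + linear + k * n factors.
n, k = 2 ** 20, 2
n_params = 1 + n + k * n
print(n_params)
```

For classification, the raw score would typically be passed through a sigmoid to get a probability; that step is left out here.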
CODE:
# S_May3_changed_hashing_size_Theano_FM_sparse.py
# C:\Windows\system32>conda install theano <- good
# conda install m2w64-toolchain <- destroys theano
# S_May2_changed_hashing_size_Theano_FM_sparse.py
# S_May2_hashing_FM_sparse.py
# S_apr25_FM_sparse
# https://github.com/dstein64/PyFactorizationMachines
# http://srome.github.io/Leveraging-Factorization-Machines-for-Sparse-Data-and-Supervised-Visualization/
import pickle
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
# original: from PyFactorizationMachines.src.pyfm import FactorizationMachineClassifier
from pyfms import Classifier
from pyfms import regularizers
from sklearn.model_selection import train_test_split
from pandas_ml import ConfusionMatrix
# parameters
flag_0_use_all_hash_features_or_1_set_number_of_features = 0
number_of_features_for_Hasher = 12345  # 1234 # 123 # 12345
if 1:
    from sklearn.datasets import fetch_20newsgroups
    twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
else:
    pass  # alternative branch is not shown in the post

twenty_train.target_names

def define_label(x, target_names):
    # Label is 1 for any newsgroup whose name contains 'comp', else 0
    name = target_names[x]
    return 1 if 'comp' in name else 0

target = pd.Series(twenty_train.target).apply(lambda x: define_label(x, twenty_train.target_names))
target.mean()  # Prevalence of the label
q = 3
def clean_text(text):
    # Basic cleaning
    ...  # function body is missing from the post

X = [clean_text(x) for x in twenty_train.data]
q = 1
# Hash away!
# Note: non_negative was deprecated in scikit-learn 0.19 and later removed;
# newer versions use alternate_sign=False instead.
if flag_0_use_all_hash_features_or_1_set_number_of_features:
    fh = FeatureHasher(input_type='string', n_features=number_of_features_for_Hasher, non_negative=True)
else:
    fh = FeatureHasher(input_type='string', non_negative=True)  # full number of features
X_t = fh.transform(X)

# Bin the inputs so that the "interaction" terms are more interpretable
X_bin = X_t.copy()
X_bin[X_bin >= 1] = 1
if 1:
    fm_classifier = Classifier(X_t.shape[1], k=2, X_format="csr")  # original
else:
    pass  # not working (the alternative is not shown in the post)
X_train, X_test, y_train, y_test = train_test_split(X_bin, target, test_size=0.33, random_state=42)

# original as in example.py
# type(y_train)
# <class 'numpy.ndarray'>
# original f.fit(X_train, y_train, verbosity=50, nb_epoch=200)
# original f.fit(X_train, y_train, verbosity=5, nb_epoch=20)
fm_classifier.fit(X_train, y_train.values, verbosity=5, nb_epoch=50)
q = 4
from sklearn.metrics import accuracy_score

def error_score(y_true, y_pred):
    return 1.0 - accuracy_score(y_true, y_pred)

print()
print('Factorization Machine Error: {}'.format(error_score(y_test, fm_classifier.predict(X_test))))
q = 6

from sklearn.metrics import classification_report
print(classification_report(y_test, fm_classifier.predict(X_test)))
q = 6
cm = ConfusionMatrix(y_test.values, fm_classifier.predict(X_test))
print('Confusion Matrix')
print(cm)

y_actu = pd.Series(y_test.values, name='Actual')
y_pred = pd.Series(fm_classifier.predict(X_test), name='Predicted')
df_confusion = pd.crosstab(y_actu, y_pred)
print(' \n \n simple confusion matrix')
print(df_confusion)

cm.print_stats()
q = 8
# Inspect the learned parameters: latent factors v, linear weights w1, bias w0
v = fm_classifier.v.eval()
print('len v =', len(v))
print('len v[0] =', len(v[0]))
w1 = fm_classifier.w1.eval()
print('len w1 = ', len(w1))
w0 = fm_classifier.w0.eval()
print('len w0 = ', len(w0))
q = 7

'''
y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]
y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2]
cm = ConfusionMatrix(y_actu, y_pred)
cm.print_stats()
q=8

import pandas as pd
y_actu = pd.Series([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2], name='Actual')
y_pred = pd.Series([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2], name='Predicted')
df_confusion = pd.crosstab(y_actu, y_pred)
q=7
'''
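To make the hashing and binning steps in the script a bit more concrete, here is a rough sketch of the hashing trick in plain Python. It is not how `FeatureHasher` is implemented (as far as I know scikit-learn uses a signed MurmurHash3, while this uses md5 purely to stay deterministic with the standard library), and `hash_features` and the sample sentence are made up for illustration.

```python
import hashlib


def hash_features(tokens, n_features=16):
    """Map a list of string tokens to a sparse {column: count} dict."""
    row = {}
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Use the first 4 bytes of the digest as an integer, then fold it
        # into the fixed number of columns.
        col = int.from_bytes(digest[:4], "little") % n_features
        row[col] = row.get(col, 0) + 1
    return row


# Hypothetical document, already split into tokens.
doc = "the gpu driver crashed the system".split()
row = hash_features(doc)

# Binning: every non-zero count becomes 1, the same idea as X_bin[X_bin >= 1] = 1.
binary_row = {col: 1 for col in row}

print(row)
print(binary_row)
```

Different tokens can land in the same column, so a smaller `n_features` means more collisions; that is the trade-off behind the hashing size parameter in the script.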