dstein64 / pyfms

A Theano-based Python implementation of Factorization Machines (Rendle 2010).
https://pypi.org/project/pyfms/
MIT License

Even with a very modest amount of data (hashing to only 123 features), the code gives an error #4

Closed Sandy4321 closed 6 years ago

Sandy4321 commented 6 years ago

Even with a very modest amount of data (hashing is done for only 123 features), the code gives an error. The code is attached in a zip file and also pasted below. Thanks for your help.

    File "E:\Recommender_systems\code\PyFactorizationMachines_May2\PyFactorizationMachines\S_May2_changed_hashing_size_Theano_FM_sparse.py", line 64, in <module>
      f.fit(X_train, y_train, verbosity=50, nb_epoch=200)
    File "E:\Recommender_systems\code\PyFactorizationMachines_May2\PyFactorizationMachines\pyfms\models.py", line 20, in fit
      X_train, y_train, error_function, optimizer, **kwargs)
    File "E:\Recommender_systems\code\PyFactorizationMachines_May2\PyFactorizationMachines\pyfms\core.py", line 207, in fit
      raise ArithmeticError("Non-finite loss function.")

    builtins.ArithmeticError: Non-finite loss function.

The code is:

    # S_May2_changed_hashing_size_Theano_FM_sparse.py
    # S_May2_hashing_FM_sparse.py
    # S_apr25_FM_sparse
    # https://github.com/dstein64/PyFactorizationMachines
    # http://srome.github.io/Leveraging-Factorization-Machines-for-Sparse-Data-and-Supervised-Visualization/

    import pickle
    import pandas as pd
    from sklearn.feature_extraction import FeatureHasher
    # original: from PyFactorizationMachines.src.pyfm import FactorizationMachineClassifier
    from pyfms import Classifier
    from sklearn.model_selection import train_test_split

    # parameters
    number_of_features_for_Hasher = 123  # 12345

    if 1:
        from sklearn.datasets import fetch_20newsgroups
        twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
        filehandler = open(b"twenty_data.pkl", "wb")
        pickle.dump(twenty_train, filehandler)
    else:
        file_name = open("twenty_data.pkl", 'rb')
        twenty_train = pickle.load(file_name)

    twenty_train.target_names

    def define_label(x, target_names):
        name = target_names[x]
        return 1 if 'comp' in name else 0

    target = pd.Series(twenty_train.target).apply(lambda x: define_label(x, twenty_train.target_names))
    target.mean()  # Prevalence of the label

    q = 3

    def clean_text(text):
        # Basic cleaning
        text = text.replace('\n', '').replace('\t', '').replace('<', '').replace('>', '').replace('|', '')
        return [x for x in text.split(' ') if len(x) > 3]

    X = [clean_text(x) for x in twenty_train.data]

    # Hash away!
    fh = FeatureHasher(input_type='string', n_features=number_of_features_for_Hasher, non_negative=True)
    X_t = fh.transform(X)

    # Bin the inputs so that the "interaction" terms are more interpretable
    X_bin = X_t.copy()
    X_bin[X_bin >= 1] = 1

    f = Classifier(X_t.shape[1], k=2, X_format="csr")
    X_train, X_test, y_train, y_test = train_test_split(X_bin, target, test_size=0.33, random_state=42)

    f.fit(X_train, y_train, verbosity=50, nb_epoch=200)

    q = 4

Attachment: S_May2_changed_hashing_size_Theano_FM_sparse.zip

Sandy4321 commented 6 years ago

The start and stop parameters look strange:

    X_train.shape
    (7580, 1234)
    start 7680
    stop 7580
    X_train[0:2]
    <2x1234 sparse matrix of type '<class 'numpy.float64'>' with 179 stored elements in Compressed Sparse Row format>

in this Theano call:

    theano_train(X_train[start:stop], y_train[start:stop], sample_weight[start:stop], epoch)
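For readers following along: with start 7680 and stop 7580, a slice such as X_train[start:stop] selects no rows, because a Python slice whose start exceeds its stop is empty. The sketch below shows one way to compute batch boundaries that never overshoot the data. The helper name `batch_bounds` and the batch size of 128 are assumptions made for illustration only; this is not the code pyfms uses, and the thread does not establish that an empty batch caused the error.

```python
def batch_bounds(n_samples, batch_size):
    """Yield (start, stop) index pairs that cover range(n_samples) without overshoot."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_samples, batch_size):
        # Clamp stop so the last batch is shorter instead of running past the data.
        yield start, min(start + batch_size, n_samples)


rows = list(range(7580))              # stand-in for the 7580 training rows
empty = rows[7680:7580]               # start past stop: nothing is selected
print(len(empty))

bounds = list(batch_bounds(len(rows), 128))   # 128 is an assumed batch size
print(bounds[-1])                     # the final, shorter batch
```

Every pair produced this way satisfies start < stop, so no batch is empty.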

Sandy4321 commented 6 years ago

Good news: the target must be a plain NumPy array. With this change, f.fit(X_train, y_train.values, verbosity=2, nb_epoch=20), the code runs without error for number_of_features_for_Hasher = 1234, but it is very slow on a computer without a GPU.
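A minimal sketch of that conversion, for anyone hitting the same error: turn the labels into a plain NumPy array before calling fit. The helper name `as_numpy_target` is hypothetical and not part of pyfms. The thread does not explain why a pandas Series fails, so this only illustrates the workaround reported above.

```python
import numpy as np


def as_numpy_target(y):
    """Return the labels as a 1-D NumPy array, whatever container they came in."""
    arr = np.asarray(y)  # for a pandas Series this holds the same data as y.values
    if arr.ndim != 1:
        raise ValueError("expected a 1-D sequence of labels, got shape %s" % (arr.shape,))
    return arr


labels = as_numpy_target([1, 0, 0, 1])
print(type(labels), labels.shape)

# Hypothetical use with the script in this thread:
# f.fit(X_train, as_numpy_target(y_train), verbosity=2, nb_epoch=20)
```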

Now the code looks like this:

    # S_May2_changed_hashing_size_Theano_FM_sparse.py
    # S_May2_hashing_FM_sparse.py
    # S_apr25_FM_sparse
    # https://github.com/dstein64/PyFactorizationMachines
    # http://srome.github.io/Leveraging-Factorization-Machines-for-Sparse-Data-and-Supervised-Visualization/

    import pickle
    import pandas as pd
    from sklearn.feature_extraction import FeatureHasher
    # original: from PyFactorizationMachines.src.pyfm import FactorizationMachineClassifier
    from pyfms import Classifier
    from sklearn.model_selection import train_test_split

    # parameters
    number_of_features_for_Hasher = 1234  # 123 # 12345

    if 1:
        from sklearn.datasets import fetch_20newsgroups
        twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
        filehandler = open(b"twenty_data.pkl", "wb")
        pickle.dump(twenty_train, filehandler)
    else:
        file_name = open("twenty_data.pkl", 'rb')
        twenty_train = pickle.load(file_name)

    twenty_train.target_names

    def define_label(x, target_names):
        name = target_names[x]
        return 1 if 'comp' in name else 0

    target = pd.Series(twenty_train.target).apply(lambda x: define_label(x, twenty_train.target_names))
    target.mean()  # Prevalence of the label

    q = 3

    def clean_text(text):
        # Basic cleaning
        text = text.replace('\n', '').replace('\t', '').replace('<', '').replace('>', '').replace('|', '')
        return [x for x in text.split(' ') if len(x) > 3]

    X = [clean_text(x) for x in twenty_train.data]

    # Hash away!
    fh = FeatureHasher(input_type='string', n_features=number_of_features_for_Hasher, non_negative=True)
    X_t = fh.transform(X)

    # Bin the inputs so that the "interaction" terms are more interpretable
    X_bin = X_t.copy()
    X_bin[X_bin >= 1] = 1

    f = Classifier(X_t.shape[1], k=2, X_format="csr")
    X_train, X_test, y_train, y_test = train_test_split(X_bin, target, test_size=0.33, random_state=42)

    # original as in example.py
    # type(y_train)
    # <class 'numpy.ndarray'>
    # original f.fit(X_train, y_train, verbosity=50, nb_epoch=200)
    # original f.fit(X_train, y_train, verbosity=5, nb_epoch=20)
    f.fit(X_train, y_train.values, verbosity=2, nb_epoch=20)

    q = 4

Attachment: S_May2_changed_hashing_size_Theano_FM_sparse.zip

The next step is to test with the full data, as in the example at http://srome.github.io/Leveraging-Factorization-Machines-for-Sparse-Data-and-Supervised-Visualization/

Sandy4321 commented 6 years ago

An interesting link: http://nowave.it/factorization-machines-with-tensorflow.html

dstein64 commented 6 years ago

@Sandy4321, I'm closing this issue, as it sounds like you have found a solution.

Regarding the speed, please see my comment here, regarding the installation of m2w64-toolchain.

If you encounter any future issues, please be sure to provide the simplest possible example that reproduces the issue, providing the example code in the ticket text, as opposed to attaching a zip file with code.

Sandy4321 commented 6 years ago

1. conda install m2w64-toolchain <- this breaks Theano.
2. I uninstalled m2w64-toolchain.
3. I uninstalled Theano.
4. I installed Theano again with C:\Windows\system32>conda install theano (the previous installation was done with C:\Windows\system32>pip install theano). Now everything works, thanks.
5. An open question: how can I check whether I am running on the GPU or on the CPU?

dstein64 commented 6 years ago

"interesting how to check if I run with GPU or with CPU"

Theano provides instructions for configuring a GPU. http://deeplearning.net/software/theano/tutorial/using_gpu.html

Theano options are configured with a .theanorc config file or a THEANO_FLAGS environment variable. Please see the link above.

Sandy4321 commented 6 years ago

From the recommended link I see THEANO_FLAGS='device=cuda,floatX=float32'. Should I put it somewhere in this file? https://github.com/dstein64/PyFactorizationMachines/blob/master/pyfms/optimizers.py

So I added

    import os
    os.environ["THEANO_FLAGS"] = "mode=FAST_RUN,device=gpu,floatX=float32"

per https://stackoverflow.com/questions/33988334/theano-config-directly-in-script, but it gives an error. I then changed it to

    os.environ["THEANO_FLAGS"] = "mode=FAST_RUN,device=cuda,floatX=float32"

which gives an error as well. I use Windows. Do you use Windows?

dstein64 commented 6 years ago

@Sandy4321, I recommend you configure the environment variable before calling your program, or alternatively configure theano by using a .theanorc file. Please see the documentation here for more details. http://deeplearning.net/software/theano/library/config.html
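One possible way to set the variable before the program starts, sketched here in Python so it behaves the same on Windows and elsewhere: launch the script as a child process with THEANO_FLAGS already present in its environment. The helper names `theano_env` and `run_with_flags` and the script name `my_fm_script.py` are hypothetical. Whether device=cuda works still depends on the local Theano and GPU setup, which this sketch does not check.

```python
import os
import subprocess
import sys


def theano_env(flags, base_env=None):
    """Return a copy of the environment with THEANO_FLAGS built from a dict of options."""
    env = dict(os.environ if base_env is None else base_env)
    env["THEANO_FLAGS"] = ",".join("%s=%s" % (key, value) for key, value in flags.items())
    return env


def run_with_flags(script_path, flags):
    """Run a Python script in a child process that sees THEANO_FLAGS from its first line."""
    completed = subprocess.run([sys.executable, script_path], env=theano_env(flags))
    return completed.returncode


example_env = theano_env({"device": "cuda", "floatX": "float32"}, base_env={})
print(example_env["THEANO_FLAGS"])

# Hypothetical use:
# run_with_flags("my_fm_script.py", {"device": "cuda", "floatX": "float32"})
```

Setting os.environ["THEANO_FLAGS"] inside the script can only take effect if it runs before the first import of theano, including the indirect import done by pyfms, which is one reason to prefer setting it outside the program.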

I do not know why you're getting an error. I suspect it's an issue with how you're configuring theano, as opposed to being an issue with PyFactorizationMachines.

Sandy4321 commented 6 years ago

I see, thanks.