aksnzhy / xlearn

High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.
https://xlearn-doc.readthedocs.io/en/latest/index.html
Apache License 2.0

Segmentation fault in python FFM predict #50

Open · alexklibisz opened this issue 6 years ago

alexklibisz commented 6 years ago

I've trained a model and saved it to disk. When I try to make predictions, I get a segmentation fault like below:

[ ACTION     ] Load model ...
[------------] Load model from artifacts/xlfmrec/model-ffm-best-val.out
[------------] Loss function: cross-entropy
[------------] Score function: ffm
[------------] Number of Feature: 359966
[------------] Number of K: 8
[------------] Number of field: 14
[------------] Time cost for loading model: 0.12 (sec)
[ ACTION     ] Read Problem ...
[------------] First check if the text file has been already converted to binary format.
[------------] Binary file (artifacts/xlfmrec/feats-tst-batch-1.txt.bin) NOT found. Convert text file to binary file.
[------------] Time cost for reading problem: 0.00 (sec)
[ ACTION     ] Start to predict ...
Segmentation fault

The Python code for loading the model and making the prediction:

import xlearn as xl                       # needed for xl.create_ffm()

ffm_model = xl.create_ffm()               # field-aware factorization machine
ffm_model.setTest('test.txt')             # libffm-format test file
ffm_model.setSigmoid()                    # convert raw scores to probabilities
ffm_model.predict(model_path, 'out.txt')  # model_path points to the saved model

Here are the first few lines of my test file. In total it has 2556765 lines.

0       0:1:1   1:1:1   2:2:1   6:4278:1        3:179:1 4:2044:1        5:6:1   11:15897:1      8:7223:1        9:0:1   13:7:1  12:5:1
0       0:1:1   1:1:1   2:2:1   6:328:1 3:11:1  4:10:1  7:150:1 5:4:1   11:15897:1      8:7223:1        9:0:1   13:7:1  12:5:1
0       0:3:1   2:5:1   6:170162:1      3:1379:1        4:7085:1        7:4239:1        5:5:1   11:9030:1       8:3870:1        9:0:1   13:7:1  12:5:1
0       0:4:1   1:7:1   2:6:1   6:98783:1       3:3009:1        4:9289:1        5:4:1   11:24233:1      8:26:1  9:16:1  10:1:1  13:12:1 12:2:1
0       0:4:1   1:7:1   2:6:1   6:133246:1      3:7370:1        4:828:1 5:63:1  11:24233:1      8:26:1  9:16:1  10:1:1  13:12:1 12:2:1
0       0:4:1   1:7:1   2:6:1   6:57621:1       3:242:1 5:4:1   11:24233:1      8:26:1  9:16:1  10:1:1  13:12:1 12:2:1
0       0:4:1   1:7:1   2:6:1   6:10008:1       3:939:1 4:4144:1        7:2608:1        5:4:1   11:24233:1      8:26:1  9:16:1  10:1:1  13:12:1 12:2:1
0       0:1:1   1:1:1   2:1:1   6:6011:1        3:1080:1        4:389:1 7:777:1 5:6:1   11:24233:1      8:26:1  9:16:1  10:1:1  13:12:1 12:2:1
0       0:4:1   1:7:1   2:6:1   6:4152:1        3:335:1 4:1982:1        7:1224:1        5:4:1   11:24233:1      8:26:1  9:16:1  10:1:1  13:12:1 12:2:1
0       0:1:1   1:1:1   2:2:1   6:5740:1        3:143:1 5:4:1   11:4776:1       8:1965:1        9:0:1   13:2:1  12:2:1
alexklibisz commented 6 years ago

An update on the problem: there are certain feature values that occur in the test set but not in the training set, a.k.a. "cold-start" or "out-of-vocabulary" features.

If I leave these out of the test set then the prediction works fine. One way to validate this is to just run a prediction on the training set itself.

This is understandable, but perhaps there should be a warning printed or an option added for skipping cold-start values.
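
A minimal preprocessing sketch of the workaround described above, assuming the libffm-style field:feature:value text format shown in the sample lines; the paths train.txt, test.txt, and test_filtered.txt and both helper functions are placeholders for illustration, not part of xLearn's API.

# Hypothetical workaround: drop field:feature:value tokens from the test file
# whose feature index never appears in the training file, so the predictor
# only ever sees indices it learned during training.

def seen_features(train_path):
    """Collect every feature index that occurs in the training file."""
    seen = set()
    with open(train_path) as f:
        for line in f:
            for token in line.split()[1:]:            # skip the label
                field, feature, value = token.split(":")
                seen.add(int(feature))
    return seen

def filter_test_file(test_path, out_path, seen):
    """Rewrite the test file, keeping only features seen in training."""
    with open(test_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            parts = line.split()
            if not parts:
                continue
            kept = [t for t in parts[1:] if int(t.split(":")[1]) in seen]
            fout.write(" ".join([parts[0]] + kept) + "\n")

seen = seen_features("train.txt")
filter_test_file("test.txt", "test_filtered.txt", seen)

Whether to drop only the unseen tokens or the whole row is a design choice; dropping tokens keeps every test example but scores it on partial information.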

aksnzhy commented 6 years ago

You are right. Currently, xLearn will crash if a feature in the test data doesn't exist in the training data. We will fix this error by pre-reading the test data before model initialization.

cmutrufflepig commented 5 years ago

Why is it that when I stay in the Python shell after fitting, model.setTest and model.predict work, but when I save the model and try to run it in a new shell I get a segmentation fault?

Thank you in advance, wonderful work.

tuzhe0210 commented 5 years ago

Figuring out how to handle unseen features properly is the right thing to do; simply adding the features from the test data would be deceiving ourselves.

mohamed-ali commented 5 years ago

@aksnzhy Thanks for the library.

Is this fixed? It's actually a blocker, because we cannot use this tool to predict on data from an environment where features are generated daily by user interactions. For example, imagine you train a model on one month's worth of data that contains a feature like user-agent. If that month of data hypothetically contains only "chrome" and "Firefox", and the data from the day we want to predict on includes a user with a "safari" user-agent, then it gives a segmentation fault, blocking the whole system in production!

I think a fix would be to multiply unseen features by 0 at prediction time, thus keeping the prediction score based on the seen features without crashing the whole system.

What do you think? Do you have another workaround for this?
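
A toy illustration of the proposal above, not xLearn's internal code: if an unseen feature's weight is treated as zero, skipping it and multiplying it by zero give the same score, so the prediction degrades gracefully instead of reading out of bounds. The linear scorer and its weights below are made up for the example.

# Toy linear scorer: `weights` has one entry per feature learned at training time.
# A feature index outside that range (i.e. unseen) simply contributes 0, which is
# equivalent to "multiplying the unseen feature by 0".
def safe_score(weights, sample):
    score = 0.0
    for feature_idx, value in sample:
        if feature_idx < len(weights):    # feature was seen during training
            score += weights[feature_idx] * value
        # else: unseen feature, contributes nothing
    return score

weights = [0.4, -0.2, 0.1]                # model trained on 3 features
sample = [(0, 1.0), (2, 1.0), (7, 1.0)]   # feature 7 was never seen
print(safe_score(weights, sample))        # 0.5, no crash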

aksnzhy commented 5 years ago

@mohamed-ali Thanks for your suggestion. I think the easiest way to fix this issue is to give the unseen feature a 0 index at prediction time. After that, I will update xLearn to support online learning, which can solve this issue better. I will fix this issue in the coming days.

mohamed-ali commented 5 years ago

@aksnzhy Okay, thanks! Looking forward to the fixed-version release :).

mohamed-ali commented 5 years ago

@aksnzhy I would like a small clarification, if you don't mind. You said that a possible (easy) workaround would be to assign a 0 index to unseen features. By "index", are you referring to the field index (in the ffm format), the feature index, or something else?

The reasoning behind my question: if this problem can be avoided by a simple manipulation of the input data (assigning a 0 index to unseen features during encoding), then I can start doing that in my data-preparation pipeline while waiting for the full fix.

Thanks for clarifying that point.
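
For the data-preparation side, a hedged sketch of what "assigning a 0 index to unseen features" could look like during encoding; the CategoryEncoder class and the reserved index are assumptions for illustration, not part of xLearn.

# Hypothetical encoder: categories seen at training time get indices 1..N,
# and anything unseen at prediction time falls back to the reserved index 0.
class CategoryEncoder:
    UNSEEN = 0

    def __init__(self):
        self.index = {}

    def fit(self, values):
        for v in values:
            if v not in self.index:
                self.index[v] = len(self.index) + 1   # index 0 is reserved
        return self

    def transform(self, value):
        return self.index.get(value, self.UNSEEN)

enc = CategoryEncoder().fit(["chrome", "Firefox"])
print(enc.transform("chrome"))   # 1
print(enc.transform("safari"))   # 0  (unseen user-agent maps to the reserved index)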

aksnzhy commented 5 years ago

@mohamed-ali Hi, I solved this issue by simply ignoring unseen features in the prediction task. You can give it a try. Thanks!

mohamed-ali commented 5 years ago

@aksnzhy Thank you! That's great news! Should I re-build from source to get the fix, or is it sufficient to upgrade the Python package?

Thank you.

aksnzhy commented 5 years ago

@mohamed-ali Yes, you need to re-build it from the latest source code.