aksnzhy / xlearn

High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.
https://xlearn-doc.readthedocs.io/en/latest/index.html
Apache License 2.0

FFM data format question #155

Open cowry5 opened 5 years ago

cowry5 commented 5 years ago

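Hello, I ran into a problem converting my data to the libffm format. I first read the official libffm documentation, which describes the format as follows: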

Data Format
===========

The data format of LIBFFM is:

<label> <field1>:<feature1>:<value1> <field2>:<feature2>:<value2> ...

`field' and `feature' should be non-negative integers. See an example `bigdata.tr.txt.'

It is important to understand the difference between `field' and `feature'. For example, if we have a raw data like this:

Click  Advertiser  Publisher
=====  ==========  =========
    0        Nike        CNN
    1        ESPN        BBC

Here, we have 

    * 2 fields: Advertiser and Publisher

    * 4 features: Advertiser-Nike, Advertiser-ESPN, Publisher-CNN, Publisher-BBC

Usually you will need to build two dictionaries, one for field and one for features, like this:

    DictField[Advertiser] -> 0
    DictField[Publisher]  -> 1

    DictFeature[Advertiser-Nike] -> 0
    DictFeature[Publisher-CNN]   -> 1
    DictFeature[Advertiser-ESPN] -> 2
    DictFeature[Publisher-BBC]   -> 3

Then, you can generate FFM format data:

    0 0:0:1 1:1:1
    1 0:2:1 1:3:1

Note that because these features are categorical, the values here are all ones.

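As you can see, the features are numbered row by row, in order of appearance; the features of one field are not numbered together. Online I more often see a different scheme, in which all features of one field are numbered first, followed by those of the next field. For example: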

DictFeature[Advertiser-Nike]   -> 0
DictFeature[Advertiser-ESPN]   -> 1
DictFeature[Publisher-CNN]     -> 2
DictFeature[Publisher-BBC]     -> 3

Then, you can generate FFM format data:

    0 0:0:1 1:2:1
    1 0:1:1 1:3:1

Which of these formats does xlearn support? Thanks.

aksnzhy commented 5 years ago

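xLearn stores data in the field:feature:value format. How the features are numbered has no effect on the result of the learning algorithm: a feature index is only an ID, so whether it is 1 or 2 makes no difference. Many people even hash their features randomly, in which case the resulting IDs are completely arbitrary.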
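To make the two numbering schemes concrete, here is a minimal Python sketch that reproduces both encodings of the Advertiser/Publisher example from the question, plus a hashed variant. The helper names (number_by_first_seen, number_field_by_field, to_ffm_lines, hashed_feature_id) and the bucket count are illustrative assumptions, not part of xLearn or libffm.

```python
import zlib


def number_by_first_seen(rows, fields):
    """Assign feature ids in order of first appearance, row by row."""
    ids = {}
    for row in rows:
        for field in fields:
            ids.setdefault((field, row[field]), len(ids))
    return ids


def number_field_by_field(rows, fields):
    """Assign feature ids one whole field at a time."""
    ids = {}
    for field in fields:
        for row in rows:
            ids.setdefault((field, row[field]), len(ids))
    return ids


def hashed_feature_id(field, value, num_buckets=2**20):
    """Hypothetical hashed id: deterministic, but collisions are possible."""
    return zlib.crc32(f"{field}-{value}".encode("utf-8")) % num_buckets


def to_ffm_lines(rows, fields, label_key, feature_ids):
    """Render rows as '<label> <field>:<feature>:<value> ...' lines."""
    lines = []
    for row in rows:
        parts = [str(row[label_key])]
        for field_id, field in enumerate(fields):
            # Categorical features always carry the value 1.
            parts.append(f"{field_id}:{feature_ids[(field, row[field])]}:1")
        lines.append(" ".join(parts))
    return lines


rows = [
    {"Click": 0, "Advertiser": "Nike", "Publisher": "CNN"},
    {"Click": 1, "Advertiser": "ESPN", "Publisher": "BBC"},
]
fields = ["Advertiser", "Publisher"]

by_row = to_ffm_lines(rows, fields, "Click", number_by_first_seen(rows, fields))
by_field = to_ffm_lines(rows, fields, "Click", number_field_by_field(rows, fields))
print(by_row)    # ['0 0:0:1 1:1:1', '1 0:2:1 1:3:1']
print(by_field)  # ['0 0:0:1 1:2:1', '1 0:1:1 1:3:1']
```

Both outputs describe the same four features under a different one-to-one relabeling. The only requirement is that the same mapping is used for the training, validation, and prediction data.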

cowry5 commented 5 years ago

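Thanks for the reply, I understand that part now. One more question: for a numerical feature, I put it in its own field with a single feature, and that feature only needs an ID different from the features in all other fields. Is that right?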

aksnzhy commented 5 years ago

@cowry5 Yes, that's right. There are also cases, though, where a single field contains several numerical features.
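As a rough illustration of the numerical case, the sketch below encodes one categorical feature and one numerical feature in the same line. The Price column, the dictionaries, and the encode_line helper are hypothetical and only serve the example; the numerical feature keeps a fixed ID and carries its actual value instead of 1.

```python
def encode_line(label, entries, field_ids, feature_ids):
    """Build one '<label> <field>:<feature>:<value> ...' line.

    entries is a list of (field name, feature name, value) triples.
    """
    parts = [str(label)]
    for field, feature, value in entries:
        parts.append(f"{field_ids[field]}:{feature_ids[feature]}:{value:g}")
    return " ".join(parts)


# Hypothetical dictionaries: Price is a numerical field with a single feature.
field_ids = {"Advertiser": 0, "Price": 1}
feature_ids = {"Advertiser-Nike": 0, "Advertiser-ESPN": 1, "Price": 2}

line = encode_line(
    0,
    [("Advertiser", "Advertiser-Nike", 1), ("Price", "Price", 0.5)],
    field_ids,
    feature_ids,
)
print(line)  # 0 0:0:1 1:2:0.5
```

A field with several numerical features works the same way: each feature gets its own ID in feature_ids while sharing one entry in field_ids.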

cowry5 commented 5 years ago

@aksnzhy Thank you very much for the reply. How should I handle a feature whose value contains multiple elements (a vector)? For example:

Click  Advertiser  Publisher
=====  ==========  =========
    0   Nike, Adi        CNN
    1   ESPN, Adi        BBC

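Also, I ran FFM locally and the log loss on the validation set was quite good, but performance online was much worse. My parameters are below; what do you think might have gone wrong? The training set has about 1 million samples with 30 dimensions, and the validation set has about 10,000.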

param = {'task':'binary', 'lr':0.001, 'lambda':0.004, 'epoch':35, 
         'k':10, 'init':0.55}

Sorry, I'm new to this, so my questions may be rather basic. Thanks for bearing with me.

aksnzhy commented 5 years ago

A feature that contains multiple elements can simply be split into two samples:

Click  Advertiser  Publisher
=====  ==========  =========
    0        Nike        CNN
    0         Adi        CNN
    1        ESPN        BBC
    1         Adi        BBC
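The split can be sketched in a few lines of Python. The split_multi_valued helper is hypothetical; taking the Cartesian product when more than one field is multi-valued is an assumption of this sketch, since the example above only has one such field.

```python
from itertools import product


def split_multi_valued(row, label_key):
    """Expand a row whose fields may hold lists into one row per combination."""
    fields = [key for key in row if key != label_key]
    # Wrap single values so every field is a list of candidates.
    value_lists = [
        row[field] if isinstance(row[field], list) else [row[field]]
        for field in fields
    ]
    return [
        {label_key: row[label_key], **dict(zip(fields, combo))}
        for combo in product(*value_lists)
    ]


rows = [
    {"Click": 0, "Advertiser": ["Nike", "Adi"], "Publisher": "CNN"},
    {"Click": 1, "Advertiser": ["ESPN", "Adi"], "Publisher": "BBC"},
]
expanded = [new_row for row in rows for new_row in split_multi_valued(row, "Click")]
for new_row in expanded:
    print(new_row)
```

Note that every expanded row repeats the Publisher value, so each Publisher feature now appears in two training samples instead of one.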

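As for the poor online predictions, check whether the distribution of the prediction data differs substantially from that of the training data.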
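One crude way to look for such a difference is to measure how many (field, feature) pairs in the prediction data never occur in the training data. The sketch below assumes both sets are lists of libffm-format lines that start with a label; the function names are hypothetical, and a low rate does not by itself rule out a shift in feature frequencies or label balance.

```python
def parse_features(line):
    """Return the set of (field, feature) pairs in one libffm-format line."""
    tokens = line.split()[1:]  # skip the leading label
    return {tuple(token.split(":")[:2]) for token in tokens}


def unseen_feature_rate(train_lines, predict_lines):
    """Fraction of feature occurrences in predict_lines absent from training."""
    seen = set()
    for line in train_lines:
        seen |= parse_features(line)

    total = 0
    unseen = 0
    for line in predict_lines:
        for pair in parse_features(line):
            total += 1
            if pair not in seen:
                unseen += 1
    return unseen / total if total else 0.0


train = ["0 0:0:1 1:1:1", "1 0:2:1 1:3:1"]
online = ["0 0:0:1 1:4:1"]
print(unseen_feature_rate(train, online))  # 0.5
```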

RochaC commented 5 years ago

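If you split it into two samples, then for the Publisher feature that follows, the gradient update is no longer linear in the learning rate... The two don't seem to be equivalent.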