4paradigm / OpenMLDB

OpenMLDB is an open-source machine learning database that provides a feature platform computing consistent features for training and inference.
https://openmldb.ai
Apache License 2.0

The discrete feature values in the gcformat sample data generated by the OpenMLDB SQL feature extraction script are inconsistent with those calculated by the PICO script #3923

Open yht520100 opened 4 months ago

yht520100 commented 4 months ago

Bug Description

Service Version: 0.9.0

The discrete feature values in the gcformat sample data generated by the OpenMLDB SQL feature extraction script are inconsistent with those calculated by the PICO script.

Expected Behavior

Current incorrect format: label| slot:sign:origin-value

Correct format: label index| slot:sign:origin-value

Relation Case

OpenMLDB SQL Feature Extraction Example:

0| 1:0:1 2:4599670039981440374 3:6365000770384461703 4:0:93.200000
1| 1:0:2 2:5613161932270271752 3:-1384602352766124944 4:0:93.075000
0| 1:0:3 2:4599670039981440374 3:-6239076729344379818 4:0:92.893000

PICO Feature Extraction Example:

0 0| 2:-8773247204422130117:1 3:4042412524814531440 4:6048373541161169225 5:4681710344575317709:0x1.74ccccccccccdp6
1 1| 2:-8773247204422130117:2 3:6142047291687075953 4:1461111459061395210 5:4681710344575317709:0x1.744cccccccccdp6
0 2| 2:-8773247204422130117:3 3:4042412524814531440 4:3353218529862650678 5:4681710344575317709:0x1.73926e978d4fep6
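
To make the difference between the two outputs easier to inspect, here is a minimal Python sketch that splits one line of each sample into its label, optional index, and (slot, sign, value) triples. The function names `parse_value` and `parse_gcformat_line` are hypothetical, and the token layout is only inferred from the sample lines above, not taken from an OpenMLDB or PICO specification.

```python
def parse_value(text):
    """Parse a feature value written as a decimal or as a C-style hex float."""
    # float.fromhex would also accept "93.200000" and read it as hex digits,
    # so it is only used when an explicit 0x prefix is present.
    if text.lower().lstrip("+-").startswith("0x"):
        return float.fromhex(text)
    return float(text)


def parse_gcformat_line(line):
    """Split 'label [index]| slot:sign[:value] ...' into its parts (assumed layout)."""
    head, _, body = line.partition("|")
    head_tokens = head.split()
    label = head_tokens[0]
    # The OpenMLDB sample has no index token before the '|'; the PICO sample does.
    index = head_tokens[1] if len(head_tokens) > 1 else None
    features = []
    for token in body.split():
        parts = token.split(":")
        slot, sign = int(parts[0]), int(parts[1])
        # Tokens with only slot:sign carry no value in the samples.
        value = parse_value(parts[2]) if len(parts) > 2 else None
        features.append((slot, sign, value))
    return {"label": label, "index": index, "features": features}


# First row of each sample, copied from the examples above.
openmldb = parse_gcformat_line(
    "0| 1:0:1 2:4599670039981440374 3:6365000770384461703 4:0:93.200000"
)
pico = parse_gcformat_line(
    "0 0| 2:-8773247204422130117:1 3:4042412524814531440 "
    "4:6048373541161169225 5:4681710344575317709:0x1.74ccccccccccdp6"
)

print("index:", openmldb["index"], "vs", pico["index"])
for a, b in zip(openmldb["features"], pico["features"]):
    print(a, "vs", b)
```

On the first sample row, the values of the two three-part tokens parse to the same numbers in both outputs (the PICO sample just writes them as hex floats). The index field, the slot numbering, and the signs are what differ.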

Steps to Reproduce

  1. data schema:
    id[Int],age[Int],job[String],cons_price_idx[Double],y[Int]
  2. PICO Feature Extraction Script:
    target_y = binary_label(y)
    f_id = continuous(id)
    f_age = discrete(age)
    f_job = discrete(job)
    f_cons_price_idx = continuous(cons_price_idx)
  3. OpenMLDB SQL Feature Extraction Script:
    select gcformat(
       binary_label(bool(y)),
       continuous(id),
       discrete(age),
       discrete(job),
       continuous(cons_price_idx)
    ) as instance from main_table

aceforeverd commented 4 months ago

PICO is an implementation-only reference for OpenMLDB SQL, AFAIK. So if a more authoritative implementation exists, such as a widely used Python ML package, we may still prefer the authoritative one.

I don't know more details. @wyl4pd, could you help check this out?