4paradigm / OpenMLDB

OpenMLDB is an open-source machine learning database that provides a feature platform computing consistent features for training and inference.
https://openmldb.ai
Apache License 2.0
The continuous feature values in the gcformat sample data generated by the OpenMLDB SQL feature extraction script are incorrect #3922

Open yht520100 opened 1 month ago

yht520100 commented 1 month ago

Bug Description

Service Version: 0.9.0

The gcformat sample data generated by the OpenMLDB SQL feature extraction script contains incorrect continuous feature values: in the sample output below, the sign field of every continuous feature is 0 (for example, 1:0:1 and 4:0:93.200000).

Expected Behavior

Current incorrect format: label| slot:sign:origin-value
Correct format: label index| slot:sign:origin-value

Relation Case

OpenMLDB SQL Feature Extraction Example:

0| 1:0:1 2:4599670039981440374 3:6365000770384461703 4:0:93.200000
1| 1:0:2 2:5613161932270271752 3:-1384602352766124944 4:0:93.075000
0| 1:0:3 2:4599670039981440374 3:-6239076729344379818 4:0:92.893000

PICO Feature Extraction Example:

0 0| 2:-8773247204422130117:1 3:4042412524814531440 4:6048373541161169225 5:4681710344575317709:0x1.74ccccccccccdp6
1 1| 2:-8773247204422130117:2 3:6142047291687075953 4:1461111459061395210 5:4681710344575317709:0x1.744cccccccccdp6
0 2| 2:-8773247204422130117:3 3:4042412524814531440 4:3353218529862650678 5:4681710344575317709:0x1.73926e978d4fep6
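
The two examples print the last continuous value differently: OpenMLDB uses decimal text and PICO uses C-style hexadecimal floats. As a quick check that only the notation differs, the following Python sketch decodes the PICO strings with the standard `float.fromhex`. The variable and function names are illustrative and not part of OpenMLDB or PICO.

```python
# Values copied verbatim from the two examples above (last feature of each row).
pico_values = ["0x1.74ccccccccccdp6", "0x1.744cccccccccdp6", "0x1.73926e978d4fep6"]
openmldb_values = ["93.200000", "93.075000", "92.893000"]


def decode_hex_float(text: str) -> float:
    """Decode a C99-style hexadecimal float string into a Python float."""
    return float.fromhex(text)


decoded = [decode_hex_float(v) for v in pico_values]

for hex_text, dec_text, value in zip(pico_values, openmldb_values, decoded):
    # Print both renderings side by side for manual comparison.
    print(f"{hex_text} -> {value:.6f} (OpenMLDB printed {dec_text})")
```

If the decoded numbers agree with the decimal ones, the remaining differences between the two outputs are the missing index after the label, the slot numbering, and the sign field of the continuous features.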

Steps to Reproduce

  1. data schema:
    id[Int],age[Int],job[String],cons_price_idx[Double],y[Int]
  2. PICO Feature Extraction Script:
    target_y = binary_label(y)
    f_id = continuous(id)
    f_age = discrete(age)
    f_job = discrete(job)
    f_cons_price_idx = continuous(cons_price_idx)
  3. OpenMLDB SQL Feature Extraction Script:
    select gcformat(
       binary_label(bool(y)),
       continuous(id),
       discrete(age),
       discrete(job),
       continuous(cons_price_idx)
    ) as instance from main_table
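
To compare the two outputs mechanically, a small checker can parse each instance line and report the discrepancies described above. This is a minimal sketch, not OpenMLDB or PICO code. It assumes, based only on the samples in this issue, that a token with three colon-separated fields (slot:sign:origin-value) is a continuous feature and a token with two fields (slot:sign) is a discrete one. `Feature`, `parse_instance`, and `find_problems` are hypothetical helper names.

```python
from typing import List, NamedTuple, Optional, Tuple


class Feature(NamedTuple):
    slot: int
    sign: int
    value: Optional[str]  # set only for three-field tokens (assumed continuous)


def parse_instance(line: str) -> Tuple[str, Optional[str], List[Feature]]:
    """Split 'label [index]| slot:sign[:value] ...' into its parts."""
    head, _, body = line.partition("|")
    head_fields = head.split()
    label = head_fields[0]
    index = head_fields[1] if len(head_fields) > 1 else None
    features = []
    for token in body.split():
        parts = token.split(":")
        value = parts[2] if len(parts) == 3 else None
        features.append(Feature(int(parts[0]), int(parts[1]), value))
    return label, index, features


def find_problems(line: str) -> List[str]:
    """Report a missing index and continuous features whose sign is 0."""
    _, index, features = parse_instance(line)
    problems = []
    if index is None:
        problems.append("missing index after label")
    for feature in features:
        if feature.value is not None and feature.sign == 0:
            problems.append(f"slot {feature.slot}: continuous feature has sign 0")
    return problems


# First row of each example in this issue.
openmldb_line = "0| 1:0:1 2:4599670039981440374 3:6365000770384461703 4:0:93.200000"
pico_line = ("0 0| 2:-8773247204422130117:1 3:4042412524814531440 "
             "4:6048373541161169225 5:4681710344575317709:0x1.74ccccccccccdp6")

print(find_problems(openmldb_line))
print(find_problems(pico_line))
```

On the first row of each example, the checker flags the OpenMLDB line for the missing index and for slots 1 and 4, and reports nothing for the PICO line.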