4paradigm / OpenMLDB

OpenMLDB is an open-source machine learning database that provides a feature platform computing consistent features for training and inference.
https://openmldb.ai
Apache License 2.0

The gcformat data format obtained from the feature extraction script using OpenMLDB SQL 0.9.0 is incorrect. #3920

Status: Open (opened by yht520100 6 months ago)

yht520100 commented 6 months ago

Bug Description

Service Version: 0.9.0

When performing feature extraction with OpenMLDB SQL, the gcformat sample data is formatted incorrectly in two ways. First, there is no space between the label and the "|" character. Second, the "index" field is missing.

Expected Behavior

Current incorrect format: label| slot:sign:origin-value

Correct format: label index| slot:sign:origin-value
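To make the difference between the two header shapes easy to check on real output, here is a minimal Python sketch. The regular expressions and the `classify_head` helper are my own assumptions derived from the two format strings in this issue (integer label, optional integer index, then `| `); they are not part of OpenMLDB or PICO.

```python
import re

# Header shapes taken from the issue text; the regexes are an assumption,
# not an official grammar for gcformat.
EXPECTED_HEAD = re.compile(r"^-?\d+ \d+\| ")  # "label index| "
CURRENT_HEAD = re.compile(r"^-?\d+\| ")       # "label| "


def classify_head(line: str) -> str:
    """Report which header shape a gcformat sample line starts with."""
    if EXPECTED_HEAD.match(line):
        return "expected"
    if CURRENT_HEAD.match(line):
        return "missing-index"
    return "unknown"


if __name__ == "__main__":
    for sample in ("0| 1:0:1 2:4599670039981440374",
                   "0 0| 2:-8773247204422130117:1"):
        print(classify_head(sample), "<-", sample)
```

Running this over an exported sample file should flag every line as "missing-index" if the bug described here is present.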

Relation Case

OpenMLDB SQL Feature Extraction Example:

0| 1:0:1 2:4599670039981440374 3:6365000770384461703 4:0:93.200000
1| 1:0:2 2:5613161932270271752 3:-1384602352766124944 4:0:93.075000
0| 1:0:3 2:4599670039981440374 3:-6239076729344379818 4:0:92.893000

PICO Feature Extraction Example:

0 0| 2:-8773247204422130117:1 3:4042412524814531440 4:6048373541161169225 5:4681710344575317709:0x1.74ccccccccccdp6
1 1| 2:-8773247204422130117:2 3:6142047291687075953 4:1461111459061395210 5:4681710344575317709:0x1.744cccccccccdp6
0 2| 2:-8773247204422130117:3 3:4042412524814531440 4:3353218529862650678 5:4681710344575317709:0x1.73926e978d4fep6
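For comparing the two outputs field by field, a small parser can help. The sketch below is a hypothetical helper (`parse_instance` is not an OpenMLDB or PICO function) that assumes each feature token is `slot:sign` or `slot:sign:value`, as the samples above suggest. Note that the PICO sample appears to print the continuous value as a hexadecimal float (`0x1.74ccccccccccdp6`), which Python can decode with `float.fromhex`, so the helper accepts both notations.

```python
def parse_instance(line: str):
    """Split one gcformat sample line into (label, index, features).

    index is None when the header has only a label (the shape reported
    in this issue). Each feature is a (slot, sign, value) tuple, where
    value is None for tokens that carry only slot:sign.
    """
    head, _, body = line.partition("|")
    head_fields = head.split()
    label = int(head_fields[0])
    index = int(head_fields[1]) if len(head_fields) > 1 else None

    features = []
    for token in body.split():
        parts = token.split(":")
        slot, sign = int(parts[0]), int(parts[1])
        value = None
        if len(parts) == 3:
            raw = parts[2]
            # Assumption: values are either decimal or C-style hex floats.
            is_hex = raw.lower().lstrip("+-").startswith("0x")
            value = float.fromhex(raw) if is_hex else float(raw)
        features.append((slot, sign, value))
    return label, index, features


if __name__ == "__main__":
    print(parse_instance("0| 1:0:1 2:4599670039981440374 4:0:93.200000"))
    print(parse_instance(
        "0 0| 2:-8773247204422130117:1 "
        "5:4681710344575317709:0x1.74ccccccccccdp6"))
```

Decoding the hex floats this way makes it easier to confirm that both tools computed the same underlying value for `cons_price_idx`, independent of the header problem.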

Steps to Reproduce

  1. data schema:
    id[Int],age[Int],job[String],cons_price_idx[Double],y[Int]
  2. PICO Feature Extraction Script:
    target_y = binary_label(y)
    f_id = continuous(id)
    f_age = discrete(age)
    f_job = discrete(job)
    f_cons_price_idx = continuous(cons_price_idx)
  3. OpenMLDB SQL Feature Extraction Script:
    select gcformat(
       binary_label(bool(y)),
       continuous(id),
       discrete(age),
       discrete(job),
       continuous(cons_price_idx)
    ) as instance from main_table
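Until the output is fixed, a possible stopgap is to add the index in a post-processing step. The sketch below is only an assumption about what "index" means: it treats it as a 0-based row counter, which matches the PICO sample above (0, 1, 2) but is not confirmed anywhere in this issue. The `add_row_index` helper is hypothetical, and it only patches the header; it does not change the feature tokens.

```python
def add_row_index(lines, start=0):
    """Yield gcformat lines with a row index inserted before '|'.

    Assumption: the missing index is a running row number starting at
    `start`. Lines that have no '|' or already carry two header fields
    are passed through unchanged.
    """
    for i, line in enumerate(lines, start):
        head, sep, body = line.partition("|")
        if not sep or len(head.split()) != 1:
            yield line
        else:
            yield f"{head.strip()} {i}|{body}"


if __name__ == "__main__":
    exported = [
        "0| 1:0:1 2:4599670039981440374",
        "1| 1:0:2 2:5613161932270271752",
    ]
    for fixed in add_row_index(exported):
        print(fixed)
```

If the trainer expects the index to carry some other meaning (for example a stable sample ID), this workaround would not be appropriate.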