This branch changes the order of the preparation pipeline.
Before: clean -> select records -> engineer features -> select features
After: clean -> engineer features -> select records -> scale features -> select features
Some major changes and key takeaways:
the output of feature engineering is a single .csv file which can be shared with other parties
the outputs of select records, scale features, and select features are .h5 files that can be used to train models
the dataset is split by objective and into train/test sets during select records
feature scaling is optional and can be skipped entirely
feature scaling now also supports normalization by class count or line count
year (along with source and class/line count) is kept in the dataframes as metadata, in columns prefixed with metadata_
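
As a rough illustration of how count-based normalization and the metadata_ prefix could interact, here is a minimal sketch. It is not the code from this branch: the function name scale_by_count, the column names (metadata_line_count, n_methods, n_fields), and the use of plain dictionaries in place of dataframe rows are all assumptions made for the example.

```python
# Hypothetical sketch: plain dicts stand in for dataframe rows.
METADATA_PREFIX = "metadata_"


def scale_by_count(rows, count_column):
    """Divide every non-metadata value by the row's count column.

    Columns whose name starts with METADATA_PREFIX are copied unchanged,
    so year, source, and the counts themselves survive scaling.
    """
    scaled_rows = []
    for row in rows:
        count = row[count_column]
        scaled_rows.append({
            name: value if name.startswith(METADATA_PREFIX) else value / count
            for name, value in row.items()
        })
    return scaled_rows


# Example rows with assumed feature and metadata column names.
rows = [
    {"metadata_year": 2019, "metadata_line_count": 100, "n_methods": 10, "n_fields": 5},
    {"metadata_year": 2020, "metadata_line_count": 400, "n_methods": 20, "n_fields": 8},
]

scaled = scale_by_count(rows, "metadata_line_count")

# Feature columns are whatever is left once metadata columns are excluded.
feature_columns = [name for name in scaled[0] if not name.startswith(METADATA_PREFIX)]
```

Keeping the counts in metadata_ columns means the same helper could normalize by class count instead by passing a different count_column, and downstream steps can drop all metadata with a single prefix check.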