Philipp-Sc / llm-fraud-detection

Robust semi-supervised spam detection using Rust native NLP pipelines.
Apache License 2.0

Improve engineered features for even better accuracy #2

Closed Philipp-Sc closed 1 year ago

Philipp-Sc commented 1 year ago

The current engineered 'hard-coded' features are very basic. They provide useful information, but there is room for improvement.

src/build/feature_engineering/mod.rs

Instead of hard-coded conditions, create a Bag of Words vector derived from the training dataset, or augment the existing features with one.

E.g. use a frequency encoding of common words that often occur in spam but not in ham, and vice versa. This results in two vectors that together contain the most important/common words for and against a spam classification.
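A minimal sketch of this idea, not the code in `src/build/feature_engineering/mod.rs`: all function names below (`tokenize`, `document_frequency`, `top_discriminative_words`, `encode`) are hypothetical, the tokenizer is deliberately naive, and ranking words by the difference in relative document frequency is just one plausible way to pick words that are common in one class but not the other.

```rust
use std::collections::HashMap;

/// Lowercase the text and split on non-alphanumeric characters.
/// (Assumption: a real pipeline would likely use a proper tokenizer.)
fn tokenize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_string)
        .collect()
}

/// Count in how many documents each word occurs (document frequency).
fn document_frequency(docs: &[&str]) -> HashMap<String, usize> {
    let mut df = HashMap::new();
    for doc in docs {
        let mut tokens = tokenize(doc);
        tokens.sort();
        tokens.dedup(); // count each word at most once per document
        for t in tokens {
            *df.entry(t).or_insert(0) += 1;
        }
    }
    df
}

/// Pick the `k` words most characteristic of `target` relative to `other`,
/// scored by (share of target docs containing the word) minus
/// (share of other docs containing the word). Ties are broken alphabetically.
fn top_discriminative_words(target: &[&str], other: &[&str], k: usize) -> Vec<String> {
    let df_target = document_frequency(target);
    let df_other = document_frequency(other);
    let n_target = target.len().max(1) as f64;
    let n_other = other.len().max(1) as f64;

    let mut scored: Vec<(String, f64)> = df_target
        .iter()
        .map(|(word, &count)| {
            let other_count = *df_other.get(word).unwrap_or(&0) as f64;
            (word.clone(), count as f64 / n_target - other_count / n_other)
        })
        .filter(|(_, score)| *score > 0.0)
        .collect();

    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    scored.into_iter().take(k).map(|(word, _)| word).collect()
}

/// Encode a text as raw occurrence counts over a fixed vocabulary.
fn encode(text: &str, vocab: &[String]) -> Vec<f64> {
    let tokens = tokenize(text);
    vocab
        .iter()
        .map(|word| tokens.iter().filter(|t| *t == word).count() as f64)
        .collect()
}
```

Calling `top_discriminative_words(&spam, &ham, k)` and `top_discriminative_words(&ham, &spam, k)` on the training texts would give the two vocabularies; `encode` then turns a message into one count vector per vocabulary, which could be appended to the existing engineered features.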

Philipp-Sc commented 1 year ago

Update:

Instead of adding a whole count vector trained on the vocabulary to the engineered 'hard-coded' features:

Benefits: