Instead of relying only on hard-coded conditions, create (or augment the existing features with) a bag-of-words vector derived from the training dataset.
For example, use a frequency encoding of common words that occur often in spam but rarely in ham, and vice versa.
This results in two vectors that together contain the most important/common words for and against a spam classification.
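
A minimal sketch of how this could look, using only the standard library. All function names here (`tokenize`, `document_frequency`, `top_words`, `encode`) are hypothetical and not taken from the existing module. The tokenizer is deliberately simplistic, and scoring words by the difference in document rate between the two classes is just one possible choice, not necessarily the one the project should adopt.

```rust
use std::collections::{HashMap, HashSet};

/// Lowercase the text and split on non-alphanumeric characters.
/// (Assumption: a simplistic tokenizer, good enough for a sketch.)
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Count in how many documents each word occurs at least once.
fn document_frequency(docs: &[&str]) -> HashMap<String, usize> {
    let mut df = HashMap::new();
    for doc in docs {
        let unique: HashSet<String> = tokenize(doc).into_iter().collect();
        for word in unique {
            *df.entry(word).or_insert(0) += 1;
        }
    }
    df
}

/// Select up to `k` words that occur in a larger share of `target`
/// documents than of `other` documents, best first.
fn top_words(target: &[&str], other: &[&str], k: usize) -> Vec<String> {
    let target_df = document_frequency(target);
    let other_df = document_frequency(other);
    let mut scored: Vec<(String, f64)> = target_df
        .iter()
        .map(|(word, &n)| {
            let target_rate = n as f64 / target.len() as f64;
            let other_rate =
                other_df.get(word).copied().unwrap_or(0) as f64 / other.len().max(1) as f64;
            (word.clone(), target_rate - other_rate)
        })
        .filter(|(_, score)| *score > 0.0)
        .collect();
    // Highest score first; ties are broken alphabetically so the
    // resulting vocabulary is deterministic.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    scored.into_iter().take(k).map(|(word, _)| word).collect()
}

/// Frequency-encode a message: one count per vocabulary word.
fn encode(text: &str, vocab: &[String]) -> Vec<f32> {
    let tokens = tokenize(text);
    vocab
        .iter()
        .map(|word| tokens.iter().filter(|t| *t == word).count() as f32)
        .collect()
}
```

Calling `top_words(&spam, &ham, k)` and `top_words(&ham, &spam, k)` on the training messages would give the two vocabularies. Encoding a message against each with `encode` then yields the two vectors described above, which could be appended to the existing hard-coded features.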
The current engineered 'hard-coded' features are very basic. While they provide useful information, there is room for improvement.
src/build/feature_engineering/mod.rs