Re-generate topics and re-train fraud detection

Philipp-Sc / llm-fraud-detection

Robust semi-supervised spam detection using Rust native NLP pipelines.

Apache License 2.0

Re-generate topics and re-train fraud detection #1

Closed Philipp-Sc closed 1 year ago

Philipp-Sc commented 1 year ago

[x] Re-generate topics and re-train fraud detection with a bigger dataset of governance proposals.

governance_proposal_spam_ham.csv 
---------------
count spam: 172
count ham: 2551

Note: This will be great for reducing false positives, since the model has not yet seen much ham (or spam) data for governance proposals.

Note: Consider reducing the ham dataset by filtering out some of the rejected proposals with a high share of votes against, to make sure likely spam is not trained as ham.
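
A minimal sketch of the ham filtering idea from the note above. The `Proposal` struct, its field names, and the `max_against_ratio` threshold are all hypothetical; this thread does not describe the actual columns of governance_proposal_spam_ham.csv.

```rust
/// One row of the proposal dataset. All field names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct Proposal {
    pub text: String,
    pub is_spam: bool,
    /// Whether the proposal was rejected on-chain.
    pub rejected: bool,
    /// Share of votes cast against the proposal, in [0.0, 1.0].
    pub against_ratio: f64,
}

/// Keeps every spam row. Keeps a ham row unless it was rejected with an
/// against-vote share at or above `max_against_ratio` (assumed threshold).
pub fn filter_suspicious_ham(rows: Vec<Proposal>, max_against_ratio: f64) -> Vec<Proposal> {
    rows.into_iter()
        .filter(|p| p.is_spam || !(p.rejected && p.against_ratio >= max_against_ratio))
        .collect()
}
```

Calling `filter_suspicious_ham(rows, 0.9)` would drop ham rows that were rejected with at least 90% of votes against. A sensible threshold would have to be chosen by inspecting the data.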

Philipp-Sc commented 1 year ago

[ ] add DAO governance proposals ~~first~~

Philipp-Sc commented 1 year ago

[x] Refactor dataset loading: instead of loading a boolean, load the label as f64. That way the float label from governance_proposal_spam_ham.csv can be used.
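
A rough sketch of what loading the label as f64 could look like. The function names, the accepted boolean spellings, and the assumption that the label sits in the last comma-separated column are illustrative only and not taken from the repository.

```rust
/// Parses a label cell into an f64 in [0.0, 1.0].
/// Accepts legacy boolean spellings as well as float values.
pub fn parse_label(raw: &str) -> Option<f64> {
    let cell = raw.trim();
    match cell.to_ascii_lowercase().as_str() {
        "true" | "spam" => Some(1.0),
        "false" | "ham" => Some(0.0),
        // Anything else must be a float inside [0, 1]; NaN and out-of-range values are rejected.
        other => other.parse::<f64>().ok().filter(|v| (0.0..=1.0).contains(v)),
    }
}

/// Splits a line at its last comma into (text, label).
/// Assumption: the label is the final column and contains no comma itself.
pub fn parse_row(line: &str) -> Option<(String, f64)> {
    let (text, label) = line.rsplit_once(',')?;
    Some((text.to_string(), parse_label(label)?))
}
```

With this shape, older boolean-labelled datasets and the float-labelled governance dataset can go through the same loader. A real CSV with quoted fields would need a proper CSV parser instead of `rsplit_once`.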

Philipp-Sc commented 1 year ago

Instead of predicting all topics at once (where the predictions sum to 1), predict binary topic pairs, e.g. ["hot","cold"].

[x] evaluate performance vs previous technique.

The new technique performs better. A potential drawback is that a higher number of topics might increase inference time and make it take too long on CPU-only systems.
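
To illustrate the difference between the two schemes, here is a small sketch. It assumes each topic comes with a raw score (logit); how the project actually produces these scores is not stated in this thread, so the functions below are hypothetical stand-ins rather than the repository's code.

```rust
/// Numerically stable softmax: outputs are non-negative and sum to 1.
pub fn softmax(scores: &[f64]) -> Vec<f64> {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Scores each (topic, counter-topic) pair on its own and returns the
/// probability of the first topic in every pair. Results across pairs
/// are independent and do not have to sum to 1.
pub fn pairwise_scores(pairs: &[(f64, f64)]) -> Vec<f64> {
    pairs.iter().map(|&(a, b)| softmax(&[a, b])[0]).collect()
}
```

When all topics share one softmax, two equally strong topics split the probability mass and neither can exceed 0.5. When each pair is scored separately, both can receive a high score. The cost is one prediction per pair, which matches the inference-time concern raised above.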

Philipp-Sc commented 1 year ago

[x] Consider feature selection to improve inference time (relevant for CPU-only systems).
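
One possible form of feature selection, sketched under assumptions: features are ranked by the absolute Pearson correlation between each feature column and the label, and only the top `k` are kept. The thread does not say which selection method was used, so the criterion and the function names here are illustrative.

```rust
/// Pearson correlation; returns 0.0 when either input has zero variance.
fn pearson(xs: &[f64], ys: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let mx = xs.iter().sum::<f64>() / n;
    let my = ys.iter().sum::<f64>() / n;
    let (mut cov, mut vx, mut vy) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (x, y) in xs.iter().zip(ys) {
        cov += (x - mx) * (y - my);
        vx += (x - mx).powi(2);
        vy += (y - my).powi(2);
    }
    if vx == 0.0 || vy == 0.0 { 0.0 } else { cov / (vx * vy).sqrt() }
}

/// Returns the indices of the `k` feature columns with the highest absolute
/// correlation to the labels, sorted in ascending index order.
/// Assumes every row has the same number of features.
pub fn select_top_k(rows: &[Vec<f64>], labels: &[f64], k: usize) -> Vec<usize> {
    let n_features = rows.first().map_or(0, |r| r.len());
    let mut ranked: Vec<(usize, f64)> = (0..n_features)
        .map(|j| {
            let column: Vec<f64> = rows.iter().map(|r| r[j]).collect();
            (j, pearson(&column, labels).abs())
        })
        .collect();
    // Highest absolute correlation first.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    let mut keep: Vec<usize> = ranked.into_iter().take(k).map(|(j, _)| j).collect();
    keep.sort_unstable();
    keep
}
```

If each feature corresponds to one topic prediction, only the selected topics would need to be computed at inference time, which is where the savings on CPU-only systems would come from.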