CosmicHorrorDev / rust_text_classifier

[WIP] Determine if text is about the Rust game or Rust programming language
Apache License 2.0

Try out different tokenizers? #6

Open CosmicHorrorDev opened 3 years ago

CosmicHorrorDev commented 3 years ago

The current tokenizer is largely unaware of the structure of the text. Situations to improve upon:

tokenizing links

Something like http://www.google.com/useless/junk gets transformed into [http, www, google, com, useless, junk], when I think it would be better to just be [http://www.google.com], since talking about just http and https is common in Rust posts, and some links like docs or github.com are very good indicators

It would also allow actually using links like v.reddit.com and i.reddit.com, which lose a lot of structure when tokenized
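
A rough sketch of what that could look like (illustrative only, not the classifier's actual tokenizer): pull URLs out before the normal word split so each one survives as a single token.

```python
import re

# Illustrative sketch: treat any URL as a single token instead of letting the
# word splitter shred it into [http, www, google, com, useless, junk].
URL_RE = re.compile(r"https?://\S+")
WORD_RE = re.compile(r"[A-Za-z0-9']+")

def tokenize(text: str) -> list[str]:
    tokens = []
    last = 0
    for m in URL_RE.finditer(text):
        # Tokenize the plain text before the URL as usual
        tokens.extend(w.lower() for w in WORD_RE.findall(text[last:m.start()]))
        # Keep the URL itself intact as one token (it could also be trimmed
        # down to scheme + host, as discussed above)
        tokens.append(m.group())
        last = m.end()
    tokens.extend(w.lower() for w in WORD_RE.findall(text[last:]))
    return tokens

print(tokenize("See http://www.google.com/useless/junk for details"))
# ['see', 'http://www.google.com/useless/junk', 'for', 'details']
```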

tokenizing code

Code blocks are very common in r/rust posts, yet much of the syntax is ignored when tokenizing. It would be good to either recognize the code block and retain certain information that is normally stripped out (like :: or ->, for example), or it would likely be enough to just recognize a code block and classify whether it's Rust (since that would be very unlikely to show up for the Rust game)
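
For the second option, something as simple as checking extracted code blocks for a couple of Rust-specific markers might already go a long way (a sketch with a made-up hint list, not the project's code):

```python
# Illustrative sketch: once a code block has been pulled out of a post, look
# for a few Rust-specific markers rather than throwing them away.
RUST_HINTS = ("fn ", "::", "->", "let mut", "impl ", "#[derive")

def looks_like_rust(code: str) -> bool:
    # Require at least two distinct hints to avoid accidental matches
    return sum(hint in code for hint in RUST_HINTS) >= 2

print(looks_like_rust("fn main() {\n    let nums: Vec<i32> = Vec::new();\n}"))  # True
print(looks_like_rust("crafting a metal door for the base"))                    # False
```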

tokenizing reddit-specific things

This is primarily for patterns like /u/username or /r/subreddit
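
These could be kept intact with a dedicated pattern before the word split (a sketch; the username and subreddit below are just placeholders):

```python
import re

# Illustrative sketch: keep /u/username and /r/subreddit mentions as single
# tokens instead of splitting them into ['u', 'username'] / ['r', 'subreddit'].
REDDIT_RE = re.compile(r"/?(?:u|r)/[A-Za-z0-9_]+")

def extract_reddit_tokens(text: str) -> list[str]:
    return [t.lower() for t in REDDIT_RE.findall(text)]

print(extract_reddit_tokens("Asked over on /r/playrust by /u/some_player"))
# ['/r/playrust', '/u/some_player']
```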

CosmicHorrorDev commented 3 years ago

Following along with this, another issue has been tokenizing the singular and plural forms of words separately, where one variant is used much more often than the other. This means that very good keywords to focus on, like Rustacean and IDEs, are not picked up since the other variant is the more common one. The naive approach of just removing a trailing s won't always work since English has its special cases, but it should be good enough for starting off at least, and I can't think of any obvious conflicts it would cause
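
A sketch of that naive approach (illustrative only), which merges plural variants into the singular by dropping a trailing s:

```python
# Illustrative sketch of the naive approach: strip a trailing "s" so that
# singular and plural variants end up as the same token. English special cases
# (e.g. "glasses" -> "glasse") slip through, but it's a starting point.
def normalize_plural(token: str) -> str:
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

for word in ("rustaceans", "ides", "compilers", "boss"):
    print(word, "->", normalize_plural(word))
# rustaceans -> rustacean
# ides -> ide
# compilers -> compiler
# boss -> boss
```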

CosmicHorrorDev commented 3 years ago

Progress is being made on trying out different tokenizer methods (in a hacky way for now). Currently, simplifying URLs seems to have a good effect. The current method strips off everything other than the netloc to avoid polluting the keywords with a bunch of junk that can be included in the path.
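
For reference, a minimal sketch of that kind of simplification using Python's urllib.parse (names and URLs here are illustrative, not the project's actual code):

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")

def simplify_urls(text: str) -> str:
    """Replace each URL with just its scheme + netloc so path junk never
    reaches the keyword list."""
    def keep_netloc(match: re.Match) -> str:
        parts = urlparse(match.group())
        return f"{parts.scheme}://{parts.netloc}"
    return URL_RE.sub(keep_netloc, text)

print(simplify_urls("see http://www.google.com/useless/junk and https://docs.rs/serde/latest/serde"))
# see http://www.google.com and https://docs.rs
```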

This simple change seems to greatly reduce the number of false positives while also lowering the number of true positives. Here are the results of 1,000 runs with the current corpus:

Old

Threshold: 0.5
           correct    incorrect  ignored
lang       97.61%     2.39%      0.00%
game       96.64%     3.36%      0.00%
Threshold: 0.55
           correct    incorrect  ignored
lang       95.67%     1.24%      3.09%
game       93.83%     1.84%      4.33%
Threshold: 0.6
           correct    incorrect  ignored
lang       93.10%     0.74%      6.15%
game       89.46%     0.97%      9.57%
Threshold: 0.65
           correct    incorrect  ignored
lang       88.61%     0.47%      10.92%
game       83.34%     0.47%      16.19%
Threshold: 0.7
           correct    incorrect  ignored
lang       82.32%     0.31%      17.37%
game       74.29%     0.25%      25.47%
Threshold: 0.75
           correct    incorrect  ignored
lang       74.28%     0.14%      25.58%
game       66.28%     0.12%      33.60%
Threshold: 0.8
           correct    incorrect  ignored
lang       66.01%     0.08%      33.91%
game       57.22%     0.07%      42.71%
Threshold: 0.85
           correct    incorrect  ignored
lang       57.47%     0.04%      42.49%
game       47.09%     0.04%      52.86%
Threshold: 0.9
           correct    incorrect  ignored
lang       48.39%     0.01%      51.60%
game       37.36%     0.02%      62.61%
Threshold: 0.95
           correct    incorrect  ignored
lang       39.32%     0.00%      60.68%
game       28.74%     0.01%      71.25%
Threshold: 1.0
           correct    incorrect  ignored
lang       31.91%     0.00%      68.09%
game       20.95%     0.00%      79.05%

New

Threshold: 0.5
           correct    incorrect  ignored
lang       97.69%     2.31%      0.00%
game       96.48%     3.52%      0.00%
Threshold: 0.55
           correct    incorrect  ignored
lang       95.50%     1.42%      3.07%
game       93.50%     1.89%      4.61%
Threshold: 0.6
           correct    incorrect  ignored
lang       92.03%     0.90%      7.08%
game       88.94%     1.10%      9.96%
Threshold: 0.65
           correct    incorrect  ignored
lang       87.03%     0.48%      12.49%
game       81.52%     0.54%      17.94%
Threshold: 0.7
           correct    incorrect  ignored
lang       80.15%     0.24%      19.61%
game       72.56%     0.28%      27.17%
Threshold: 0.75
           correct    incorrect  ignored
lang       71.72%     0.08%      28.20%
game       63.71%     0.14%      36.15%
Threshold: 0.8
           correct    incorrect  ignored
lang       62.26%     0.05%      37.69%
game       54.48%     0.08%      45.44%
Threshold: 0.85
           correct    incorrect  ignored
lang       52.47%     0.01%      47.53%
game       44.50%     0.05%      55.44%
Threshold: 0.9
           correct    incorrect  ignored
lang       43.02%     0.00%      56.98%
game       35.14%     0.03%      64.82%
Threshold: 0.95
           correct    incorrect  ignored
lang       34.05%     0.00%      65.95%
game       26.99%     0.01%      73.00%
Threshold: 1.0
           correct    incorrect  ignored
lang       26.14%     0.00%      73.86%
game       19.61%     0.00%      80.39%

Analysis

So for the currently used threshold (0.7), the number of true positives is reduced by ~3% while the number of false positives is reduced by ~23%
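
(Reading those off the lang row at threshold 0.7 in the tables above, assuming that's the row these figures come from: correct drops from 82.32% to 80.15%, a relative drop of (82.32 - 80.15) / 82.32 ≈ 2.6%, while incorrect drops from 0.31% to 0.24%, a relative reduction of (0.31 - 0.24) / 0.31 ≈ 22.6%.)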