CosmicHorrorDev opened 3 years ago
Following along with this, another issue has been tokenizing the singular and plural forms of words separately, where one variant is used much more often than the other. This means that very good keywords to focus on, like Rustacean and IDEs, are not picked up when the other variant is the more common one. The naive approach of just removing a trailing s won't always work since English has its special cases, but it should be good enough to start with, and I can't think of any obvious conflicts it would cause.
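The naive approach above can be sketched as a single-token pass. This is a hypothetical helper (not the project's actual code), with one obvious guard for the "-ss" special cases:

```rust
// Naive singular/plural merging: strip one trailing 's' so that e.g.
// "IDEs"/"IDE" and "Rustaceans"/"Rustacean" collapse to one token.
// Hypothetical sketch, not the project's tokenizer.
fn merge_plural(token: &str) -> String {
    let lower = token.to_lowercase();
    // Leave short tokens and "-ss" words ("class", "less") alone to
    // dodge the most obvious English special cases.
    if lower.len() > 3 && lower.ends_with('s') && !lower.ends_with("ss") {
        lower[..lower.len() - 1].to_string()
    } else {
        lower
    }
}

fn main() {
    assert_eq!(merge_plural("IDEs"), "ide");
    assert_eq!(merge_plural("Rustaceans"), "rustacean");
    assert_eq!(merge_plural("class"), "class"); // "-ss" case untouched
}
```

This still merges false pairs like "new"/"news", but as noted above it seems unlikely to cause conflicts among the keywords that actually matter here.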
Progress is being made on trying out different tokenizer methods (in a hacky way right now). Currently, simplifying URLs seems to have a good effect: the current method strips off everything other than the netloc, to avoid polluting the keywords with the junk that can be included in the path.
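The stripping described above can be sketched with plain string splitting (a hypothetical stand-in for whatever URL handling the project actually uses), keeping only the scheme and host:

```rust
// Sketch of the URL simplification described above: keep only the
// netloc (scheme + host) and drop the path/query/fragment.
// Hypothetical helper; a real implementation would likely use a
// URL-parsing crate instead.
fn simplify_url(url: &str) -> &str {
    match url.find("://") {
        Some(scheme_end) => {
            let host_start = scheme_end + 3;
            // The netloc ends at the first '/', '?' or '#' after the scheme.
            let end = url[host_start..]
                .find(|c: char| c == '/' || c == '?' || c == '#')
                .map(|i| host_start + i)
                .unwrap_or(url.len());
            &url[..end]
        }
        None => url, // not an absolute URL; leave it alone
    }
}

fn main() {
    assert_eq!(
        simplify_url("http://www.google.com/useless/junk"),
        "http://www.google.com"
    );
}
```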
This simple change seems to greatly reduce the number of false positives while also lowering the number of true positives. Here are the results of 1,000 runs with the current corpus:
**Without URL simplification:**

| Threshold | Category | Correct | Incorrect | Ignored |
|-----------|----------|---------|-----------|---------|
| 0.50 | lang | 97.61% | 2.39% | 0.00% |
| 0.50 | game | 96.64% | 3.36% | 0.00% |
| 0.55 | lang | 95.67% | 1.24% | 3.09% |
| 0.55 | game | 93.83% | 1.84% | 4.33% |
| 0.60 | lang | 93.10% | 0.74% | 6.15% |
| 0.60 | game | 89.46% | 0.97% | 9.57% |
| 0.65 | lang | 88.61% | 0.47% | 10.92% |
| 0.65 | game | 83.34% | 0.47% | 16.19% |
| 0.70 | lang | 82.32% | 0.31% | 17.37% |
| 0.70 | game | 74.29% | 0.25% | 25.47% |
| 0.75 | lang | 74.28% | 0.14% | 25.58% |
| 0.75 | game | 66.28% | 0.12% | 33.60% |
| 0.80 | lang | 66.01% | 0.08% | 33.91% |
| 0.80 | game | 57.22% | 0.07% | 42.71% |
| 0.85 | lang | 57.47% | 0.04% | 42.49% |
| 0.85 | game | 47.09% | 0.04% | 52.86% |
| 0.90 | lang | 48.39% | 0.01% | 51.60% |
| 0.90 | game | 37.36% | 0.02% | 62.61% |
| 0.95 | lang | 39.32% | 0.00% | 60.68% |
| 0.95 | game | 28.74% | 0.01% | 71.25% |
| 1.00 | lang | 31.91% | 0.00% | 68.09% |
| 1.00 | game | 20.95% | 0.00% | 79.05% |
**With URL simplification:**

| Threshold | Category | Correct | Incorrect | Ignored |
|-----------|----------|---------|-----------|---------|
| 0.50 | lang | 97.69% | 2.31% | 0.00% |
| 0.50 | game | 96.48% | 3.52% | 0.00% |
| 0.55 | lang | 95.50% | 1.42% | 3.07% |
| 0.55 | game | 93.50% | 1.89% | 4.61% |
| 0.60 | lang | 92.03% | 0.90% | 7.08% |
| 0.60 | game | 88.94% | 1.10% | 9.96% |
| 0.65 | lang | 87.03% | 0.48% | 12.49% |
| 0.65 | game | 81.52% | 0.54% | 17.94% |
| 0.70 | lang | 80.15% | 0.24% | 19.61% |
| 0.70 | game | 72.56% | 0.28% | 27.17% |
| 0.75 | lang | 71.72% | 0.08% | 28.20% |
| 0.75 | game | 63.71% | 0.14% | 36.15% |
| 0.80 | lang | 62.26% | 0.05% | 37.69% |
| 0.80 | game | 54.48% | 0.08% | 45.44% |
| 0.85 | lang | 52.47% | 0.01% | 47.53% |
| 0.85 | game | 44.50% | 0.05% | 55.44% |
| 0.90 | lang | 43.02% | 0.00% | 56.98% |
| 0.90 | game | 35.14% | 0.03% | 64.82% |
| 0.95 | lang | 34.05% | 0.00% | 65.95% |
| 0.95 | game | 26.99% | 0.01% | 73.00% |
| 1.00 | lang | 26.14% | 0.00% | 73.86% |
| 1.00 | game | 19.61% | 0.00% | 80.39% |
So for the currently used threshold (70%), the number of true positives drops by ~3% while the number of false positives drops by ~23%.
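Those figures can be reproduced from the lang row at threshold 0.70, assuming the first table is the run without URL simplification and the second is the run with it:

```rust
// Relative change between the two runs, as a percentage of the first.
fn relative_drop(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

fn main() {
    // lang row at threshold 0.70, taken from the two tables above
    let tp_drop = relative_drop(82.32, 80.15); // true positives
    let fp_drop = relative_drop(0.31, 0.24); // false positives
    println!("TP drop: {tp_drop:.1}%, FP drop: {fp_drop:.1}%");
}
```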
The current tokenizer is pretty unaware of the structure of the text. Situations to improve upon would be:

- **Tokenizing links.** Something like `http://www.google.com/useless/junk` gets transformed into `[http, www, google, com, useless, junk]`, when I think it would be better to just be `[http://www.google.com]`, since talking about just `http` and `https` is common in Rust, and some links like `docs` or `github.com` are very good indicators. It also allows for actually using links like `v.reddit.com` and `i.reddit.com` that lose a lot of structure when tokenized.
- **Tokenizing code.** Code blocks are very common in r/rust posts, yet much of the syntax is ignored when tokenizing. It would be good to either recognize the code block and retain certain information that is normally stripped out (like `::` or `->`, for example), or it would likely be enough just to recognize a code block and classify whether it's Rust (since that would be very unlikely to see for the Rust game).
- **Tokenizing Reddit-specific things.** This is primarily for certain Reddit-specific things like `/u/username` or `/r/subreddit`.
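A minimal sketch covering the three cases above: a predicate deciding which tokens should be kept whole instead of being split apart by the tokenizer. All names here are hypothetical, not the project's actual code:

```rust
// Structure-aware pre-tokenizer pass: whole links, Rust-y code
// punctuation, and Reddit mentions survive as single tokens instead of
// being shredded into generic word fragments. Hypothetical sketch.
fn keep_whole(token: &str) -> bool {
    token.starts_with("http://")
        || token.starts_with("https://")
        || token.starts_with("/u/")
        || token.starts_with("/r/")
        || matches!(token, "::" | "->")
}

fn main() {
    assert!(keep_whole("http://www.google.com"));
    assert!(keep_whole("/r/rust"));
    assert!(keep_whole("->")); // Rust-ish code punctuation
    assert!(!keep_whole("hello")); // ordinary words still get tokenized
}
```

Anything passing `keep_whole` would bypass the normal word splitting, so `i.reddit.com` or `/u/username` stays a single, highly informative feature.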