ChenghaoMou opened 2 years ago
Since we are dealing with programming languages, what would be the downside of simply splitting on whitespace?
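For illustration, here is a minimal comparison (the snippet and the regex are my own, not taken from the script) of whitespace splitting versus a simple regex tokenizer on a line of code. Whitespace splitting keeps punctuation glued to identifiers, so `add(a,` and `a+b` become single tokens:

```python
import re

# Hypothetical example: whitespace splitting vs. a simple
# non-alphanumeric regex tokenizer on a snippet of code.
code = "def add(a, b):\n    return a+b"

whitespace_tokens = code.split()
regex_tokens = re.findall(r"\w+", code)

print(whitespace_tokens)  # ['def', 'add(a,', 'b):', 'return', 'a+b']
print(regex_tokens)       # ['def', 'add', 'a', 'b', 'return', 'a', 'b']
```

With whitespace tokens, renaming a variable or reformatting punctuation changes many tokens at once, which can lower the estimated similarity between near-duplicate files.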
Different tokenizers show slightly different results (all metrics are times in seconds except the last two columns, which are document counts):
Model | All | Loading | Minhash | Index | Query | Clustering | Deduplicate | Save | Before (docs) | After (docs) |
---|---|---|---|---|---|---|---|---|---|---|
codebert-base | 497.50 | 2.42 | 407.31 | 33.21 | 7.14 | 3.39 | 0.77 | 5.37 | 300000 | 265462 |
codegen-2B-multi | 470.88 | 2.29 | 382.11 | 31.97 | 7.00 | 3.32 | 0.77 | 5.73 | 300000 | 265590 |
codeparrot | 485.77 | 2.19 | 396.86 | 32.71 | 7.04 | 3.18 | 0.76 | 5.33 | 300000 | 267085 |
regex | 167.65 | 2.31 | 80.09 | 31.80 | 6.88 | 3.20 | 0.72 | 5.41 | 300000 | 268624 |
incoder-6B | 437.87 | 2.28 | 349.05 | 32.82 | 6.95 | 2.88 | 0.73 | 5.53 | 300000 | 271802 |
space | - | 2.18 | 0.18 | 31.17 | 6.87 | 2.42 | 0.04 | 5.28 | 300000 | 278664 |
As we extend deduplication to a wide range of languages, the choice of tokenization method will have an impact on the final results.
The current script uses a simple regex and uni-grams for the MinHash calculation. What are the consequences of using a different configuration?
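To make the configuration concrete, here is a minimal stdlib-only sketch of regex uni-gram MinHash (the exact regex, number of permutations, and hash function are assumptions for illustration, not the script's actual settings):

```python
import hashlib
import re

NUM_PERM = 64

def tokenize(text: str) -> set[str]:
    # Uni-gram token set from a simple regex; the actual
    # regex used by the script may differ.
    return set(re.findall(r"\w+", text))

def minhash(tokens: set[str], num_perm: int = NUM_PERM) -> list[int]:
    # One signature slot per "permutation": the minimum over all
    # tokens of a seeded 64-bit hash.
    sig = []
    for seed in range(num_perm):
        slot = min(
            int.from_bytes(hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        )
        sig.append(slot)
    return sig

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # Fraction of matching slots estimates the Jaccard similarity
    # of the underlying token sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash(tokenize("def add(a, b): return a + b"))
b = minhash(tokenize("def add(x, y): return x + y"))
print(estimated_jaccard(a, b))
```

Swapping the tokenizer (learned BPE vs. regex vs. whitespace) or moving from uni-grams to n-grams only changes what `tokenize` returns; the signature and query machinery stays the same, which is why the timing differences above are dominated by the Minhash column.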