bigcode-project / bigcode-analysis

Repository for analysis and experiments in the BigCode project.
Apache License 2.0

[Near Deduplication] Tokenization #10

Open ChenghaoMou opened 1 year ago

ChenghaoMou commented 1 year ago

As we extend deduplication to a wide range of languages, the choice of tokenization method will affect the final results.

The current script uses a simple regex and uni-grams to perform the MinHash calculation. What are the consequences of using a different configuration?
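To make the setup concrete, here is a minimal sketch of regex uni-gram tokenization feeding a MinHash signature, using only the standard library. The `\W+` splitting regex, the number of permutations, and all function names are assumptions for illustration, not the exact configuration of the current script.

```python
import hashlib
import re

NON_ALPHA = re.compile(r"\W+")  # assumed token-splitting regex
NUM_PERM = 128                  # assumed number of permutations

def tokenize(code: str) -> set[str]:
    """Split source code into uni-gram tokens on non-word characters."""
    return {t for t in NON_ALPHA.split(code) if t}

def minhash(tokens: set[str], num_perm: int = NUM_PERM) -> list[int]:
    """Build a signature: one seeded hash function per permutation slot,
    keeping the minimum hash value over all tokens for each slot."""
    sig = []
    for seed in range(num_perm):
        slot_min = min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big"
            )
            for t in tokens
        )
        sig.append(slot_min)
    return sig

def jaccard_estimate(a: list[int], b: list[int]) -> float:
    """The fraction of matching slots estimates the Jaccard similarity
    of the two underlying token sets."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Two near-duplicate snippets then produce signatures whose slot agreement approximates their token-set overlap, which is what the deduplication index thresholds on.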

lvwerra commented 1 year ago

Since we are dealing with code languages, what would be the downside of whitespace tokenization?
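One way to see the trade-off in the question above: whitespace splitting leaves punctuation glued to identifiers, while a non-word regex split (an assumed `\W+` pattern, mirroring the regex approach discussed in this thread) separates them, producing different token sets for the same snippet.

```python
import re

code = "def add(a,b):\n    return a+b"

# Whitespace split keeps punctuation attached to identifiers ...
ws_tokens = code.split()
# ... while splitting on non-word characters separates them.
re_tokens = [t for t in re.split(r"\W+", code) if t]

print(ws_tokens)  # ['def', 'add(a,b):', 'return', 'a+b']
print(re_tokens)  # ['def', 'add', 'a', 'b', 'return', 'a', 'b']
```

Whichever behavior is preferable for near-deduplication, the two schemes will generally shingle the same file differently, so they can disagree on which pairs cross the similarity threshold.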

ChenghaoMou commented 1 year ago

Different tokenizers show slightly different results (all columns are times in seconds except the last two, which are document counts):

| Model | All | Loading | Minhash | Index | Query | Clustering | Deduplicate | Save | Before | After |
|---|---|---|---|---|---|---|---|---|---|---|
| codebert-base | 497.50 | 2.42 | 407.31 | 33.21 | 7.14 | 3.39 | 0.77 | 5.37 | 300000 | 265462 |
| codegen-2B-multi | 470.88 | 2.29 | 382.11 | 31.97 | 7.00 | 3.32 | 0.77 | 5.73 | 300000 | 265590 |
| codeparrot | 485.77 | 2.19 | 396.86 | 32.71 | 7.04 | 3.18 | 0.76 | 5.33 | 300000 | 267085 |
| regex | 167.65 | 2.31 | 80.09 | 31.80 | 6.88 | 3.20 | 0.72 | 5.41 | 300000 | 268624 |
| incoder-6B | 437.87 | 2.28 | 349.05 | 32.82 | 6.95 | 2.88 | 0.73 | 5.53 | 300000 | 271802 |
| space | - | 2.18 | 0.18 | 31.17 | 6.87 | 2.42 | 0.04 | 5.28 | 300000 | 278664 |