bigcode-project / bigcode-analysis

Repository for analysis and experiments in the BigCode project.
Apache License 2.0

[Near Deduplication] Tokenization #10

Open ChenghaoMou opened 1 year ago

ChenghaoMou commented 1 year ago

As we extend deduplication to a wide range of languages, the choice of tokenization method will affect the final results.

The current script uses a simple regex and uni-grams to perform the MinHash calculation. What are the consequences of using a different configuration?
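To make the setup concrete, here is a minimal sketch of regex uni-gram tokenization feeding a MinHash signature, using only the standard library. The `\W+` splitting regex, the number of permutations, and all function names are assumptions for illustration, not the exact configuration of the current script.

```python
import hashlib
import re

NON_ALPHA = re.compile(r"\W+")  # assumed token-splitting regex
NUM_PERM = 128                  # assumed number of permutations

def tokenize(code: str) -> set[str]:
    """Split source code into uni-gram tokens on non-word characters."""
    return {t for t in NON_ALPHA.split(code) if t}

def minhash(tokens: set[str], num_perm: int = NUM_PERM) -> list[int]:
    """Build a signature: one seeded hash function per permutation slot,
    keeping the minimum hash value over all tokens for each slot."""
    sig = []
    for seed in range(num_perm):
        slot_min = min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big"
            )
            for t in tokens
        )
        sig.append(slot_min)
    return sig

def jaccard_estimate(a: list[int], b: list[int]) -> float:
    """The fraction of matching slots estimates the Jaccard similarity
    of the two underlying token sets."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Two near-duplicate snippets then produce signatures whose slot agreement approximates their token-set overlap, which is what the deduplication index thresholds on.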

lvwerra commented 1 year ago

Since we are dealing with code languages, what would be the downside of whitespace tokenization?
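One way to see the trade-off in the question above: whitespace splitting leaves punctuation glued to identifiers, while a non-word regex split (an assumed `\W+` pattern, mirroring the regex approach discussed in this thread) separates them, producing different token sets for the same snippet.

```python
import re

code = "def add(a,b):\n    return a+b"

# Whitespace split keeps punctuation attached to identifiers ...
ws_tokens = code.split()
# ... while splitting on non-word characters separates them.
re_tokens = [t for t in re.split(r"\W+", code) if t]

print(ws_tokens)  # ['def', 'add(a,b):', 'return', 'a+b']
print(re_tokens)  # ['def', 'add', 'a', 'b', 'return', 'a', 'b']
```

Whichever behavior is preferable for near-deduplication, the two schemes will generally shingle the same file differently, so they can disagree on which pairs cross the similarity threshold.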

ChenghaoMou commented 1 year ago

Different tokenizers show slightly different results (all columns are times in seconds except the last two, which are document counts):

| Model | All | Loading | Minhash | Index | Query | Clustering | Deduplicate | Save | Before | After |
|---|---|---|---|---|---|---|---|---|---|---|
| codebert-base | 497.50 | 2.42 | 407.31 | 33.21 | 7.14 | 3.39 | 0.77 | 5.37 | 300000 | 265462 |
| codegen-2B-multi | 470.88 | 2.29 | 382.11 | 31.97 | 7.00 | 3.32 | 0.77 | 5.73 | 300000 | 265590 |
| codeparrot | 485.77 | 2.19 | 396.86 | 32.71 | 7.04 | 3.18 | 0.76 | 5.33 | 300000 | 267085 |
| regex | 167.65 | 2.31 | 80.09 | 31.80 | 6.88 | 3.20 | 0.72 | 5.41 | 300000 | 268624 |
| incoder-6B | 437.87 | 2.28 | 349.05 | 32.82 | 6.95 | 2.88 | 0.73 | 5.53 | 300000 | 271802 |
| space | - | 2.18 | 0.18 | 31.17 | 6.87 | 2.42 | 0.04 | 5.28 | 300000 | 278664 |