bigcode-project / bigcode-analysis

Repository for analysis and experiments in the BigCode project.
Apache License 2.0

[Near Deduplication] Benchmark #7

Open ChenghaoMou opened 2 years ago

ChenghaoMou commented 2 years ago

Provide results on a large dataset with different near-deduplication methods:

  1. minhash + lsh
  2. simhash
  3. any relevant methods
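
For reference, the first option (MinHash + LSH) can be sketched roughly as below. This is only an illustration, not the benchmark code: the function names, the 5-token shingles, the 128 hash functions, and the 32 bands are all assumptions, and it uses only the standard library rather than a dedicated MinHash package.

```python
import hashlib
import re
from collections import defaultdict


def shingles(text, k=5):
    """Return the set of k-token shingles of a document (k is an assumed default)."""
    tokens = re.findall(r"\w+", text)
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash64(item, seed):
    """64-bit hash of a string, salted with the seed to emulate a hash family."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_perm=128):
    """Keep the minimum hash per seeded function; equal slots estimate Jaccard."""
    return [min(_hash64(s, seed) for s in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature slots on which two documents agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=32):
    """Split each signature into bands; documents sharing a band are candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        pairs.update((x, y) for i, x in enumerate(ordered) for y in ordered[i + 1:])
    return pairs


docs = {
    "a": "def add(a, b):\n    return a + b\n",
    "b": "def add(a,b):  return a+b",
    "c": "import os\nprint(os.getcwd())\nprint(os.listdir('.'))",
}
signatures = {name: minhash_signature(shingles(text)) for name, text in docs.items()}
candidates = lsh_candidate_pairs(signatures)
```

Candidate pairs would normally be re-checked with `estimated_jaccard` (or an exact Jaccard computation) against a threshold before anything is dropped. More bands with fewer rows each raises recall at the cost of more false candidates.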

Details to be included:

ChenghaoMou commented 2 years ago

| Model | Deduplication Method | Type | Comment | Src |
| --- | --- | --- | --- | --- |
| CodeGeeX | Paper Not Available | | | https://models.aminer.cn/codegeex/blog/ |
| InCoder | Exact match based on alphanumeric tokens/md5 + Bloom filter | Exact | Many other analyses on decontamination, filtering | https://arxiv.org/abs/2204.05999 |
| CodeGen | Exact match based on sha256 hashes | Exact | | https://arxiv.org/abs/2203.13474 |
| AlphaCode | Exact match ignoring whitespaces | Exact | | https://arxiv.org/abs/2203.07814 |
| PolyCode | Exact match sha256 | Exact | | https://github.com/VHellendoorn/Code-LMs/blob/main/Data/deduplicate.py |
| PaLM Coder | Levenshtein distance | Near | | https://arxiv.org/abs/2204.02311 |
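
As a baseline for comparison, the exact-match approaches listed above boil down to hashing some normalized form of each file. A minimal sketch, assuming sha256 over whitespace-stripped content; the names `content_key` and `exact_dedup` are hypothetical, and this is not the actual pipeline of any of the models listed:

```python
import hashlib


def content_key(source, ignore_whitespace=True):
    """Hash a file's content, optionally after removing all whitespace."""
    if ignore_whitespace:
        source = "".join(source.split())
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def exact_dedup(files):
    """Keep the first file seen for each distinct content key."""
    seen = set()
    kept = []
    for path, source in files:
        key = content_key(source)
        if key not in seen:
            seen.add(key)
            kept.append(path)
    return kept


files = [("a.py", "x = 1\n"), ("b.py", "x=1"), ("c.py", "y = 2")]
kept = exact_dedup(files)
```

Here `b.py` is dropped because it differs from `a.py` only in whitespace. Any other change (a renamed variable, an added comment) yields a different hash, which is the gap near-deduplication is meant to close.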

lvwerra commented 2 years ago

If we have a handful of deduplication strategies, we could run some smaller model training runs to evaluate these approaches. We'll be working on the science plan in the next few days/weeks, and in general preprocessing (incl. dedup) ablations will probably be in there for some studies.