Open ChenghaoMou opened 2 years ago
Model | Deduplication Method | Type | Comment | Src |
---|---|---|---|---|
CodeGeeX | Paper not available | | | https://models.aminer.cn/codegeex/blog/ |
InCoder | Exact match based on alphanumeric tokens/md5 + Bloom filter | Exact | Many other analyses on decontamination, filtering | https://arxiv.org/abs/2204.05999 |
CodeGen | Exact match based on sha256 hashes | Exact | | https://arxiv.org/abs/2203.13474 |
AlphaCode | Exact match ignoring whitespace | Exact | | https://arxiv.org/abs/2203.07814 |
PolyCoder | Exact match based on sha256 hashes | Exact | | https://github.com/VHellendoorn/Code-LMs/blob/main/Data/deduplicate.py |
PaLM Coder | Levenshtein distance | Near | | https://arxiv.org/abs/2204.02311 |
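
For reference, the exact-match rows in the table can be sketched in a few lines. This is only an illustration of the general idea (hash a normalized copy of each file and keep the first occurrence), not the code used by any of the papers. The function names and the `mode` values are made up for this example, and the `alnum_tokens` mode is a guess at what "alphanumeric tokens" could mean.

```python
import hashlib
import re


def normalize(text: str, mode: str = "raw") -> str:
    """Normalize a file before hashing. The modes are illustrative only."""
    if mode == "raw":
        # Hash the file content as-is.
        return text
    if mode == "no_whitespace":
        # Drop all whitespace so formatting-only differences are ignored.
        return "".join(text.split())
    if mode == "alnum_tokens":
        # Keep only alphanumeric runs, joined by single spaces (an assumption).
        return " ".join(re.findall(r"[A-Za-z0-9]+", text))
    raise ValueError(f"unknown mode: {mode}")


def exact_dedup(docs, mode="raw"):
    """Keep the first document for each distinct sha256 digest."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc, mode).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

With `mode="raw"` only byte-identical files collapse. With `mode="no_whitespace"`, `"a = 1\n"` and `"a=1"` hash to the same digest and only the first is kept.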
If we have a handful of deduplication strategies, we could run some smaller training runs to evaluate these approaches. We'll be working on the science plan in the next few days/weeks, and in general preprocessing (incl. dedup) ablations will probably be in there for some studies.
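
For comparing against a near-duplicate strategy on a small sample, here is a rough sketch of Levenshtein-based filtering like the PaLM Coder row in the table. The similarity definition (one minus edit distance over the longer length) and the `threshold` value are assumptions for illustration; the paper does not spell these out here. The pairwise loop is quadratic in the number of files, so this is only usable on small samples.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def near_dedup(docs, threshold=0.9):
    """Greedily drop any document too similar to one already kept.

    `threshold` is a hypothetical similarity cutoff, not a value from a paper.
    """
    kept = []
    for doc in docs:
        is_dup = False
        for other in kept:
            longest = max(len(doc), len(other))
            sim = 1.0 if longest == 0 else 1.0 - levenshtein(doc, other) / longest
            if sim >= threshold:
                is_dup = True
                break
        if not is_dup:
            kept.append(doc)
    return kept
```

Running `exact` and `near` filters on the same sample and counting how many files each keeps gives a cheap first signal before committing to training runs.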
Provide results on a large dataset with different near-deduplication methods:
Details to be included: