Adding alternative minhash script

ChenghaoMou commented 2 years ago

This addresses the issue in #9 by detecting duplicate community in a graph and keeping only one index within one community. It also adds some reference points regarding #7:

tokenization method: regex method parameters: codeparrot/codeparrot-clean hardware: Apple Silicon (10-core, 64 GB) memory usage: TODO time: ~70 minutes duplication results, examples: see README results

ChenghaoMou commented 2 years ago

@loubnabnl @lvwerra Let me know what you think. If the speed-up scales well, then we should expect ~2.5x improvement over speed.

I haven't got the chance to test it with python-safe-license.

lvwerra commented 2 years ago

That's great! The full dataset is now up with each language in a folder. So you can load it with e.g.:

ds = load_dataset("bigcode/multi_safe_license_raw", data_dir="data/html", use_auth_token=True, split="train")

This will download roughly 800GB of HTML and is one of the largest languages. How long do you think this would take on a machine similar to the one in BS (200-400 cores, few TB of memory)? If we can do that in a reasonable time then we are good :)

ChenghaoMou commented 2 years ago

Reference on data/python: Total time and cost:

1.5 hrs hashing
2 hrs indexing
10 mins querying
2 hrs building clusters and finding connected components
0.5 hr post-processing

total: 6 hours ~$60 with a m1 on GCP

Results: 13190256 are kept, compared to 15027370 using the original script/the cluster version. They share 11034248 documents (73%).

The main difference should be introduced by the clustering/graph part.

cc @lvwerra @loubnabnl

ChenghaoMou commented 2 years ago

I also noticed that there are some duplicates if you just use repository_name + path. For example, 13190337 docs have 13190256 unique repository_name + path values in my results.

loubnabnl commented 2 years ago

Thanks for working on this! The deduplication is really fast and the numbers are not very far from the original script, we can probably train a small model to asses the change

Regarding the duplicate paths, @bigximik mentioned some edge cases when using git hashes for the exact deduplication. There could probably be more than what you found since we only kept one path for each file instead of all of those where it appeared, if necessary we can run exact deduplication with md5 hashes for example.

bigximik commented 2 years ago

@ChenghaoMou thanks

i have found that at least empty files have more than one git hash, also i have a bug what if symlink is pushed, i store its hash and size but content is of the linked file if it was reachable after git clone of that repo

However, it is better for me to look what exactly happened. Do you have any examples?

bigximik commented 2 years ago

i might know what happened, it might be we have different pushes of the same repo so head hash is different but repo name and file path is the same, i will check if we have these cases in db, i will check for repos, first.

however, the cases @ChenghaoMou discovered must be for files in those different version of the repos which have been changed or have other reason to have different git hash

bigximik commented 2 years ago

wonder if those files have been removed after near dedup

lvwerra commented 2 years ago

That's awesome work @ChenghaoMou! Do you have some insights into how each of the steps scales? E.g. if we have more cores available, which parts can we expect to go faster? For example, I'd expect hashing to very scalable, right?

ChenghaoMou commented 2 years ago

@bigximik here are some examples

repository_name,path
climachine/climfill,climfill/feature_engineering.py
LucasEmmes/pythonScripts,Sudoku Solver/sudoku.py
undeadyequ/protest-detection-violence-estimation,util.py
toutpuissantged/crack,src/core/config.py
lucasmiranda42/deepof,deepof/model_utils.py
rhwang10/elsa,bot/bot.py
surfedushare/search-portal,service/surf/apps/themes/admin.py
Sitelink3D-v2-Developer/sitelink3dv2-examples,components/reports/report_list/report_list.py

I did run the Jaccard similarity on some of those "duplicates", their scores are lower than the threshold. It looks like some are just old (but also bit more different) versions.

ChenghaoMou commented 2 years ago

@lvwerra Yes. This is what I observed:

Hashing scales with both the number of cores and single core performance (clock speed, for example). With caching, it also does not require much memory. Indexing is one bottleneck that ties to single core performance. It is basically putting data in to a dictionary, so it is very hard to parallelize. With a database backend(Redis/Cassandra), you can utilize multi-threads, but it will slow down the query significantly. Querying can scale with multiple cores, but it might also require a lot more memory depending on the system implementation of multiprocessing (fork with copy-on-write would enable sharing in-memory index across process and improve speed ideally) I did see a spike of memory usage doing so on GCP m1 machine, but it only takes a few minutes to finish. Need more investigation. Building a graph (slow) and finding connected components (fast) is another bottleneck that ties to single core performance (networkit uses OpenMP for some algorithms, but not all of them).

bigximik commented 2 years ago

@ChenghaoMou , yes, all those examples have more than one commit in the dataset:

climachine/climfill
['c4cd6797d932b7e21004e4172c9c5a05b2a24e47', '2bc1ace5d880e18d8f351373921f385a2c0d9bd8']
LucasEmmes/pythonScripts
['139e67cdf6d438b177410bc1b12db2ccf81fd591', '0316b355795cd7f99012321a94492d57af60dd8d']
undeadyequ/protest-detection-violence-estimation
['94446f89682c7052a15cbec4fb5092dd2dabf4b4', '0781e83d7b37353be1cb25ca2a4917de416a7891']
toutpuissantged/crack
['7e1228c1b7f4527d4ac603f59b30058a60f5f31a', '20fe3555ffc1a80f4a77b4956e30b85fa0406b97']
lucasmiranda42/deepof
['d98a4cb4276f6952b323430c0e38d55eaaebde18', '1c270d35c6c97e109df7e6aa2d6756ee3a2b55d0']
rhwang10/elsa
['80b3a105af13afa94fdd192b17ed41b715965a97', '0b8f09280cbd4e6992bec8dcbe9bf498419a36cc']
surfedushare/search-portal
['708a0d05eee13c696ca9abd7e84ab620d3900fbe', 'f5486d6b07b7b04a46ce707cee5174db4f8da222']
Sitelink3D-v2-Developer/sitelink3dv2-examples
['d9d832a2b488969e4998c5158ec39385b70d88e8', '254a79b359ad59e3ee086e4b4ba10e423f4b1c84']

bigximik commented 2 years ago

There are around 34K repos which have more than one head stored, it is between 137.36 M repos we have. It is result how i was recording completion and because i implicitly treated same reopos with different heads as different caring only not to download repo with the same head twice.

def get_head_hexsha_and_save_repo_info_if_duplicate_or_exist(repo_name, dst_root, timeout):
    hexsha = get_remote_head_hexsha(repo_name, timeout)
    if hexsha is None:
        return None

    repo_info = {
        'head_hexsha': hexsha,
        'name': repo_name
    }
    dst_path, repo_info_fn = get_repo_save_path_file(repo_info, dst_root)
    # already processed
    if repo_info_fn.is_file():
        return hexsha

    duplicates = list(dst_path.glob(repo_info["head_hexsha"] + "_*.pkl"))
    if len(duplicates) == 0:
        return None

    repo_info = pickle_load(duplicates[0])
    repo_info['name'] = repo_name
    save_repo_info(repo_info, dst_root)
    return hexsha

ChenghaoMou commented 2 years ago

I see. The only issue I can see is that when you have a file that has two versions, say the old one has one function and the new one has ten functions. The one function might be a duplicate in a sense that it is repeated in the new file, but their file-level Jaccard similarity would be very low. This is somehow related to #8. But it might also just show the model the evolution of the code. Would be interesting to see the impact with multiple git hashes and with latest git hashes.

ChenghaoMou commented 2 years ago

@loubnabnl I think it is ready for merge. Please let me know if I am missing anything.

bigcode-project / bigcode-analysis

Adding alternative minhash script #11