CarperAI / Code-Pile

This repository contains all the code for collecting large amounts of code from GitHub.
MIT License

Bitbucket diffs #5

Open aaronrmm opened 1 year ago

aaronrmm commented 1 year ago

Bitbucket has an API for public repos

Dataset URL - None

Does the dataset exist in a scraped format? No (searched using Google, Papers with Code, and Kaggle).

Description

Bitbucket is far less popular than GitHub for open-source git repos, but it does host them and provides an API for querying and filtering them. Because Bitbucket has no stars as GitHub does, we would have to approximate popularity with the number of watchers or the number of contributors. Repositories can also be filtered by language. They do not appear to be filterable by license.

Procedure

  1. Estimate the value of a Bitbucket dataset by pulling metrics on its open-source repositories. Using the Bitbucket API, pull the following information:

    • number of public repositories
    • distribution of watchers per repository
    • distribution of contributors per repository
    • number of commits per repository
  2. With the above information, determine a good metric for how repositories should be prioritized. Sort the repo list with this metric.

  3. Start pulling commit diffs from the highest-priority repos (see the Bitbucket API docs).
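Step 2 of the procedure could be sketched roughly as follows. This is only an illustration: the field names `watchers`, `contributors`, and `commits`, the weights, and the log scaling are all assumptions made for the example, not something decided in this issue.

```python
import math
from typing import Iterable, List


def priority_score(repo: dict,
                   w_watchers: float = 1.0,
                   w_contributors: float = 1.0,
                   w_commits: float = 0.5) -> float:
    """Combine per-repository counts into one score (hypothetical metric).

    log1p dampens very large counts so a single huge repository does not
    dominate, and it maps a count of 0 to a score contribution of 0.
    """
    return (
        w_watchers * math.log1p(repo.get("watchers", 0))
        + w_contributors * math.log1p(repo.get("contributors", 0))
        + w_commits * math.log1p(repo.get("commits", 0))
    )


def prioritize(repos: Iterable[dict]) -> List[dict]:
    """Return repositories sorted from highest to lowest priority."""
    return sorted(repos, key=priority_score, reverse=True)


# Made-up sample data, for illustration only.
sample_repos = [
    {"full_name": "example/a", "watchers": 2, "contributors": 1, "commits": 10},
    {"full_name": "example/b", "watchers": 50, "contributors": 8, "commits": 900},
    {"full_name": "example/c", "watchers": 0, "contributors": 1, "commits": 3},
]
ranked_names = [r["full_name"] for r in prioritize(sample_repos)]
```

Once the real distributions from step 1 are available, the weights (or the whole formula) would need to be revisited.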

LouisCastricato commented 1 year ago

Great idea!

aaronrmm commented 1 year ago

This is annoying: https://community.developer.atlassian.com/t/cant-filter-public-repos-with-bitbucket-api/61919

LouisCastricato commented 1 year ago

Any ideas on how to fix this?

aaronrmm commented 1 year ago

I can pull it all down unfiltered. Or all repos since a certain timestamp. Otherwise, no.

aaronrmm commented 1 year ago

...which is what I'm currently doing.
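For reference, pulling every public repository created since a timestamp might look like the sketch below. It assumes the public listing endpoint `https://api.bitbucket.org/2.0/repositories` accepts an `after` query parameter and returns pages with `values` and `next` keys, as Atlassian's documentation describes; the helper names `http_get_json` and `iter_public_repos` are made up for this example.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Optional

# Assumed endpoint for listing public repositories.
API_ROOT = "https://api.bitbucket.org/2.0/repositories"


def http_get_json(url: str) -> dict:
    """Fetch a URL and decode the JSON body (no auth, no retries)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def iter_public_repos(after: Optional[str] = None,
                      fetch: Callable[[str], dict] = http_get_json,
                      max_pages: Optional[int] = None) -> Iterator[dict]:
    """Yield repository objects page by page.

    `after` is an ISO-8601 timestamp; `fetch` is injectable so the
    pagination logic can be exercised without touching the network.
    """
    url: Optional[str] = API_ROOT
    if after:
        url = API_ROOT + "?" + urllib.parse.urlencode({"after": after})
    pages = 0
    while url:
        page = fetch(url)
        yield from page.get("values", [])
        pages += 1
        if max_pages is not None and pages >= max_pages:
            break
        # Assumption: the response carries the next page URL under "next"
        # and omits the key on the last page.
        url = page.get("next")
```

Calling `iter_public_repos(after="2020-01-01T00:00:00Z", max_pages=1)` would fetch a single page. A real crawl would also need rate limiting and error handling.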

ncoop57 commented 1 year ago

Are there any existing indexes we could use? @aaronrmm

herbiebradley commented 1 year ago

If you can get the commit message and hash then it should be simple to adapt the code at #31 to fit Bitbucket.

aaronrmm commented 1 year ago

I can get the message, hash, diff, author, date, patch, and parent commit, which I think is everything needed. I am currently grabbing all the commit hashes for all the repos.
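As an illustration, the fields listed above could be flattened into one record per commit as sketched below. The JSON layout used here (`hash`, `message`, `date`, `author.raw`, `parents[].hash`, `links.diff.href`, `links.patch.href`) is an assumption about what the Bitbucket commits endpoint returns, and `CommitRecord` and `parse_commit` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CommitRecord:
    hash: str
    message: str
    author: str
    date: str
    parents: List[str]
    diff_url: Optional[str]
    patch_url: Optional[str]


def parse_commit(obj: dict) -> CommitRecord:
    """Flatten one commit object into a CommitRecord.

    Missing optional fields fall back to empty values instead of raising,
    since the exact response shape is assumed rather than verified here.
    """
    links = obj.get("links", {})
    return CommitRecord(
        hash=obj["hash"],
        message=obj.get("message", ""),
        author=obj.get("author", {}).get("raw", ""),
        date=obj.get("date", ""),
        parents=[p["hash"] for p in obj.get("parents", [])],
        diff_url=links.get("diff", {}).get("href"),
        patch_url=links.get("patch", {}).get("href"),
    )
```

The diff and patch bodies would still have to be fetched separately from the two URLs.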

PhungVanDuy commented 1 year ago

Hi everyone, in terms of code from repos, I just managed to list all public repositories from Bitbucket through their API. The API is rate limited (1000 calls per hour), so I used Kaggle to create multiple notebooks (different IPs) to get around this, and I finally made progress.

To summarize, I got 1261420 repos from Bitbucket that we can download. I attached a sample of the data here; the full dataset can be found at https://drive.google.com/file/d/13QsJRhhpL64m3jhsH4up0CBtxDIalO-A/view?usp=sharing. For each repo, the data includes: ['type', 'full_name', 'links', 'name', 'slug', 'description', 'scm', 'website', 'owner', 'workspace', 'is_private', 'project', 'fork_policy', 'created_on', 'updated_on', 'size', 'language', 'has_issues', 'has_wiki', 'uuid', 'mainbranch', 'override_settings', 'parent']. We can filter on size, language, and so on.

I wrote a script to download all the repos, but we need to discuss the server and storage for this data. If we can also download this data from GitLab, alongside the GitHub data we already have, it will be a great resource. cc @ncoop57
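A filter over this metadata could look like the sketch below. It uses only the `is_private`, `language`, and `size` fields from the list above; the unit of `size` is assumed to be bytes, and the thresholds and helper names are made up for the example.

```python
from typing import Iterable, Iterator, Set


def keep_repo(repo: dict, languages: Set[str], max_size: int) -> bool:
    """Keep public, non-empty repos in a wanted language under a size cap."""
    language = (repo.get("language") or "").lower()
    size = repo.get("size") or 0  # assumed to be bytes
    return (
        not repo.get("is_private", False)
        and language in languages
        and 0 < size <= max_size
    )


def filter_repos(repos: Iterable[dict],
                 languages: Iterable[str],
                 max_size: int) -> Iterator[dict]:
    """Lazily yield the repositories that pass keep_repo."""
    wanted = {lang.lower() for lang in languages}
    return (r for r in repos if keep_repo(r, wanted, max_size))


def seconds_between_calls(limit_per_hour: int = 1000) -> float:
    """Spacing between requests that stays within an hourly call limit."""
    return 3600.0 / limit_per_hour
```

With the 1000 calls per hour limit mentioned above, a single client would need to wait about 3.6 seconds between requests.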

ncoop57 commented 1 year ago

epic @PhungVanDuy!!! though it might be worth opening up a completely separate issue with this info since this issue I think is specifically only for diffs.

PhungVanDuy commented 1 year ago

Thank you for the suggestion. I just created a new issue: https://github.com/CarperAI/Code-Pile/issues/34