CarperAI / Code-Pile

This repository contains all the code for collecting large scale amounts of code from GitHub.

Bitbucket Code #34

Open PhungVanDuy opened 2 years ago

PhungVanDuy commented 2 years ago

Title

Dataset URL - here

Does the dataset exist in a scraped format?
URL if Yes - here

Description

Got 1,261,420 repos from Bitbucket that we can download. Each repo record includes the fields: ['type', 'full_name', 'links', 'name', 'slug', 'description', 'scm', 'website', 'owner', 'workspace', 'is_private', 'project', 'fork_policy', 'created_on', 'updated_on', 'size', 'language', 'has_issues', 'has_wiki', 'uuid', 'mainbranch', 'override_settings', 'parent'].
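
For context, a minimal sketch of how such a list can be paged out of the public Bitbucket REST API (/2.0/repositories); the max_pages cap and lack of retry handling here are illustrative assumptions, not the script actually used:

import requests

def list_public_repos(max_pages=10):
    # The public repository listing is paginated: each response carries
    # its records in "values" and a "next" URL until the listing ends.
    url = "https://api.bitbucket.org/2.0/repositories?pagelen=100"
    repos = []
    for _ in range(max_pages):
        resp = requests.get(url)
        resp.raise_for_status()
        page = resp.json()
        repos.extend(page["values"])
        url = page.get("next")
        if url is None:
            break
    return repos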

Procedure

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy_dataset should include the columns for the data and metadata associated with the dataset, which will then be converted into the final format for language model consumption, along with an example row or rows that you can verify your code correctly collects. In addition to this file, include the unit test that evaluates your code against this dummy_dataset.

Give an example of the columns and data:

col1 col2 ....
row1 row1 ....
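
As a concrete illustration, a minimal sketch of what such a dummy_dataset.parquet and its unit test could look like; the column names and example row are hypothetical, not the final schema:

import pandas as pd

# Hypothetical dummy dataset: one row with the data and metadata
# columns the collector is expected to produce.
dummy = pd.DataFrame(
    [{"full_name": "workspace/example-repo",
      "content": "print('hello')",
      "language": "python",
      "uuid": "{1234-abcd}"}]
)
dummy.to_parquet("dummy_dataset.parquet")

def test_dummy_dataset_schema():
    # The unit test reads the checked-in file and verifies the columns
    # and the example row the collector should reproduce.
    df = pd.read_parquet("dummy_dataset.parquet")
    assert list(df.columns) == ["full_name", "content", "language", "uuid"]
    assert df.iloc[0]["full_name"] == "workspace/example-repo"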
CamdenClark commented 2 years ago

I got started here using the Google Drive parquet file, but ran into some questions before I could make much more progress.

import pyarrow.parquet as pq
from urllib.request import urlretrieve
import shutil
import os

# Load just the columns we need from the scraped repo list.
bitbucket_repos = pq.read_table("bitbucket_version_1.parquet", columns=["full_name", "mainbranch", "description", "uuid"]).to_pandas()
# "mainbranch" is a nested struct; keep only the branch name (or None).
bitbucket_repos["mainbranch"] = bitbucket_repos["mainbranch"].apply(lambda x: x and x["name"])

def download_repo(repo):
    main_branch = repo["mainbranch"]
    full_name = repo["full_name"]
    repo_uuid = repo["uuid"]
    if not main_branch:
        # Some repos have no recorded main branch; skip them.
        return None
    zip_link = "https://bitbucket.org/" + full_name + "/get/" + main_branch + ".zip"
    zip_path = "./" + repo_uuid + ".zip"
    try:
        # urlopen alone only opens the stream; the archive has to be
        # saved to disk before shutil.unpack_archive can read it.
        urlretrieve(zip_link, zip_path)
        shutil.unpack_archive(zip_path, extract_dir="./" + repo_uuid)
        return repo
    except Exception:
        return None

# For now, all this does is ensure the existence of a license file
# Question: is it good enough to look for a substring of known licenses
# to ensure that the repo we're scraping has the license we would expect?
# Is there any prior art here?
def open_license(repo):
    for root, dirs, files in os.walk("./" + repo["uuid"]):
        for name in files:
            if name in ("LICENSE", "LICENSE.txt", "LICENSE.md"):
                return True
    return False

def traverse_repo(repo):
    for root, dirs, files in os.walk("./" + repo["uuid"]):
        for name in files:
            print(name)
  1. The original Pile dataset from GitHub used stars to filter out low-quality repositories. Bitbucket doesn't really have anything comparable, though it does have the concept of "watchers." Do we care about doing this level of filtering?
  2. How do we know whether a license matches one of the ones above? Is it good enough to do a substring match on a distinctive part of the license text? (A sketch of that idea follows this list.) Is there any prior art here?
  3. I think a lot of the open source content on Bitbucket is pretty old, and some of the projects were migrated to GitHub. Is it worth the time to include this content in this project?
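
A minimal sketch of the substring idea from question 2; the phrases below are illustrative excerpts from well-known license texts, not a vetted identification scheme:

# Hypothetical sketch: classify a LICENSE file by searching for one
# distinctive phrase per known license.
KNOWN_LICENSE_PHRASES = {
    "MIT": "Permission is hereby granted, free of charge",
    "Apache-2.0": "Licensed under the Apache License, Version 2.0",
    "GPL-3.0": "GNU GENERAL PUBLIC LICENSE",
}

def identify_license(license_text):
    normalized = " ".join(license_text.split())  # collapse whitespace
    for license_name, phrase in KNOWN_LICENSE_PHRASES.items():
        if phrase in normalized:
            return license_name
    return None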
PhungVanDuy commented 2 years ago

Hi @CamdenClark,

Thank you for considering picking this one up.

  1. We have not started this one yet, so we still have not decided how to filter the repos. Using watchers is a good idea, and I am also thinking about filtering out repos with very few watchers; my one concern is that, compared with GitHub, Bitbucket code is more personal, so the number of repos filtered out would be large. The API for this is here: https://developer.atlassian.com/cloud/bitbucket/rest/api-group-repositories/#api-repositories-workspace-repo-slug-watchers-get (a sketch follows this list). Bitbucket rate-limits its API by IP, so I used Kaggle Kernels and Google Colab to get the repo list above; you might consider doing the same when you work with the Bitbucket API.
  2. You are right, the license question is the one giving us a headache. Maybe we only keep repos that have a LICENSE file. We still need to discuss this further. Should we sample 1000 repos or so and count how many contain a LICENSE file?
  3. Actually there are still a lot of recent repos on Bitbucket; I saw this when I used their API. You can see the file below: it is the notebook I used to get the repos created in 2022.
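
A minimal sketch of using that watchers endpoint as a popularity filter; the threshold, the unauthenticated calls, and the pagination fields are assumptions based on Bitbucket's standard paginated response shape:

import requests

def count_watchers(full_name):
    # Page through the watchers endpoint linked in point 1; paginated
    # responses carry records in "values" and a "next" URL when there
    # are more pages.
    url = "https://api.bitbucket.org/2.0/repositories/" + full_name + "/watchers"
    total = 0
    while url:
        page = requests.get(url).json()
        total += len(page.get("values", []))
        url = page.get("next")
    return total

def is_popular_enough(full_name, min_watchers=2):  # threshold is a guess
    return count_watchers(full_name) >= min_watchers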

We are currently focusing on a release based on a prioritized list of datasets, so decisions about Bitbucket may not be settled at this point. To keep things less confusing and to save you time, you could look at and start with other, more clearly specified datasets like https://github.com/CarperAI/Code-Pile/issues/4 or https://github.com/CarperAI/Code-Pile/issues/33

CamdenClark commented 2 years ago

Thanks for the quick response! I will focus on GitLab instead, then.