PhungVanDuy opened this issue 2 years ago
I got started here using the Google Drive parquet file, but ran into some questions before I could make much more progress.
import os
import shutil
import pyarrow.parquet as pq
from urllib.error import HTTPError, URLError
from urllib.request import urlretrieve

bitbucket_repos = pq.read_table(
    "bitbucket_version_1.parquet",
    columns=["full_name", "mainbranch", "description", "uuid"],
).to_pandas()
# mainbranch is a struct; keep just the branch name (or None).
bitbucket_repos["mainbranch"] = bitbucket_repos["mainbranch"].apply(lambda x: x and x["name"])

def download_repo(repo):
    main_branch = repo["mainbranch"]
    full_name = repo["full_name"]
    repo_uuid = repo["uuid"]
    zip_link = "https://bitbucket.org/" + full_name + "/get/" + main_branch + ".zip"
    zip_path = "./" + main_branch + ".zip"
    try:
        # urlopen alone does not write anything to disk; urlretrieve saves
        # the archive so unpack_archive has a file to work with.
        urlretrieve(zip_link, zip_path)
        shutil.unpack_archive(zip_path, extract_dir="./" + repo_uuid)
        return repo
    except (HTTPError, URLError, shutil.ReadError, OSError):
        return None

# For now, all this does is ensure the existence of a license file.
# Question: is it good enough to look for a substring of known licenses
# to ensure that the repo we're scraping has the license we would expect?
# Is there any prior art here?
def open_license(repo):
    for root, dirs, files in os.walk("./" + repo["uuid"]):
        for name in files:
            if name in ("LICENSE", "LICENSE.txt", "LICENSE.md"):
                return True
    return False

def traverse_repo(repo):
    for root, dirs, files in os.walk("./" + repo["uuid"]):
        for name in files:
            print(name)
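On the question in the comments about matching substrings of known licenses: a minimal sketch of that approach is below. The marker strings and SPDX-style labels are illustrative assumptions on my part, not a vetted list.

```python
# Substring-based license detection sketch. Each key is a phrase assumed to
# appear in that license's text; matching is case-insensitive and returns
# the first hit in dictionary order.
LICENSE_MARKERS = {
    "MIT License": "MIT",
    "Apache License": "Apache-2.0",
    "GNU GENERAL PUBLIC LICENSE": "GPL",
    "BSD 3-Clause": "BSD-3-Clause",
}

def detect_license(text):
    """Return the label of the first matching license marker, or None."""
    lowered = text.lower()
    for marker, label in LICENSE_MARKERS.items():
        if marker.lower() in lowered:
            return label
    return None
```

In practice this would be run on the contents of the LICENSE file found by `open_license`; whether substring matching is robust enough is exactly the open question above.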
Hi @CamdenClark,
Thank you for considering picking this one up.
watchers
is a good idea, but I would also consider filtering out repos with very few watchers. One thing I am concerned about is that, compared with GitHub, Bitbucket source code tends to be more personal, so the number of repos filtered out could be large.
You can refer to the API for this here: https://developer.atlassian.com/cloud/bitbucket/rest/api-group-repositories/#api-repositories-workspace-repo-slug-watchers-get
Bitbucket rate-limits its API by IP, so I used Kaggle kernels and Google Colab to build the repo list above; you might consider the same approach when working with the Bitbucket API. We are currently focusing on a release based on a prioritized list of datasets, so decisions about Bitbucket may not be settled at this point. To make things less confusing and save your time, you could look at and start with other, more clearly scoped datasets such as https://github.com/CarperAI/Code-Pile/issues/4 or https://github.com/CarperAI/Code-Pile/issues/33
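To make the watcher-based filter concrete, here is a hedged sketch against the watchers endpoint linked above. The URL template follows the documented route; the `"size"` field on the paginated response and the `MIN_WATCHERS` cutoff are assumptions, and the network call is isolated in its own function so the threshold logic can be checked offline.

```python
import json
from urllib.request import urlopen

API_TEMPLATE = "https://api.bitbucket.org/2.0/repositories/{full_name}/watchers"
MIN_WATCHERS = 2  # assumed cutoff, not a project decision

def watchers_url(full_name):
    """Build the watchers endpoint URL for a workspace/repo_slug pair."""
    return API_TEMPLATE.format(full_name=full_name)

def fetch_watcher_count(full_name):
    """Fetch the watcher total; assumes the paginated response carries 'size'."""
    with urlopen(watchers_url(full_name)) as resp:
        return json.load(resp).get("size", 0)

def keep_repo(watcher_count, min_watchers=MIN_WATCHERS):
    """Decide whether a repo passes the minimum-watchers filter."""
    return watcher_count >= min_watchers
```

Because of the per-IP rate limit mentioned above, any real run would need batching or rotating runners (Kaggle/Colab) around `fetch_watcher_count`.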
Thanks for the quick response! I will focus on GitLab instead, then.
Title
Dataset URL - here
Does the dataset exist in a scraped format?
URL if Yes - here
Description
Got 1,261,420 repos from Bitbucket that we can download. Each repo record includes the following fields: ['type', 'full_name', 'links', 'name', 'slug', 'description', 'scm', 'website', 'owner', 'workspace', 'is_private', 'project', 'fork_policy', 'created_on', 'updated_on', 'size', 'language', 'has_issues', 'has_wiki', 'uuid', 'mainbranch', 'override_settings', 'parent'].
Procedure
Tests
Include a dummy_dataset.parquet file to test your code against. This dummy dataset should include the columns for the data and metadata associated with the dataset (which will then be converted into the final format for language-model consumption), along with an example row or rows that you can verify your code collects correctly. In addition to this file, include the unit test that evaluates your code against this dummy dataset.
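A hedged sketch of what such a unit test could check. `REQUIRED_COLUMNS` is the subset used by the download script earlier in this thread, not the full schema from the Description, and `validate_dummy` is a helper name of my own invention; it accepts any frame-like object so it works with a pandas DataFrame loaded from dummy_dataset.parquet.

```python
# Columns the download script earlier in this thread depends on (an
# assumed minimum, not the full dataset schema).
REQUIRED_COLUMNS = ["full_name", "mainbranch", "description", "uuid"]

def validate_dummy(df):
    """Raise ValueError if df lacks a required column or has no rows.

    df is any frame-like object exposing .columns and __len__,
    e.g. a pandas DataFrame read from dummy_dataset.parquet.
    """
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    if len(df) == 0:
        raise ValueError("dummy dataset must contain at least one row")
    return True
```

With pandas available, the actual test would call `validate_dummy(pd.read_parquet("dummy_dataset.parquet"))`.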
Give an example of the columns and data: