CarperAI / Code-Pile

This repository contains all the code for collecting large scale amounts of code from GitHub.
MIT License

gitlab #4

Open dmahan93 opened 1 year ago

dmahan93 commented 1 year ago

Title

Dataset URL - here

Does the dataset exist in a scraped format? No. URL if Yes - [here]()

Description

GitLab, like GitHub, but not in BigQuery.

Procedure

dmahan93 commented 1 year ago

https://docs.gitlab.com/ee/api/

It has an API available, including commit diffs: https://docs.gitlab.com/ee/api/commits.html#get-the-diff-of-a-commit
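
For illustration, a minimal sketch of hitting that diff endpoint with requests (the project id, commit sha, and token below are placeholders, not values from this issue):

import requests

PROJECT_ID = 12345           # hypothetical project id
SHA = "abcdef0123456789"     # hypothetical commit sha

resp = requests.get(
    f"https://gitlab.com/api/v4/projects/{PROJECT_ID}/repository/commits/{SHA}/diff",
    headers={"PRIVATE-TOKEN": "<YOUR_TOKEN>"},
)
resp.raise_for_status()
for file_diff in resp.json():    # one entry per changed file
    print(file_diff["new_path"])
    print(file_diff["diff"])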

imr555 commented 1 year ago

Is anyone working on this?

I've currently started working on it.

I'm using python-gitlab (https://python-gitlab.readthedocs.io/en/v3.9.0/index.html) for this. It's a Python wrapper around the official GitLab API.

Currently in progress:

I'm currently running a preliminary Python script that retrieves 89 keys for every public repository and saves them all to a CSV. It might take some time, as there are a lot of repositories on GitLab. (I ran a preliminary test on 20 repositories to check that this works. It passed, but I found an issue: the API call doesn't always return all 89 keys; sometimes some keys are missing. This is a limitation of the gl.projects.list() call. It won't be a limitation in the final script, since we'll be using gl.projects.get() then, once we have all the GitLab project ids.)
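
A minimal sketch of what that preliminary listing step could look like with python-gitlab (the token and output file name are placeholders, not the real script):

import csv
import gitlab  # pip install python-gitlab

gl = gitlab.Gitlab("https://gitlab.com", private_token="<YOUR_TOKEN>")  # hypothetical token

# Stream public projects page by page instead of loading everything into memory.
projects = gl.projects.list(visibility="public", iterator=True, per_page=50)

writer = None
with open("projects_preview.csv", "w", newline="") as f:
    for project in projects:
        attrs = project.attributes  # dict of whatever keys this listing call returned
        if writer is None:
            writer = csv.DictWriter(f, fieldnames=sorted(attrs))
            writer.writeheader()
        writer.writerow({k: attrs.get(k) for k in writer.fieldnames})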

Game plan after acquiring the preliminary CSV:

We believe this will let us filter out public repositories with a low star_count, much like https://github.com/EleutherAI/github-downloader does, by setting a certain threshold. After that, we can use the python-gitlab API with the GitLab project ids of the relevant repositories from the CSV to collect commit diffs, commits, issues, and notes.
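
Something along these lines, using python-gitlab with a project id taken from the filtered CSV (the id and token are placeholders):

import gitlab

gl = gitlab.Gitlab("https://gitlab.com", private_token="<YOUR_TOKEN>")
project_id = 12345                      # hypothetical id from the filtered CSV
project = gl.projects.get(project_id)   # full project object, all keys present

# Commit diffs
for commit in project.commits.list(iterator=True):
    diff = commit.diff()                # list of per-file diff dicts

# Issues and their notes (comments)
for issue in project.issues.list(iterator=True):
    notes = issue.notes.list(iterator=True)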

Would appreciate your feedback on this, @dmahan93 @ncoop57 .....

The 89 keys are:

gitlab_api_attr = ['id', 'description', 'name', 'name_with_namespace', 'path', 'path_with_namespace', 'created_at', 'default_branch', 'tag_list', 'topics', 'ssh_url_to_repo', 'http_url_to_repo', 'web_url', 'readme_url', 'avatar_url', 'forks_count', 'star_count', 'last_activity_at', 'namespace', 'container_registry_image_prefix', '_links', 'packages_enabled', 'empty_repo', 'archived', 'visibility', 'owner', 'resolve_outdated_diff_discussions', 'container_expiration_policy', 'issues_enabled', 'merge_requests_enabled', 'wiki_enabled', 'jobs_enabled', 'snippets_enabled', 'container_registry_enabled', 'service_desk_enabled', 'can_create_merge_request_in', 'issues_access_level', 'repository_access_level', 'merge_requests_access_level', 'forking_access_level', 'wiki_access_level', 'builds_access_level', 'snippets_access_level', 'pages_access_level', 'operations_access_level', 'analytics_access_level', 'container_registry_access_level', 'security_and_compliance_access_level', 'emails_disabled', 'shared_runners_enabled', 'lfs_enabled', 'creator_id', 'import_status', 'open_issues_count', 'ci_default_git_depth', 'ci_forward_deployment_enabled', 'ci_job_token_scope_enabled', 'ci_separated_caches', 'ci_opt_in_jwt', 'ci_allow_fork_pipelines_to_run_in_parent_project', 'public_jobs', 'build_timeout', 'auto_cancel_pending_pipelines', 'ci_config_path', 'shared_with_groups', 'only_allow_merge_if_pipeline_succeeds', 'allow_merge_on_skipped_pipeline', 'restrict_user_defined_variables', 'request_access_enabled', 'only_allow_merge_if_all_discussions_are_resolved', 'remove_source_branch_after_merge', 'printing_merge_request_link_enabled', 'merge_method', 'squash_option', 'enforce_auth_checks_on_uploads', 'suggestion_commit_message', 'merge_commit_template', 'squash_commit_template', 'auto_devops_enabled', 'auto_devops_deploy_strategy', 'autoclose_referenced_issues', 'keep_latest_artifact', 'runner_token_expiration_interval', 'external_authorization_classification_label', 'requirements_enabled', 'requirements_access_level', 'security_and_compliance_enabled', 'compliance_frameworks', 'permissions']

ncoop57 commented 1 year ago

@imr555 Looks awesome and you are the first person working on it! Thank you for contributing!! The game plan makes sense, especially filtering by stars, though I don't think we should set this too high, maybe keep repos that are > 2 stars. We should also filter based on license type, which might require checking the LICENSE file and parsing it as it doesn't look like there are any keys with that info :(

Does this Python library use an API key to make calls? If so, we need to catalog how many requests we can make in a set interval so we don't get blocked.

imr555 commented 1 year ago

@ncoop57

Yeah. I still haven't started on the game plan yet. We can start on the star_count filtering once we have the CSV.

I just called the gl.projects.list() API, which claims it can return all public repos under a given authentication key (personal access token). I am using an access token created with the scopes described here: https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html#personal-access-token-scopes. (According to the official GitLab API documentation, the limit should be around 50000 pages, or 2500000 repos at 50 repos per page.) (I have the script running on an instance I have access to. Let's see how it goes.)

And of course, LICENSE details are important. I will start working on that once I get the CSV.

(For licensing and star_count, I will try to get those working on the small 20-repo test CSV first.)

On Discord, my name's Ifty if you want to ask me anything there.

Edit: Saving this as a resource on good licenses: https://about.gitlab.com/handbook/engineering/open-source/

ncoop57 commented 1 year ago

Feel free to open a PR with your current script (make sure not to include your API key :) ). That way people can easily contribute before the working branch gets too far away from your local code.

imr555 commented 1 year ago

Thank You. I will open a PR on the working branch then.

imr555 commented 1 year ago

@ncoop57

I will open a PR by the end of today.

Just updating on a few facts before that.

It seems the python-gitlab wrapper does not do well at scale; it can only process about 400 requests per minute.

So I hacked together a solution using popular concurrent Python libraries, namely asyncio and aiohttp. (Side note: JavaScript or Rust might be better suited for this, but I'm not well versed in them, so I took the Python route.)

The hacked-together solution manages to perform roughly 18000 requests per minute concurrently. (Side note: there seems to be no limit on the number of requests to the official GitLab API, other than that it only accepts about 1000 concurrent requests, hence the 18000/minute ceiling we haven't been able to break.)

At present, I am saving the details of each repository (project), based on the list of 89 keys in https://github.com/CarperAI/Code-Pile/issues/4#issuecomment-1253481991, as individual JSON files. (This keeps the run resumable, and there are some line-ending issues if we save them straight into a CSV.)
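
A rough sketch of that concurrent fetch-and-save pattern with aiohttp (the concurrency cap, token, and id range are assumptions, not the exact values from the real script):

import asyncio
import json
import os

import aiohttp

SEM = asyncio.Semaphore(1000)  # assumed concurrency cap discussed above
HEADERS = {"PRIVATE-TOKEN": "<YOUR_TOKEN>"}

async def fetch_project(session, project_id):
    # Skip ids we already saved, so the run stays resumable.
    out_path = f"data/{project_id}.json"
    if os.path.exists(out_path):
        return
    async with SEM:
        async with session.get(f"https://gitlab.com/api/v4/projects/{project_id}",
                               headers=HEADERS) as resp:
            if resp.status != 200:      # private, deleted, or missing project ids
                return
            data = await resp.json()
    with open(out_path, "w") as f:      # one JSON file per repo
        json.dump(data, f)

async def main(project_ids):
    os.makedirs("data", exist_ok=True)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch_project(session, pid) for pid in project_ids))

asyncio.run(main(range(1, 10_000)))     # hypothetical id range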

We can create a CSV by concatenating all the JSONs once the script finishes running. (After that, we will have all the valid project ids, and can then start filtering them based on requirements like licenses and star count.)
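
Concatenating the JSONs into a CSV afterwards is then only a few lines (file paths are assumptions):

import glob
import json

import pandas as pd

records = []
for path in glob.glob("data/*.json"):    # one JSON file per repo, as above
    with open(path) as f:
        records.append(json.load(f))

pd.DataFrame(records).to_csv("gitlab_repos.csv", index=False)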

imr555 commented 1 year ago

@ncoop57

Last update before actually uploading the script.

You were right. There is a rate limit on GitLab requests: GitLab allows at most about 2000 requests per minute. If we request more than that, we get a 429 (Too Many Requests). So it might be good to get multiple private tokens and run it from different machines to scale it up.
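
A hedged sketch of how a 429 could be handled in the async client (the retry count and the Retry-After fallback are assumptions):

import asyncio

import aiohttp

async def get_json_with_backoff(session, url, max_retries=5):
    # Retry on HTTP 429, honoring a Retry-After header if GitLab sends one.
    for attempt in range(max_retries):
        async with session.get(url) as resp:
            if resp.status != 429:
                resp.raise_for_status()
                return await resp.json()
            wait = float(resp.headers.get("Retry-After", 2 ** attempt))
            await asyncio.sleep(wait)
    raise RuntimeError(f"still rate limited after {max_retries} attempts: {url}")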

ncoop57 commented 1 year ago

@imr555 Makes sense. Could you also add a cool-down to your script that does, say, 90% of the allowed requests and then waits out the necessary timeout before restarting?

imr555 commented 1 year ago

@ncoop57

Cool idea.

We added that. With asyncio it's actually quite simple: if we add a 29-second wait (asyncio.sleep(29)), it performs a thousand requests per half minute (29 seconds of waiting, 1 second for requests and processing), so we manage to stay at about 2000 requests per minute. We will update this once our post-processing takes more time.
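
A minimal sketch of that throttling idea (the 2000/minute budget comes from the thread; fetch_project is the hypothetical per-id coroutine from the earlier sketch):

import asyncio
import time

async def run_throttled(fetch_one, project_ids, per_minute=2000, safety=0.9):
    # Fire ~90% of the per-minute budget, then sleep out the rest of the window.
    budget = int(per_minute * safety)
    for start in range(0, len(project_ids), budget):
        t0 = time.monotonic()
        batch = project_ids[start:start + budget]
        await asyncio.gather(*(fetch_one(pid) for pid in batch))
        remaining = 60 - (time.monotonic() - t0)
        if remaining > 0:
            await asyncio.sleep(remaining)

# Usage (hypothetical): asyncio.run(run_throttled(lambda pid: fetch_project(session, pid), all_ids))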

We are still collecting the names of the high-star repos. Once we are done, we can use the scripts to download the necessary data.

PhungVanDuy commented 1 year ago

@imr555 do you have any update on this? Maybe I can help if you need it; GitLab data is quite important for our scale.

imr555 commented 1 year ago

@ncoop57 @PhungVanDuy So we ran the GitLab script. It captured the base info for the 38 keys mentioned in https://github.com/CarperAI/Code-Pile/issues/4#issuecomment-1253481991.

There are around 694400 public repos, it seems. The script searched through 8353643 repos in total, and it appears there are no more repos after that.

So, we think we can move forward with gathering what is required from those repos. Please let us know what we should download from each repo. Thanks.

PhungVanDuy commented 1 year ago

I think it is code/license files, if we can get them from the repo, and also GitLab diffs as in this issue (https://github.com/CarperAI/Code-Pile/issues/31). @ncoop57 please consider this one and my Bitbucket data at https://github.com/CarperAI/Code-Pile/issues/5#issuecomment-1264563142.

PhungVanDuy commented 1 year ago

> So we ran the GitLab script. It captured the base info for the 38 keys mentioned in #4 (comment). [...]

Great progress, thank you so much. Can you share your script as well?

imr555 commented 1 year ago

Yeah sure. @PhungVanDuy

I can give you a rundown if necessary.

Apologies, I totally forgot about the Discord message.

Apparently it wouldn't let me upload a .py file, so I zipped it: gitlab_scr_pagi_E_R.zip

Also, lemme know if I should upload the repo details somewhere. It's around 2.9GB. I saved the details as JSONs, one per repo. (Easy to convert to a CSV if needed.)

Best regards :)

ncoop57 commented 1 year ago

@imr555 thank you so much for this!! For the repo details, please upload them to s3 or if you don't have access to s3, maybe gdrive and @PhungVanDuy can move it to s3 once he gets permission.

Uh, weird that you can't upload your script. What is the issue you are getting when trying to push the file? I'd like to have this work as a PR when you get a chance, with some unit tests and dummy data to evaluate the script against for reproducibility. If you open a PR, I can review and help contribute to it.

imr555 commented 1 year ago

@ncoop57 , @PhungVanDuy

I will open a PR and push the script then, with a few dummy tests as you said. Actually, I am on the road for the next two days for some personal stuff, so I don't have access to my laptop. I will upload the data and open a PR after I am back.

Apologies for the inconvenience

imr555 commented 1 year ago

@ncoop57 @PhungVanDuy

Link: https://drive.google.com/drive/folders/1j9MC0dbmP0oE1rrefzF7ULEMw9OoG9Pu?usp=sharing

The data.zip folder in the link contains the repo details of each public repo.

gitlab_scr_pagi.py was the script used to generate the data. The code's a bit of a mess. I will refactor it and send a PR.

By the way, if you guys need help on anything else or any other data sources(issues), please let me know. Have a great day. :)

PhungVanDuy commented 1 year ago

> Link: https://drive.google.com/drive/folders/1j9MC0dbmP0oE1rrefzF7ULEMw9OoG9Pu?usp=sharing [...]

Great! I will check it.

CamdenClark commented 1 year ago

I made some progress on this today.

  1. I filtered the data in the data.zip provided above to get all repos with 2 or more stars. This ends up being about 10000 rows.
  2. I found this endpoint (https://docs.gitlab.com/ee/api/projects.html#get-single-project) which allows you to pull down license information for an individual project. There are 2397 repos on GitLab with 2 or more stars that are permissively licensed.
  3. I then downloaded about a tenth of these to my local machine to test that cloning is working as expected.

Tomorrow I plan to implement the processing of the individual cloned repos into at least a parquet file. I'm very likely going to rip the process from https://github.com/EleutherAI/github-downloader/blob/master/download_repo_text.py, but will probably avoid multithreading as the amount of data here is relatively small. (A rough sketch of that parquet step is at the end of this comment.)

Showing my work

import json
import os
import requests
import pandas as pd
import shutil

# Collect metadata for repos with 2+ stars from the per-repo JSON files in data/
repos = []
for root, dirs, files in os.walk("data"):
    for repo_filename in files:
        with open("data/" + repo_filename, "r") as f:
            repo = json.loads(f.read())
            if (repo["star_count"] > 1):
                repos.append({"id": repo["id"],
                              "path_with_namespace": repo["path_with_namespace"],
                              "http_url_to_repo": repo["http_url_to_repo"],
                              "star_count": repo["star_count"],
                              "visibility": repo["visibility"]})

repos_df = pd.DataFrame(repos)
repos_df.to_csv("all_repos.csv")

def get_gitlab_project_license(project_id):
    # Ask the single-project endpoint for license info; return the license name or None.
    try:
        response = requests.get("https://gitlab.com/api/v4/projects/" + str(project_id), params = {"license": True}).json()
        if ("license" in response.keys() and response["license"] is not None):
            return response["license"]["name"]
    except Exception:  # network errors or non-JSON responses: treat as "no license found"
        return None
    return None

# Split repos into those with a detected license and those without (or that errored)
licensed_repos = []
failed_repos = []
for repo in repos:
    license = get_gitlab_project_license(repo["id"])
    if license:
        repo["license"] = license
        licensed_repos += [repo]
    else:
        failed_repos += [repo]

licensed_repos_df = pd.DataFrame(licensed_repos)
licensed_repos_df.to_csv("licensed_repos.csv")

failed_repos_df = pd.DataFrame(failed_repos)
failed_repos_df.to_csv("failed_repos.csv")

# Permissive licenses to keep, matched against the names GitLab's license detection returns
license_allowlist = set(["MIT License", 
                         "Apache License 2.0", 
                         "BSD 3-Clause \"New\" or \"Revised\" License", 
                         "Mozilla Public License 2.0", 
                         "BSD 2-Clause \"Simplified\" License", 
                         "ISC License", 
                         "The Unlicense", 
                         "Creative Commons Zero v1.0 Universal", 
                         "Eclipse Public License 1.0", 
                         "Artistic License 2.0", 
                         "BSD 3-Clause Clear License"])

permissively_licensed_repos = licensed_repos_df[licensed_repos_df["license"].isin(license_allowlist)]
permissively_licensed_repos.to_csv("permissively_licensed_repos.csv")

if 'output' not in os.listdir():
    os.makedirs('output')

def download_gitlab_repo(repo):
    # Shallow-clone a single branch, then drop .git so only the working tree is kept.
    file_name = repo["path_with_namespace"].split("/")[-1]
    if file_name not in os.listdir("output/"):
        os.system(f'git clone --depth 1 --single-branch {repo["http_url_to_repo"]} output/{file_name}')
        shutil.rmtree(f'output/{file_name}/.git', ignore_errors=True)
    else:
        print(f'Already downloaded {repo["http_url_to_repo"]}')

for index, repo in permissively_licensed_repos.iterrows():
    download_gitlab_repo(repo)

Two files uploaded directly to GitHub: one is all repos with 2 or more stars (all_repos.csv) and the other is all permissively licensed repos with 2 or more stars (permissively_licensed_repos.csv).

permissively_licensed_repos.csv all_repos.csv
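
A rough sketch of the planned parquet step over the cloned output/ directory (the extension filter is an assumption, and pyarrow or fastparquet is needed for to_parquet):

import os

import pandas as pd

TEXT_EXTENSIONS = {".py", ".js", ".ts", ".go", ".rs", ".java", ".c", ".cpp", ".md", ".txt"}  # assumed filter

rows = []
for repo_name in os.listdir("output"):
    repo_dir = os.path.join("output", repo_name)
    for root, _dirs, files in os.walk(repo_dir):
        for name in files:
            if os.path.splitext(name)[1] not in TEXT_EXTENSIONS:
                continue
            path = os.path.join(root, name)
            try:
                with open(path, "r", encoding="utf-8", errors="ignore") as f:
                    text = f.read()
            except OSError:
                continue
            rows.append({"repo": repo_name,
                         "path": os.path.relpath(path, repo_dir),
                         "text": text})

pd.DataFrame(rows).to_parquet("gitlab_code.parquet")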

PhungVanDuy commented 1 year ago

> I made some progress on this today. [...]

It looks great! Did you have any problems with the GitLab API rate limits? Let me know if you need any help from us, e.g. when you run the batch download we can set it up on our server. Thank you so much for your work.

I just have a question: was permissively_licensed_repos.csv run on all the repos that we have? >5000 permissively licensed repos seems quite small.

CamdenClark commented 1 year ago

The GitLab API was slow, but the order of magnitude of the data is small enough to make it work. I'll shoot you a message tomorrow to get set up for the batch download.

Happy to help, this is an incredible project!

CamdenClark commented 1 year ago

permissively_licensed_repos.csv is all repos with permissive licenses and 2 or greater stars. There are very few repos with 2 or greater stars alone (only around 10k).

PhungVanDuy commented 1 year ago

> permissively_licensed_repos.csv is all repos with 2 or greater stars. there are very few repos with 2 or greater stars!

Can you try to run it on all repos regardless of stars? In terms of user behavior, GitLab is quite different from GitHub.

CamdenClark commented 1 year ago

Yeah I can set that up tonight.

imr555 commented 1 year ago

@CamdenClark, first of all I want to say exceptionally good work on finding the licenses and filtering the repos.

As @PhungVanDuy said, it would be great if you could download all repos regardless of stars. I believe @ncoop57 told me to do that too, as GitLab is very different from GitHub.

Likewise, it would be great if you were able to download committed code, code diffs, issues, and repo branches from the links in the single-project endpoint. I believe the links for issues, code diffs, and commits can be found in the metadata for each repo, or project as they are called in GitLab.

CamdenClark commented 1 year ago

Thanks all! I refactored my script that scrapes all the GitLab repo license info to use aiohttp, which I think makes it possible to do within 24 hours. I will run the script today.

Today I plan to organize these into real scripts that we could run sequentially, and scrape the issue data along with the repo code itself.

I'm wondering specifically what commit and diff data we are looking for. Do we want the text content of each diff or just commit messages?

Note: I haven't started on integrating this into the framework that you have for downloading; I need to spend more time figuring out how that works.

CamdenClark commented 1 year ago

So I was looking into getting all diffs from GitLab as well, and I think API calls are the wrong way to go about this.

Let's say there are about 300k repositories with permissive licensing, and each repository has on average 10 commits (pretty conservative).

That means we need to do 1 API call to list commits and 10 more to get the diff of each commit, per repository. That results in ~3M API calls in total (300k × 11 ≈ 3.3M).

That's about 3 days just to pull the diffs. And we don't get the individual content of each file before and after the commit -- that's just the visual diff that's shown -- so we can't get the full file contents without extra processing.

--

I suggest the path forward is to clone the repositories with the main-branch git history (git clone --single-branch) and use local git commands in a script to get the individual file content. I think this will be much faster and will let us be better citizens and not overload GitLab's API with requests.

Thoughts? I'll have a prototype tomorrow.
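
As a rough sketch of that local-git approach (paths are placeholders; this assumes the clone kept branch history rather than using --depth 1):

import subprocess

def iter_commit_patches(repo_dir):
    # List commit hashes on the cloned branch (newest first), then dump each commit's patch locally.
    hashes = subprocess.run(["git", "-C", repo_dir, "log", "--format=%H"],
                            capture_output=True, text=True, check=True).stdout.split()
    for sha in hashes:
        patch = subprocess.run(["git", "-C", repo_dir, "show", "--format=%H %s", sha],
                               capture_output=True, text=True, check=True).stdout
        yield sha, patch

# Usage (hypothetical path):
# for sha, patch in iter_commit_patches("output/some-repo"):
#     ...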

reshinthadithyan commented 1 year ago

Circling back, any updates @CamdenClark, @imr555? Let's start by collecting the latest commit on the master branch. We can figure out diffs later.

ncoop57 commented 1 year ago

Honk honk (we are addicted to geese memes at Carper) y'all, thanks so much for such great work! @imr555 @CamdenClark could y'all work on a PR for getting the code from the latest commit as @reshinthadithyan discussed for now? We are pushing up against a deadline and wanna make sure we get this data source in.

CamdenClark commented 1 year ago

> could y'all work on a PR for getting the code from the latest commit as @reshinthadithyan discussed for now? [...]

Yes I can! I am off work today and tomorrow, so I can have something pulled together.

ncoop57 commented 1 year ago

awesome!!! 😎