TSELab / guac-alytics

A series of tools and resources to better understand the risk profile of open source software ecosystems
Apache License 2.0

Optimize Cloning Repositories Script #47

Closed SahithiKasim closed 11 months ago

SahithiKasim commented 1 year ago

Optimize the repository cloning script for efficiency and explore faster cloning methods. Try parallel cloning or any other approach to reduce time. Also improve error handling where needed.

I am attaching the script as a .txt file; please convert it to .py. mine_repos.txt
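One way to approach the parallel-cloning request is a thread pool over `git clone` subprocesses. The sketch below is illustrative only (the URLs, worker count, and helper names are not from mine_repos.txt); it also collects failures instead of crashing, which addresses the error-handling point:

```python
# Hypothetical sketch: parallel shallow clones with a thread pool.
# Worker count and helper names are illustrative, not from mine_repos.txt.
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

def dest_for(url):
    """Derive a local directory name from a clone URL."""
    return url.rstrip("/").split("/")[-1].removesuffix(".git")

def clone_repo(url):
    """Shallow-clone one repo; return (url, ok, stderr)."""
    result = subprocess.run(
        ["git", "clone", "--depth", "1", url, dest_for(url)],
        capture_output=True, text=True,
    )
    return url, result.returncode == 0, result.stderr.strip()

def clone_all(urls, max_workers=4):
    """Clone repos in parallel, collecting failures instead of aborting."""
    failures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for fut in as_completed([pool.submit(clone_repo, u) for u in urls]):
            url, ok, err = fut.result()
            if not ok:
                failures.append((url, err))
    return failures
```

Cloning is I/O-bound, so threads (rather than processes) should be enough; the `max_workers` value would need tuning against the rate limits discussed below.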

VinhPham2106 commented 1 year ago

@SahithiKasim is there a way that I can monitor CPU and RAM usage during the cloning process on the tower?

SahithiKasim commented 1 year ago

@VinhPham2106 you can use the `htop` command.
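`htop` is interactive; if you want to log usage over a multi-hour run instead, a small stdlib-only snapshot helper works too. This is a Linux-only sketch (it reads `/proc/meminfo`; `os.getloadavg` is POSIX), not something from the repo:

```python
# Linux-only sketch: log load average and available RAM without htop.
import os

def usage_snapshot():
    """Return (1-minute load average, available RAM in MiB)."""
    load1, _, _ = os.getloadavg()
    mem_kib = 0
    with open("/proc/meminfo") as fh:
        for line in fh:
            if line.startswith("MemAvailable:"):
                mem_kib = int(line.split()[1])  # value is in KiB
                break
    return load1, mem_kib // 1024
```

Calling this periodically from the cloning script and appending to a log file gives a usage trace you can inspect after the run.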

VinhPham2106 commented 12 months ago

@SahithiKasim I'm seeing core_periphery.py running for several hours now and taking all the CPU. When can I test the cloning?

SahithiKasim commented 12 months ago

@VinhPham2106 You can proceed with the cloning code. There will always be something running in the background, so there's no need to be concerned about it.

VinhPham2106 commented 12 months ago

@SahithiKasim The multithreading code, results, and problems are on the attached branch. Can you check it out and try to replicate the results when you have time?

SahithiKasim commented 12 months ago

@absol27 can you take a look at it and see if multithreading speeds up your cloning process?

absol27 commented 12 months ago

@VinhPham2106 did you test locally whether it speeds up the process? Do you or @SahithiKasim have a token you could use?

  1. Even with the single-thread approach I sometimes run into HTTP 429 (Too Many Requests) from either tracker.debian or salsa.debian; this blocks the script for some time and requires a restart.
  2. More than the computation, the speed issue is due to cloning the repos from salsa; see if you can optimize what I am doing right now. Currently I am cloning each repo twice: STEP-1 clones once to fetch the name of the repo's main branch and the depth that actually matters to us (using the date range 2017–2022); STEP-2 clones again using that branch and depth. See if you can "pack" the repo after the first clone to avoid a second clone.
  3. One alternative is pooling salsa tokens and switching if one of them is rate-limited. If the rate limiting is at the network (IP) level this won't work, but you could try it.

TL;DR: with multithreading, the rate limit will be reached more quickly for the same number of repos. Also keep in mind that disk space is an issue right now, so optimize for space whatever the approach.
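For points 1 and 3 together, a retry wrapper that rotates through a token pool and backs off exponentially on 429 would let the script recover instead of requiring a manual restart. Everything here is hypothetical scaffolding (`fetch`, `RateLimited`, and the delays are assumptions, not the project's API):

```python
# Illustrative sketch: rotate tokens and back off on HTTP 429 instead of
# aborting. `fetch` is any caller-supplied callable that raises RateLimited.
import itertools
import time

class RateLimited(Exception):
    """Raised by the caller's fetch function on HTTP 429."""

def with_token_pool(fetch, tokens, max_tries=5, base_delay=1.0):
    """Call fetch(token); on 429, switch token and back off exponentially."""
    pool = itertools.cycle(tokens)
    for attempt in range(max_tries):
        token = next(pool)
        try:
            return fetch(token)
        except RateLimited:
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError(f"all tokens rate-limited after {max_tries} tries")
```

As noted above, if salsa rate-limits by IP rather than by token, rotating tokens won't help, but the backoff alone should still reduce the manual restarts.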