commercetest / nlnet

Analysis of the open-source codebases of NLnet-sponsored projects.
MIT License

Speed up git clones #63

Closed julianharty closed 1 month ago

julianharty commented 2 months ago

Context

Git clones include the full git history of each repo. For the moment we don't need or use this information as we're only querying the latest snapshot (the latest commit on the default branch), so we can consider ways to speed up the clones.

Proposal

According to https://github.blog/2020-12-21-get-up-to-speed-with-partial-clone-and-shallow-clone/, `git clone --filter=tree:0 <url>` (a treeless clone) would be a good first step to try.
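The difference between the two strategies can be sketched against a local stand-in repository (the paths and repo below are illustrative; a shallow clone keeps only the latest commit, while a treeless clone keeps the full commit history but fetches trees and blobs on demand):

```shell
#!/bin/sh
# Sketch comparing the clone strategies under discussion, using a small
# local repository as a stand-in for a real remote.
set -e
work=$(mktemp -d)
cd "$work"

# Create a repository with two commits to clone from.
git init -q upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first commit"
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "second commit"

# Shallow clone: only the latest commit is present locally.
git clone -q --depth=1 "file://$work/upstream" shallow
echo "shallow history: $(git -C shallow rev-list --count HEAD) commit(s)"

# Treeless clone: all commits are present; trees/blobs arrive on demand.
# The server side must allow filters (uploadpack.allowFilter=true),
# otherwise git silently falls back to a full clone.
git -C upstream config uploadpack.allowFilter true
git clone -q --filter=tree:0 "file://$work/upstream" treeless
echo "treeless history: $(git -C treeless rev-list --count HEAD) commit(s)"
```

For the current snapshot-only queries both clones look the same at HEAD; the treeless clone only pays off once history is needed.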

Notes

Eventually we may mine the historical data in the repos. If and when we do, we may want to revert to making full clones.

tnzmnjm commented 1 month ago

`git clone --depth=1` (a shallow clone) is used in our script, and it provides the information we need at the moment (counting test files, fetching the last commit hash). However, it might not be a good long-term option, as we may need the commit history in the future.
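Both queries the script currently makes can be satisfied from a depth-1 clone, roughly as follows (a hedged sketch: the stand-in repo, paths, and the `test_*.py` naming convention are assumptions, not the script's actual logic):

```shell
#!/bin/sh
# Sketch of the two queries needed from a shallow clone:
# the last commit hash and a count of test files.
set -e
work=$(mktemp -d)
cd "$work"

# Stand-in repository containing one hypothetical test file.
git init -q upstream
mkdir upstream/tests
echo "assert True" > upstream/tests/test_example.py
git -C upstream add tests
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "add a test"

# A depth-1 clone is enough for both queries.
git clone -q --depth=1 "file://$work/upstream" snapshot

# Last commit hash on the default branch.
git -C snapshot rev-parse HEAD

# Count test files, skipping git's own metadata directory.
find snapshot -name '.git' -prune -o -name 'test_*.py' -print | wc -l
```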

Would you like me to change the cloning method and use treeless cloning instead?


julianharty commented 1 month ago

For the moment I don't think we need to change the cloning method. Let's revisit this once we've managed to get the filenames (including their path, repo, and source, etc.) into a DataFrame or equivalent database. By then we'll probably have a better understanding of which cloning method provides us with the information we need for the data analysis we're planning to do.

I'll close this issue for now. We can reopen it, or reference it if we decide we'd like to revise the method we use.