commercetest / nlnet

Analysis of the open-source codebases of NLnet-sponsored projects.
MIT License

Speed up git clones #63

Closed julianharty closed 1 month ago

julianharty commented 2 months ago

Context

Git clones include the full git history of each repo. For the moment we don't need or use this information as we're only querying the latest snapshot (the latest commit on the default branch), so we can consider ways to speed up the clones.

Proposal

According to https://github.blog/2020-12-21-get-up-to-speed-with-partial-clone-and-shallow-clone/, `git clone --filter=tree:0 <url>` (a treeless clone) would be a good first step to try.
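The difference between the two strategies can be sketched against a local stand-in repository (the paths and repo below are illustrative; a shallow clone keeps only the latest commit, while a treeless clone keeps the full commit history but fetches trees and blobs on demand):

```shell
#!/bin/sh
# Sketch comparing the clone strategies under discussion, using a small
# local repository as a stand-in for a real remote.
set -e
work=$(mktemp -d)
cd "$work"

# Create a repository with two commits to clone from.
git init -q upstream
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first commit"
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "second commit"

# Shallow clone: only the latest commit is present locally.
git clone -q --depth=1 "file://$work/upstream" shallow
echo "shallow history: $(git -C shallow rev-list --count HEAD) commit(s)"

# Treeless clone: all commits are present; trees/blobs arrive on demand.
# The server side must allow filters (uploadpack.allowFilter=true),
# otherwise git silently falls back to a full clone.
git -C upstream config uploadpack.allowFilter true
git clone -q --filter=tree:0 "file://$work/upstream" treeless
echo "treeless history: $(git -C treeless rev-list --count HEAD) commit(s)"
```

For the current snapshot-only queries both clones look the same at HEAD; the treeless clone only pays off once history is needed.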

Notes

Eventually we may mine the historical data in the repos. If and when we do, we may want to revert to making full clones.

tnzmnjm commented 1 month ago

`git clone --depth=1` (a shallow clone) is used in our script, and it provides the information we need at the moment (counting test files, fetching the last commit hash). However, it might not be a good long-term option, as we may need the commit history in the future.
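Both queries the script currently makes can be satisfied from a depth-1 clone, roughly as follows (a hedged sketch: the stand-in repo, paths, and the `test_*.py` naming convention are assumptions, not the script's actual logic):

```shell
#!/bin/sh
# Sketch of the two queries needed from a shallow clone:
# the last commit hash and a count of test files.
set -e
work=$(mktemp -d)
cd "$work"

# Stand-in repository containing one hypothetical test file.
git init -q upstream
mkdir upstream/tests
echo "assert True" > upstream/tests/test_example.py
git -C upstream add tests
git -C upstream -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "add a test"

# A depth-1 clone is enough for both queries.
git clone -q --depth=1 "file://$work/upstream" snapshot

# Last commit hash on the default branch.
git -C snapshot rev-parse HEAD

# Count test files, skipping git's own metadata directory.
find snapshot -name '.git' -prune -o -name 'test_*.py' -print | wc -l
```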

Would you like me to change the cloning method and use treeless cloning instead?


julianharty commented 1 month ago

For the moment I don't think we need to change the cloning method. Let's revisit this once we've managed to get the filenames (including their path, repo, and source, etc.) into a DataFrame or equivalent database. By then we'll probably have a better understanding of which cloning method provides us with the information we need for the data analysis we're planning to do.

I'll close this issue for now. We can reopen it, or reference it if we decide we'd like to revise the method we use.