Justintime50 / github-archive

A powerful tool to concurrently clone, pull, or fork user and org repos and gists to create a GitHub archive.

Timeout and out-of-RAM issues on very large repos; possible to limit the number of parallel git calls? #22

Closed: 8465231 closed this issue 3 years ago

8465231 commented 3 years ago

I've been using this script a bit now and it is great for smaller repos, but I ran into a fairly consistent issue with very large repos such as LineageOS (around 400-500 GB and ~2,500 sub-repos).

It first produces a lot of timeout errors in the logs no matter how high I set the timeout. I also tried increasing the delay between calls to 20 seconds, but that makes the run take forever, and I still got timeout errors, so I canceled it.

Then it eventually causes my system, which has 64 GB of RAM, to run out of memory.

It appears to be starting an unlimited number of git processes; running top on the system shows git commands as far as I can scroll.

I think that limiting the number of active git calls would fix both issues. Is it possible to set the number of threads / git calls that run in parallel?
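For illustration, one common way to cap concurrency in Python is a fixed-size thread pool; below is a minimal sketch of that idea, not github-archive's actual code. The repo list, destination paths, and the cap of 4 are placeholder assumptions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 4  # placeholder cap on simultaneous git processes

def clone_repo(clone_url, dest_dir):
    """Clone a single repo; only MAX_WORKERS of these run at once."""
    subprocess.run(["git", "clone", clone_url, dest_dir], check=True, timeout=300)

# Placeholder list; in practice this would come from the GitHub API.
repos = [
    ("https://github.com/LineageOS/android.git", "repos/lineageos/android"),
    ("https://github.com/LineageOS/android_build.git", "repos/lineageos/android_build"),
]

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    futures = [pool.submit(clone_repo, url, dest) for url, dest in repos]
    for future in as_completed(futures):
        future.result()  # re-raise any clone error instead of silently dropping it
```

Anything beyond MAX_WORKERS simply waits in the pool's queue, so the number of live git processes (and their memory use) stays bounded no matter how many repos are listed.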

Justintime50 commented 3 years ago

Dang! Sounds like LineageOS needs to be optimized for git...

Limiting the number of open threads at once is probably a smart idea, thanks for submitting this! I can take a look over the coming weeks; I have a couple of other improvements I plan to make as well.

As a temporary workaround, I'd suggest manually cloning LineageOS and moving the cloned directory into the ~/github-archive/repos/lineageos folder. The next time you run GitHub Archive, it will simply pull changes from your last git operation instead of trying to clone it again.
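A rough sketch of that clone-vs-pull behavior (simplified and illustrative, not the tool's exact code): if the destination already contains a .git directory, only a pull runs, so a manually created clone is picked up as-is.

```python
import os
import subprocess

def archive_repo(clone_url, dest_dir, timeout=180):
    """Pull if the repo was already cloned (even manually), otherwise clone it."""
    if os.path.isdir(os.path.join(dest_dir, ".git")):
        # An existing clone only gets updated.
        subprocess.run(["git", "-C", dest_dir, "pull"], check=True, timeout=timeout)
    else:
        subprocess.run(["git", "clone", clone_url, dest_dir], check=True, timeout=timeout)
```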

Can you also clarify how many repos you are currently cloning/pulling with github-archive?

8465231 commented 3 years ago

In this case LineageOS was the only org it was trying to clone (it is an org, not an individual repo, I should point out; it is so large because it is a privacy-oriented Android fork that carries the code for many different phones). I am still testing and was trying to narrow down the timeout/crashing issues.

On a whim I set the delay for each call to 90 seconds and let it run; it did finish without errors, but naturally took a few days.

So limiting the number of threads would seem to solve these issues.

If it could also download the repos of the org's member users it would be perfect for my needs, but it still saves a lot of time, and I am looking for options to grab those user repos now.

Thanks for all your work on this.

Justintime50 commented 3 years ago

I worked up a proof of concept (I've never done this before) and it seems to work well. I'll incorporate it into the project soon: https://gist.github.com/Justintime50/3a415006006bffdeb3a78fa81b7856b4.

How many open concurrent threads do you feel is a good max? My initial uninformed answer would be something like 10.

8465231 commented 3 years ago

I am still very new to coding (with exactly zero experience in Python), so I can't help much with the raw code I'm afraid, but it's great that you are figuring it out!

The ideal max number of threads will vary a lot depending on how fast the system, internet connection, and drives are. I think ~4 would be a good default, and ideally it would be adjustable via an argument or variable.

For example, when downloading to an HDD it will not be able to handle nearly as many threads as when downloading to an SSD.

Maybe I should create a new ticket for this, but another issue I have run into a few times is "API limit reached"; it seems that GitHub has a 5,000 API call limit per 6 hours.

How many calls does the script make per repo?

Justintime50 commented 3 years ago

GitHub Archive doesn't make any API calls per repo. It makes a couple of API calls to fetch the list of repos and their info as one large list, then iterates over the results on your machine and runs vanilla git commands locally. If you're running into "API limit reached" errors, you may be doing other things with GitHub that are pushing you toward those limits.
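Roughly, that flow looks like the sketch below: a few paginated calls to GitHub's standard `GET /orgs/{org}/repos` endpoint build the list, then plain git does the heavy lifting per entry. The token handling and destination paths are illustrative assumptions, not the tool's exact code.

```python
import subprocess
import requests

def list_org_repos(org, token=None):
    """Fetch all repos for an org with a handful of paginated API calls."""
    headers = {"Authorization": f"token {token}"} if token else {}
    repos, page = [], 1
    while True:
        resp = requests.get(
            f"https://api.github.com/orgs/{org}/repos",
            headers=headers,
            params={"per_page": 100, "page": page},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        repos.extend(batch)
        page += 1
    return repos

# The API is only used for the listing; each repo is handled by local git commands.
for repo in list_org_repos("LineageOS"):
    subprocess.run(["git", "clone", repo["clone_url"], f"repos/{repo['name']}"])
```

Even an org with ~2,000 repos only costs on the order of 20 listing calls against the rate limit; the thousands of processes you saw in top were local git, not API traffic.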

I think I'm going to make 10 concurrent threads the default limit, but I'll provide a configuration option so it can be overridden and you can specify something like 4, as you mentioned above.
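For example, an override like that could be read from the environment at startup; the variable name below is purely hypothetical, just to show the shape of it:

```python
import os

DEFAULT_THREADS = 10
# Hypothetical setting name for illustration only; see the project's README for the real option.
max_workers = int(os.getenv("GITHUB_ARCHIVE_THREADS", DEFAULT_THREADS))
```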

8465231 commented 3 years ago

Interesting. The only thing I was using when I got the API error was this script, but it's possible the window had not reset yet and there were leftover calls from other scripts I was testing.

Justintime50 commented 3 years ago

I finally ran into timeout errors myself while trying to archive ~450 repos at once; a few towards the end started timing out. I believe limiting the threads will help, and I plan to work on that in the next few days.

8465231 commented 3 years ago

Yeah, I only really had the issue when doing very large orgs with hundreds of repos. Not sure why.

Justintime50 commented 3 years ago

I made the threading changes and tested locally, and this did not actually fix the problem (although it was a much-needed fix regardless). I cloned a large project, cpython, outside the context of this tool, and it took almost 3 minutes to clone that one repo, which is currently the default timeout for git operations in the GitHub Archive tool.

At the end of the day, I believe some repos take much longer to clone than others due to their git history and included assets (cpython is 385 MB, which is massive for a git repo), so I'll be bumping the default timeout to 5 minutes (this will be configurable).

In the meantime, without needing to wait for the next update, you can set the GITHUB_ARCHIVE_TIMEOUT environment variable to the number of seconds you'd like (currently 180; bump it up to whatever you need), which should get you around the timeout errors.
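Mechanically, that variable just ends up as the per-operation timeout on each git subprocess; a simplified sketch of the idea (the 180-second default comes from the comment above, everything else is illustrative):

```python
import os
import subprocess

# 180 seconds is the current default; override via the environment variable.
timeout = int(os.getenv("GITHUB_ARCHIVE_TIMEOUT", "180"))

try:
    subprocess.run(
        ["git", "clone", "https://github.com/python/cpython.git", "repos/cpython"],
        check=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired when exceeded
    )
except subprocess.TimeoutExpired:
    print("Clone exceeded the configured timeout; try a larger value.")
```

So exporting GITHUB_ARCHIVE_TIMEOUT=600 before running the tool would allow up to ten minutes per git operation.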

Justintime50 commented 3 years ago

Re-reading your initial issue, I'd venture to say that cloning a repo such as LineageOS via this tool is not advisable due to its size. I'd clone it separately, move it to the subdirectory of your archive from this tool, and then continue to pull in changes via the tool once it's been initially cloned. Alternatively you could set a really long timeout.

8465231 commented 3 years ago

> Re-reading your initial issue, I'd venture to say that cloning a repo such as LineageOS via this tool is not advisable due to its size. I'd clone it separately, move it to the subdirectory of your archive from this tool, and then continue to pull in changes via the tool once it's been initially cloned. Alternatively you could set a really long timeout.

The issue with LineageOS is that while each individual repo is not that bad, there are around 1,500-2,000 repos under the org; that's where the massive size and the issues come into play. While it is the largest org I have done so far, a lot of the orgs I tend to download are several hundred MB or even multiple GB in size across all their repos.

I tend to follow the giant-net rule: if this org/user has one thing I find useful, there's a good chance they have something else as well. Might as well download it all. I have 100 TB to fill, lol.

Still, the changes you are implementing should fix a lot of the issues, and if I do still get timeout errors I suppose I could always just run the clone again to pick up the missing parts?

Justintime50 commented 3 years ago

These changes have been merged and will be included in the next major release.