Closed bybor closed 2 years ago
Hey there,
--timeout
flag and assign it however many seconds you'd like the timeout to be instead. For something of that size you may want to have it at 10 or 15 minutes. This is a limitation of git and not github-archive..git
folder is partial, it gets downloaded before the rest of the repo assets. If it's still present, you can simply remove that folder completely and try cloning again.--pull
flag and it will pull in any changes since they were cloned. So if you wanted a complete up-to-date archive of all your repos you'd want to pass both --clone --pull
so it'll clone new things and pull updates to all of them.I'll keep this open for now, it should clean up those failed repos per the snippet I sent above. I wonder if on Windows that command isn't consistent with how it runs on Unix because it definitely should work on a Unix system.
I think that --pull
is what I actually needed, thank you for this answer!
If it helps, here is part of logs\github-archive.logs
file. I didn't change default timeout, but I set -th 1
because of another error to see if it resolves it.
2021-09-29 10:32:20,749 - INFO - # GitHub Archive started...
2021-09-29 10:32:20,750 - INFO - # Making API call to GitHub for starred repos...
2021-09-29 10:36:45,226 - INFO - Repo: Wilfred/difftastic clone success!
2021-09-29 10:40:52,583 - INFO - Repo: asamuzaK/withExEditorHost clone success!
2021-09-29 10:43:21,250 - INFO - Repo: farag2/Utilities clone success!
2021-09-29 10:46:21,141 - INFO - Repo: gildas-lormeau/SingleFile clone success!
2021-09-29 10:51:21,162 - ERROR - Git operation timed out archiving cyberduck.
2021-09-29 10:56:21,195 - ERROR - Git operation timed out archiving drawio.
2021-09-29 10:56:23,104 - INFO - Repo: joouha/euporie clone success!
2021-09-29 10:56:24,357 - INFO - Repo: junegunn/limelight.vim clone success!
2021-09-29 10:56:25,762 - INFO - Repo: junegunn/goyo.vim clone success!
2021-09-29 10:56:27,992 - INFO - Repo: junegunn/fzf.vim clone success!
2021-09-29 10:56:27,995 - INFO - GitHub Archive complete! Execution time: 0:24:07.245319.
The folder that timed out are still there:
cyberduck
I have a folder with .git
. and other folders.drawio
only .git
folder.The messages in console were different, here is a very short part of command line messages, it got cut by console window.
File "C:\python39\lib\shutil.py", line 613, in _rmtree_unsafe
_rmtree_unsafe(fullname, onerror)
File "C:\python39\lib\shutil.py", line 613, in _rmtree_unsafe
_rmtree_unsafe(fullname, onerror)
File "C:\python39\lib\shutil.py", line 613, in _rmtree_unsafe
_rmtree_unsafe(fullname, onerror)
File "C:\python39\lib\shutil.py", line 618, in _rmtree_unsafe
onerror(os.unlink, fullname, sys.exc_info())
File "C:\python39\lib\shutil.py", line 616, in _rmtree_unsafe
os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'c:\\git-archive\\repos\\jgraph\\drawio\\.git\\objects\\pack\\tmp_pack_2Uom54'
Repo: joouha/euporie clone success!
Repo: junegunn/limelight.vim clone success!
Repo: junegunn/goyo.vim clone success!
Repo: junegunn/fzf.vim clone success!
GitHub Archive complete! Execution time: 0:24:07.245319.
Actually, it's OK to close this issue. --pull
does the trick, and it's good to know that some repositories are huge even though relying on error message as indicator.
Thanks!
This is actually super helpful, thank you for sharing the output! With that in mind, looks like there is a bug where the process can't properly remove the bad folder. I'll take a look into this as it's definitely a bug. In the meantime, you could manually remove those directories and try again with a longer timeout. Glad you found the --pull
tag does what you need!
EDIT: I just tested again and found this is not a problem on Unix and must be specific to Windows. I'll do some testing on Windows and see if I can recreate and patch the problem on that OS.
I've worked a bit on this on Windows with no luck so far. I've tried to move the rm
failed folder logic to the finally
block hoping that once we released the thread we could successfully remove the folder but an error is still thrown saying a process is still using the .git
folder (appears to be some temporary file writing?) I then thought the subprocess isn't getting killed correctly and removed the shell=True
with no luck (although, this should probably still happen since we don't need to open a shell here).
If anyone is well versed in multi-threaded file IO operations on Windows, I'd love some assistance on this as I've traditionally built apps on Unix systems and don't have a lot of knowledge without deep diving into how to fix this Windows specific problem as this does not behave this way on Unix. Feel free to pick up this task and put up a PR! I'll continue to circle back as time permits.
Thank you, really appreciate you spending time on this.
I can think of some alternative approaches, but for me the most important outcome is to know that the error happened and I have to deal with it.
1) catch exception and log it without halting, suggest user to increase timeout for pulling the repository, tell that manual cleanup may be required. An error already happened (or we would not try to remove the working copy). Output "there were X errors, check logs" upon completion. If the error happens again on the next run b/c we couldn't remove the folder, it will be logged, and there will be "there were X errors, check logs" upon completion.
2) Try to remove a folder in a separate thread that will do 3 attempts to remove a folder with 1 second between attempts. It may still fail, so option 1) above could be applied. And no need to do it in a separate thread, actually - it's totally fine to make a pause to handle error. Cons against separate thread is that the main thread will need to wait for its completion.
I still think it's fine to close this bug accepting the fact that there are some limitations to what it can do. It could become a "Known issue".
I spent way too much time debugging this today and found the reason why it's failing on Windows - it's actually two fold.
shutil.rmtree
on a read-only directory gracefully and instead throws the error:
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process
As such, we need to tweak the remove_failed_dir
function to retry on failure of removal of the .git
folder by marking it writable:
def remove_failed_dir(self, path: str):
"""Removes a directory if it fails a git operation due to
timing out or other errors so it can be retried on the next run.
"""
def make_dir_writable(function, path, exception):
os.chmod(path, stat.S_IWRITE)
function(path)
if os.path.exists(path):
shutil.rmtree(path, onerror=make_dir_writable)
What I will most likely do is create a separate process that once all the git operations are complete, scans the github archive directory for folders that only contain a .git
folder (signifying the git operation failed as git folders remain while the rest of the folders don't on a failure) and clean them up instead of doing so immediately after a failed clone/pull attempt. This way we will have exited the thread and can take care of all cleanup at once instead of one-off when needed. This isn't ideal as it adds a layer of complexity to the project along with another full directory scan, but if we want to support Windows completely allowing for seamless retries on potential failures, it needs to be done. This process should also be backwards compatible with Unix systems and accomplish what is already possible on those platforms.
Another approach that is better would be to return the names of failed repos from each thread and build a list and only blow away those folders which would skip the need to rescan the entire archive, though it'd require a queue or thread manager to store those details.
@Justintime50 , thank you!
You can now take advantage of this bug fix in v4.5.0
!
I'm using below command line
github-archive.exe --https -s [user] -t [key] -l c:\git-archive --clone
What happens for me:
jgraph/drawio/.git
folder, with.git
being the only item underdrawio
.git
is consistent and valid, don't know how to checkgithub-archive
, it assumes that this repo is OKI probably wouldn't want to keep
drawio
locally, because local.git
folder was more than 700 MB, but it doesn't look quite right.And can
github-archive
"update" those cloned repositories? Is it possible to have up-to-date copy of those repositories?Thanks!