Justintime50 / github-archive

A powerful tool to concurrently clone, pull, or fork user and org repos and gists to create a GitHub archive.
MIT License

Error Removing Git Repos that Timed Out On Windows #38

Closed bybor closed 2 years ago

bybor commented 2 years ago

I'm using the following command line:

github-archive.exe --https -s [user] -t [key] -l c:\git-archive --clone

What happens for me:

I probably wouldn't want to keep drawio locally anyway, because the local .git folder was more than 700 MB, but it doesn't look quite right.

And can github-archive "update" those cloned repositories? Is it possible to keep an up-to-date copy of them?

Thanks!

Justintime50 commented 2 years ago

Hey there,

  1. Large repos can sadly time out due to their size. The tool's default timeout for a git operation is 5 minutes. You can pass the --timeout flag and set it to however many seconds you'd like instead; for something of that size you may want 10 or 15 minutes. This is a limitation of git, not github-archive.
  2. github-archive should remove the directory if it times out; are you saying it remained after timing out? These lines should clean up empty, failed git repos: https://github.com/Justintime50/github-archive/blob/main/github_archive/archive.py#L299-L300
  3. The .git folder is partial; it gets downloaded before the rest of the repo assets. If it's still present, you can simply remove that folder completely and try cloning again.
  4. github-archive can update those cloned repos: simply pass the --pull flag and it will pull in any changes since they were cloned. So if you want a complete, up-to-date archive of all your repos, pass both --clone --pull so it clones new repos and pulls updates for the rest.

I'll keep this open for now; it should clean up those failed repos per the snippet I sent above. I wonder if that command behaves differently on Windows than on Unix, because it definitely works on a Unix system.
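For reference, the timeout behavior described in point 1 can be sketched with Python's subprocess module (a hypothetical illustration, not github-archive's actual code; `run_with_timeout` is an invented name):

```python
import subprocess


def run_with_timeout(command, timeout=300):
    """Run a command (e.g. a git clone), returning True on success
    and False if it exceeds the timeout. The 300-second default
    mirrors the 5-minute default mentioned above."""
    try:
        subprocess.run(command, check=True, timeout=timeout, capture_output=True)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising
        return False
```

Under this model, passing --timeout 900 to github-archive would correspond to a 15-minute budget per git operation.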

bybor commented 2 years ago

I think that --pull is what I actually needed, thank you for this answer!

If it helps, here is part of the logs\github-archive.logs file. I didn't change the default timeout, but I set -th 1 because of another error, to see if it would resolve it.

2021-09-29 10:32:20,749 - INFO - # GitHub Archive started...
2021-09-29 10:32:20,750 - INFO - # Making API call to GitHub for starred repos...
2021-09-29 10:36:45,226 - INFO - Repo: Wilfred/difftastic clone success!
2021-09-29 10:40:52,583 - INFO - Repo: asamuzaK/withExEditorHost clone success!
2021-09-29 10:43:21,250 - INFO - Repo: farag2/Utilities clone success!
2021-09-29 10:46:21,141 - INFO - Repo: gildas-lormeau/SingleFile clone success!
2021-09-29 10:51:21,162 - ERROR - Git operation timed out archiving cyberduck.
2021-09-29 10:56:21,195 - ERROR - Git operation timed out archiving drawio.
2021-09-29 10:56:23,104 - INFO - Repo: joouha/euporie clone success!
2021-09-29 10:56:24,357 - INFO - Repo: junegunn/limelight.vim clone success!
2021-09-29 10:56:25,762 - INFO - Repo: junegunn/goyo.vim clone success!
2021-09-29 10:56:27,992 - INFO - Repo: junegunn/fzf.vim clone success!
2021-09-29 10:56:27,995 - INFO - GitHub Archive complete! Execution time: 0:24:07.245319.

The folders that timed out are still there:

The messages in the console were different; here is a very short part of the command-line output (it got cut off by the console window):

  File "C:\python39\lib\shutil.py", line 613, in _rmtree_unsafe                                                                                         
    _rmtree_unsafe(fullname, onerror)
  File "C:\python39\lib\shutil.py", line 613, in _rmtree_unsafe
    _rmtree_unsafe(fullname, onerror)
  File "C:\python39\lib\shutil.py", line 613, in _rmtree_unsafe
    _rmtree_unsafe(fullname, onerror)
  File "C:\python39\lib\shutil.py", line 618, in _rmtree_unsafe
    onerror(os.unlink, fullname, sys.exc_info())
  File "C:\python39\lib\shutil.py", line 616, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'c:\\git-archive\\repos\\jgraph\\drawio\\.git\\objects\\pack\\tmp_pack_2Uom54'
Repo: joouha/euporie clone success!
Repo: junegunn/limelight.vim clone success!
Repo: junegunn/goyo.vim clone success!
Repo: junegunn/fzf.vim clone success!
GitHub Archive complete! Execution time: 0:24:07.245319.

Actually, it's OK to close this issue. --pull does the trick, and it's good to know that some repositories are huge, even if the error message is the only indicator of that.

Thanks!

Justintime50 commented 2 years ago

This is actually super helpful, thank you for sharing the output! With that in mind, it looks like there is a bug where the process can't properly remove the bad folder. I'll look into this, as it's definitely a bug. In the meantime, you could manually remove those directories and try again with a longer timeout. Glad you found the --pull flag does what you need!

EDIT: I just tested again and found this is not a problem on Unix and must be specific to Windows. I'll do some testing on Windows and see if I can recreate and patch the problem on that OS.

Justintime50 commented 2 years ago

I've worked a bit on this on Windows with no luck so far. I tried moving the "remove failed folder" logic to the finally block, hoping that once we released the thread we could successfully remove the folder, but an error is still thrown saying a process is still using the .git folder (it appears to be some temporary file writing?). I then thought the subprocess wasn't getting killed correctly and removed the shell=True, with no luck (although that change should probably stay, since we don't need to open a shell here).

If anyone is well versed in multi-threaded file I/O operations on Windows, I'd love some assistance with this. I've traditionally built apps on Unix systems, and since this doesn't happen on Unix, I don't have much insight into this Windows-specific problem without a deep dive. Feel free to pick up this task and put up a PR! I'll continue to circle back as time permits.

bybor commented 2 years ago

Thank you, really appreciate you spending time on this.

I can think of some alternative approaches, but for me the most important outcome is to know that the error happened and I have to deal with it.

1) Catch the exception and log it without halting; suggest the user increase the timeout for pulling the repository, and note that manual cleanup may be required. An error has already happened (or we would not be trying to remove the working copy). Output "there were X errors, check logs" upon completion. If the error happens again on the next run because we couldn't remove the folder, it will be logged again, with the same "there were X errors, check logs" summary upon completion.

2) Try to remove the folder in a separate thread that makes 3 attempts with 1 second between attempts. It may still fail, so option 1) above could be applied on top. Actually, there's no need for a separate thread; it's fine to pause while handling the error. The downside of a separate thread is that the main thread would need to wait for its completion anyway.
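A minimal sketch of option 2 (`remove_with_retries` is an illustrative name, not part of github-archive):

```python
import shutil
import time


def remove_with_retries(path, attempts=3, delay=1.0):
    """Attempt to remove a directory tree, retrying a few times with a
    pause in between in case another process briefly holds a file open
    (as Windows reported in the traceback above)."""
    for attempt in range(attempts):
        try:
            shutil.rmtree(path)
            return True
        except OSError:
            if attempt < attempts - 1:
                time.sleep(delay)
    # All attempts failed: the caller can log this and tell the user
    # that manual cleanup may be required (option 1 above).
    return False
```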

bybor commented 2 years ago

I still think it's fine to close this bug, accepting that there are some limitations to what the tool can do. It could become a "known issue".

Justintime50 commented 2 years ago

I spent way too much time debugging this today and found the reason it's failing on Windows; it's actually twofold.

  1. On Windows, Python cannot gracefully call shutil.rmtree on a read-only directory and instead throws the error:
    PermissionError: [WinError 32] The process cannot access the file because it is being used by another process

    As such, we need to tweak the remove_failed_dir function so that, when removal of the .git folder fails, it marks the path writable and retries:

    import os
    import shutil
    import stat

    def remove_failed_dir(self, path: str):
        """Removes a directory if it fails a git operation due to
        timing out or other errors so it can be retried on the next run.
        """
        def make_dir_writable(function, path, exception):
            """`onerror` handler for shutil.rmtree: mark the offending
            path writable, then retry the removal function that failed."""
            os.chmod(path, stat.S_IWRITE)
            function(path)

        if os.path.exists(path):
            shutil.rmtree(path, onerror=make_dir_writable)
  2. We still get the same error even with this (necessary) code change. This is because, to improve performance for users archiving dozens or hundreds of git assets, we use threading to spin up multiple git operations simultaneously. Similar to the previous problem, Windows cannot gracefully handle deleting a directory that was created inside a thread.

What I will most likely do is create a separate step that runs once all the git operations are complete: scan the github-archive directory for folders that contain only a .git folder (signifying the git operation failed, since the .git folder remains on failure while the rest of the files never arrive) and clean them up there, instead of immediately after a failed clone/pull attempt. That way we will have exited the thread and can take care of all cleanup at once instead of one-off as needed. This isn't ideal, as it adds a layer of complexity to the project along with another full directory scan, but if we want to fully support Windows and allow seamless retries on potential failures, it needs to be done. This process should also be backwards compatible with Unix systems and accomplish what is already possible on those platforms.

A better approach would be to return the names of failed repos from each thread, build a list, and only blow away those folders, which would skip the need to rescan the entire archive; it would, however, require a queue or thread manager to store those details.
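The post-run cleanup scan could look roughly like this (a sketch assuming the owner/repo directory layout seen in the log above; `find_failed_clones` is an invented helper, not the project's code):

```python
import os


def find_failed_clones(archive_dir):
    """Return repo directories whose only entry is a `.git` folder,
    which signals a clone that failed partway through."""
    failed = []
    for owner in sorted(os.listdir(archive_dir)):
        owner_path = os.path.join(archive_dir, owner)
        if not os.path.isdir(owner_path):
            continue
        for repo in sorted(os.listdir(owner_path)):
            repo_path = os.path.join(owner_path, repo)
            if os.path.isdir(repo_path) and os.listdir(repo_path) == [".git"]:
                failed.append(repo_path)
    return failed
```

Each returned path could then be removed after all threads have exited, avoiding the in-thread deletion that Windows rejects.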

bybor commented 2 years ago

@Justintime50 , thank you!

Justintime50 commented 2 years ago

You can now take advantage of this bug fix in v4.5.0!