arthaud / git-dumper

A tool to dump a git repository from a website
MIT License

Git-dumper doesn't work in some cases when the git output has HTML content-type #25

Open DEMON1A opened 3 years ago

DEMON1A commented 3 years ago
[-] Testing https://example.com/.git/HEAD [200]
[-] https://example.com//.git/HEAD responded with HTML
arthaud commented 3 years ago

I think originally I was only checking whether the content contains "<html>", but people had issues with that; see https://github.com/arthaud/git-dumper/pull/13. @DashLt, do you know what the issue with the original check was? In the meantime, you can replace line 33 of git_dumper.py with a return False.

DEMON1A commented 3 years ago

Yeah, I had already edited that line, but the issue was still there. Then I noticed there's a second layer of validation on line 73 that does the same thing as line 33. I edited that too, and now it's working for me.

DashLt commented 3 years ago

Not every site has a <html> tag verbatim. Many have attributes inside the tag, e.g.:

<html class="rwd geo-override no-js vis no-rtl headerfooter-menu3 " lang="en">

It's weird that whatever web server the site you're attacking runs isn't using the application/octet-stream content type, but such servers exist, so it's definitely an edge case that has to be handled. As a quick and dirty fix you could check for the existence of <html, but even then that tag isn't necessarily required. I think maybe some sort of HEAD file validation is in order?
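
To make the HEAD validation idea concrete, here is a minimal sketch. It is not code from git-dumper; the function name `looks_like_git_head` and the 1024-byte size cap are assumptions for illustration. It relies on the fact that a HEAD file normally holds either a symbolic ref line (`ref: refs/...`) or a bare object hash.

```python
import re

# A detached HEAD holds a bare object hash (40 hex chars for SHA-1,
# 64 for SHA-256 repositories); otherwise HEAD is a symbolic ref line.
_HASH_RE = re.compile(r"(?:[0-9a-f]{40}|[0-9a-f]{64})")
_REF_RE = re.compile(r"ref: refs/\S+")


def looks_like_git_head(body: bytes) -> bool:
    """Return True if body plausibly is the content of a .git/HEAD file.

    Hypothetical helper: the name and the size cap are assumptions.
    """
    # HEAD is a tiny file; a large response is almost certainly a web page.
    if len(body) > 1024:
        return False
    try:
        text = body.decode("utf-8").strip()
    except UnicodeDecodeError:
        return False
    return bool(_HASH_RE.fullmatch(text) or _REF_RE.fullmatch(text))
```

Because the check looks at what the body actually contains, it would not matter whether the server labels the response as text/html or application/octet-stream.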

arthaud commented 3 years ago

That's also my conclusion. We would need a reference syntax checker, or we could just skip the verification on that file and fail later when we parse the object files (which need to be compressed with zlib, so that rules out HTML).
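
A rough sketch of the "fail later on the object files" option, assuming the standard loose-object layout (zlib-compressed data starting with `<type> <size>` followed by a NUL byte). The helper name `looks_like_loose_object` and the 64-byte read limit are made up for this example and are not taken from git-dumper.

```python
import zlib

_OBJECT_TYPES = (b"blob", b"tree", b"commit", b"tag")


def looks_like_loose_object(body: bytes) -> bool:
    """Return True if body decompresses to something shaped like a git object.

    Hypothetical helper: only the first 64 decompressed bytes are inspected,
    so a huge or malicious response is never fully inflated.
    """
    try:
        data = zlib.decompressobj().decompress(body, 64)
    except zlib.error:
        # Plain HTML is not a valid zlib stream, so it is rejected here.
        return False
    header, sep, _ = data.partition(b"\x00")
    if not sep:
        return False
    kind, _, size = header.partition(b" ")
    return kind in _OBJECT_TYPES and size.isdigit()
```

An HTML error page fails at the decompression step, which is the property the comment above relies on.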

DEMON1A commented 3 years ago

> Not every site has a <html> tag verbatim. Many have attributes inside the tag, e.g.:

You can solve this with regex, Pattern: \<html(|.*)\>

DEMON1A commented 3 years ago

If you're going to accept the regex solution, I can make the fixes in a PR if you'd like.

DashLt commented 3 years ago

> You can solve this with regex, Pattern: \<html(|.*)\>

https://stackoverflow.com/a/1732454

(In all seriousness, running a regex that matches that much could cause serious slowdowns on pages that can easily reach the hundreds of KB or even MB. You would also be able to send git-dumper back a very large page and make it hang as well. It's in general just a very hacky solution.)
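
For comparison, the quick and dirty `<html` check can be done without a regex and with a hard cap on how much of the response is scanned, which sidesteps the slowdown concern. This is only a heuristic sketch: the name `looks_like_html` and the 4096-byte limit are assumptions, and as noted earlier in the thread a page is not required to contain the tag at all.

```python
def looks_like_html(body: bytes, limit: int = 4096) -> bool:
    """Heuristically detect an HTML page by scanning a bounded prefix.

    Hypothetical helper: the cost is capped by `limit` no matter how
    large the response is.
    """
    prefix = body[:limit].lower()
    # Matches tags with attributes too, e.g. <html class="..." lang="en">.
    return b"<html" in prefix or b"<!doctype html" in prefix
```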

DEMON1A commented 3 years ago

You seem to be right, but I guess in this case we don't really need the HTML content-type validation if we already know that the response contains content from the .git folder. For example, checking for a string in /.git/config would be more than enough to keep fetching the other files without caring about the content type.
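
As an illustration of that idea, a check on /.git/config could look for a `[core]` section header on its own line. This is a sketch under assumptions: the function name `looks_like_git_config` and the 64 KB cap are hypothetical, and it assumes the config has a `[core]` section, which git normally writes when it creates a repository.

```python
def looks_like_git_config(body: bytes) -> bool:
    """Return True if body contains a [core] section header on its own line.

    Hypothetical helper: the name and the size cap are assumptions.
    """
    # A repository config is small; skip anything unreasonably large.
    if len(body) > 64 * 1024:
        return False
    try:
        text = body.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Section names are case-insensitive in git config files.
    return any(line.strip().lower() == "[core]" for line in text.splitlines())
```

Requiring the header to sit on its own line keeps an HTML page that merely mentions `[core]` somewhere in its markup from passing.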