digininja / CeWL

CeWL is a Custom Word List Generator

Performance improvement? #86

Open Shooter3k opened 3 years ago

Shooter3k commented 3 years ago

Is there any way to improve the performance? When using -d 2 or higher, crawl times seem to balloon; runs take days and always end with me killing the task.

I'm not 100% sure what it's doing internally, but perhaps a way for it to make incremental check-ins to the output file, or to report its current progress (what it has found so far), might solve the issue? Or perhaps the index just gets too big for it to process efficiently? (A rough sketch of the check-in idea is below.)
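To make that concrete, here's a minimal sketch in Ruby (CeWL's language) of the kind of incremental check-in I mean; the flush helper, interval, and file names are made up for illustration, not anything CeWL actually has:

```ruby
FLUSH_EVERY = 50  # hypothetical check-in interval, in pages

words = Hash.new(0)  # word => count, built up as pages are parsed
pages_done = 0

# Hypothetical helper: atomically rewrite the output file with everything
# collected so far, so a killed run still leaves a usable partial list.
def flush_words(words, path)
  tmp = "#{path}.tmp"
  File.open(tmp, 'w') do |f|
    words.sort_by { |_w, count| -count }.each { |word, count| f.puts "#{word}, #{count}" }
  end
  File.rename(tmp, path)  # rename is atomic on the same filesystem
end

# Stand-in for the real crawl loop: each iteration is one parsed page.
[%w[custom word list], %w[word list generator]].each do |page_words|
  page_words.each { |w| words[w] += 1 }
  pages_done += 1
  flush_words(words, 'wordlist.txt') if (pages_done % FLUSH_EVERY).zero?
end
flush_words(words, 'wordlist.txt')  # final write at the end of the run
```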

In any case, I'm looking forward to hearing what people suggest.

digininja commented 3 years ago

There are two problems: the app can't know the size of a site before it starts, so it can't show any kind of progress bar, and it is single threaded, so every extra page adds time.

The way to improve it would be to rewrite it to be multithreaded and separate the page parsing from the spider, so the spider can go as fast as it can and just throw all the pages into a parser which can then slowly chomp through them.
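For what it's worth, the shape of that split in Ruby would be something like the sketch below; this is not CeWL code, and the URL list and word regex are stand-ins:

```ruby
require 'net/http'
require 'uri'

pages = Queue.new  # thread-safe hand-off between the spider and the parsers

# Spider (producer): fetch as fast as possible, queue raw bodies for parsing.
urls = %w[https://example.com/ https://example.com/about]  # stand-in URL list
spider = Thread.new do
  urls.each { |u| pages << Net::HTTP.get(URI(u)) }
  pages.close  # lets the parser threads drain the queue and then exit
end

# Parsers (consumers): chomp through the queued pages at their own pace.
words = Queue.new
parsers = 4.times.map do
  Thread.new do
    while (body = pages.pop)  # pop returns nil once the queue is closed and empty
      body.scan(/[a-zA-Z]{3,}/) { |w| words << w.downcase }
    end
  end
end

spider.join
parsers.each(&:join)
puts "collected #{words.size} words"
```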

I have considered rewriting it a few times but never had time.

Shooter3k commented 3 years ago

I had a couple of thoughts about this that I'd like to throw out there...

  1. I use an application called "Screaming Frog SEO Spider" and it has a progress bar that shows how much has been crawled out of how much the spider has found so far. So, even with a single thread, the bar is constantly bouncing around a bit, but it lets you judge (roughly) where things are going. In other words, if the spider finds 10,000 pages in 2 seconds and then 30,000 in 6 seconds, you know it's going to take a really long time; whereas if it finds 6 pages in 2 seconds and then 12 pages in 6 seconds, it's probably not going to take very long. Hopefully that makes sense. [screenshot: Screaming Frog's progress bar a few seconds into a crawl, already suggesting it will take 'a long time']

  2. The second thought would be to add an optional parameter that makes the spider dump its results to a file instead of indexing/crawling them. That would give the user the option to crawl them individually (likely running CeWL multiple times) on their own. If you really wanted to go the extra mile, you could also add an option for CeWL to crawl the results from that file at a later time.

Overall, (IMO) any option that provides some sort of progress indicator, even an arbitrary one ('I'm still running, this is how much I've done so far, and this is how much I think is left'), would be helpful; a rough sketch of such a progress line is below. Right now, using -v or --debug is the only way to verify it's still crawling and not hung up somewhere.
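Something like this, in Ruby; the frontier/counter bookkeeping is my assumption about how a crawler might track things, not CeWL internals:

```ruby
frontier   = ['https://example.com/']  # stand-in start page
crawled    = 0
discovered = frontier.size

until frontier.empty?
  frontier.shift                  # fetching and parsing would happen here
  crawled += 1
  new_links = []                  # links extracted from the page, if any
  discovered += new_links.size
  frontier.concat(new_links)

  # One self-overwriting status line, like Screaming Frog's bar:
  pct = (100.0 * crawled / discovered).round(1)
  print "\rcrawled #{crawled} of #{discovered} found so far (#{pct}%)"
end
puts
```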

digininja commented 3 years ago

I could include something that says how much has been done so far, but there is no way to guess how much is left to do. And as the parsing is done as the spider returns each page, each page hit is also a parse hit, so I couldn't say "hit X pages, parsed Y". The only way to do that would be to split the parser out of the spider and queue parsing as separate jobs. That would allow the spider to go at full speed, but would require either writing each page to storage or a lot of memory, as each page would need to be cached somewhere until it is parsed.

For the second idea, saving the pages out is possible, but that would still require rewriting the main app to split the parser out of the spider so it could then be run on its own over the files.
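If the pages were saved out, that standalone second pass could be as simple as this sketch (the spool directory and word regex are assumptions for illustration, not CeWL's actual extraction logic):

```ruby
# Hypothetical second pass: run word extraction over pages the spider saved
# to disk earlier, completely decoupled from the crawl itself.
words = Hash.new(0)

Dir.glob('spool/*.html') do |path|  # 'spool/' is a made-up dump directory
  File.read(path).scan(/[a-zA-Z]{3,}/) { |w| words[w.downcase] += 1 }
end

words.sort_by { |_w, count| -count }.each { |word, count| puts "#{word}, #{count}" }
```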

The only other way to get a progress bar that means anything would be to run the spider at full speed on its own first to get an idea of the page count, then run the two combined. The problem with that is that some users would fire it off, the spider would get stuck in a loop somewhere, and they would never get any data back because the actual parsing would never happen.

Shooter3k commented 3 years ago

Well, any options you're willing to add would be greatly appreciated. I love the app and use it a lot.

w4po commented 1 year ago

Hello @digininja, I am trying to scrape the Ironman website to solve the last challenge of Cracking JWT keys (Obscure).

But CeWL is really slow. In fact, it just sits there without making any requests, then after an hour or so it continues for a bit and goes idle again, repeatedly.

I had to hibernate my PC twice instead of shutting it down to keep the tool working.

I am using the latest version, CeWL 6.1 (Max Length), on Windows 11.

I had to use a proxy to monitor the work, as there is no indication of progress in the tool itself (it would be CeWL to show any kind of progress).

Command: [screenshot]

Task Manager: [screenshot]

Proxy: [screenshot]

Thanks for the Awesome Auth lab challenges.

digininja commented 1 year ago

I've never used it on Windows so don't know the base performance levels, but it shouldn't be that slow.

I'll see if I can give it a run against the site later and see what speed I get.

w4po commented 1 year ago

The same thing happens in WSL 2.0 Ubuntu. It's extremely slow: I think it starts with 1 or 2 requests per second, then the more pages it gathers the slower it becomes. It's now doing ~1 request every 30 minutes or something like that.

Maybe it's doing some comparison of the new words with the old ones to handle duplicates?
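If it were checking each new word against a plain list, that would explain the progressive slowdown; a quick illustrative comparison in Ruby (purely my speculation, not CeWL's actual code):

```ruby
require 'set'
require 'benchmark'

items = (1..50_000).map { |i| "word#{i}" }

Benchmark.bm(7) do |b|
  # O(n) membership check per insert: total work grows quadratically,
  # which would match the "slower the longer it runs" behaviour.
  b.report('Array') do
    seen = []
    items.each { |w| seen << w unless seen.include?(w) }
  end
  # O(1) average membership check per insert: stays flat as the list grows.
  b.report('Set') do
    seen = Set.new
    items.each { |w| seen.add(w) }
  end
end
```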

digininja commented 1 year ago

I've just installed CeWL on Ubuntu in WSL2, and against my site it is making tens of requests per second. I've also checked on native Ubuntu and Debian and they are the same, tens per second.

That is historically what it has been doing, so I'd guess it's your system that's having problems.

w4po commented 1 year ago

It might be an issue with my system, even though I have a reasonably good one.

I've conducted some additional tests on https://www.ironman.com. Initially it maintains a rate of 1 request per second, but after around 110 requests it begins to slow down. By the time it reaches approximately 200 requests, the rate drops to about half a request per second.

I also tested it on your site, https://digi.ninja/. Initially it makes 3 requests per second for the first 10 requests. After that there's a pause where it doesn't make any requests for a few seconds and instead prints "Offsite link, not following:..."

During this "Offsite link, not following:..." phase, I attempted to stop the process using CTRL + C. It took a few seconds to stop, even though it wasn't actively making requests, just printing the message. This happened after only 10 requests, so it shouldn't have accumulated a significant amount of data (only 2500 lines).

So I think the bottleneck is somewhere in the checking phase.
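One way I could try to pin that down is profiling a short run with the ruby-prof gem; a sketch, assuming ruby-prof's current API (the profiled block here is just a stand-in for the offsite-link checking loop):

```ruby
# gem install ruby-prof, then either run the whole script under the profiler:
#   ruby-prof cewl.rb -- https://digi.ninja/ -d 1
# or wrap just the suspect section:
require 'ruby-prof'

profile = RubyProf::Profile.profile do
  # stand-in for the code under suspicion, e.g. the offsite-link checking
  100_000.times { 'https://example.com/page'.match?(%r{\Ahttps?://}) }
end

RubyProf::FlatPrinter.new(profile).print($stdout)  # hotspots by self time
```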

PS: I'm struggling with the JWT cracking Obscure level. Can you provide any hints?