digininja / CeWL

CeWL is a Custom Word List Generator

Exclude & Allowed Switches Not Behaving as Expected #91

Open 03k64serenity opened 2 years ago

03k64serenity commented 2 years ago

https://github.com/digininja/CeWL/blob/280bfe6f8f57a783cf447c47cfb38ad568177d00/cewl.rb#L814

When regex patterns are supplied in a file for --exclude, or on the command line for --allowed, CeWL does not exclude or allow offsite URLs according to those rules.

digininja commented 2 years ago

Looking at that line of code, it only checks the path, not the domain. Are you expecting it to check the domain as well?

03k64serenity commented 2 years ago

Right. I'd like to be able to use regexes to stop the spider from crawling certain domains and to allow it to crawl others.

digininja commented 2 years ago

That's not currently possible. You could easily tweak that line to check the domain instead. I don't know the property offhand, but try domain instead of path.
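
For illustration, here is a minimal sketch of the difference between matching a pattern against the path and against the domain. It uses Ruby's standard `URI` library, where the domain is exposed as `host`; whether the URL object at the linked line of cewl.rb offers the same property is an assumption that needs checking. The helper names `path_matches?` and `host_matches?` are hypothetical and do not come from CeWL.

```ruby
require 'uri'

# Hypothetical helper, not part of CeWL: true when the URL's path
# matches any of the given regex pattern strings.
def path_matches?(url, patterns)
  path = URI.parse(url).path.to_s
  patterns.any? { |pattern| path.match?(Regexp.new(pattern)) }
end

# Hypothetical helper, not part of CeWL: true when the URL's host
# (the domain) matches any of the given regex pattern strings.
def host_matches?(url, patterns)
  host = URI.parse(url).host.to_s
  patterns.any? { |pattern| host.match?(Regexp.new(pattern)) }
end

url = 'https://blog.example.com/news/index.html'
domain_rule = ['example\.com']

# The path here is "/news/index.html", so a domain rule never matches it.
puts path_matches?(url, domain_rule)
# The host here is "blog.example.com", so the same rule does match.
puts host_matches?(url, domain_rule)
```

A rule written with a domain in mind can only take effect if it is compared against the host, which is consistent with the behaviour reported above.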

03k64serenity commented 2 years ago

Sounds good. Will do. Hey, by the way, after seeing you on the interwebs all these years, I had no idea you were the author of CeWL, so I'm even more impressed and grateful for your contributions to the community.

digininja commented 2 years ago

Glad you like it.

If you get stuck, let me know, and I'll have a look for the right property in the morning.

spencer-dollahite commented 2 years ago

https://github.com/spencer-dollahite/CeWL/blob/master/cewl.rb

This is the sort of approach/feature I'd like to see: both an allowed and an exclude pattern switch, for the domain as well as the path. I know the code here isn't perfect, but I think it is close enough for demo purposes. Thoughts?
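
To make the idea concrete, here is a rough, self-contained sketch of how allowed and exclude rules for both domain and path could combine. It is not the code from the fork linked above and not CeWL's implementation. The class `UrlFilter`, its keyword arguments, and the precedence it applies (an exclude match always wins, and an empty allowed list permits everything) are all assumptions made for illustration.

```ruby
require 'uri'

# Hypothetical filter, not part of CeWL: decides whether a URL may be
# crawled based on allowed/excluded regexes for the host and the path.
class UrlFilter
  def initialize(allowed_hosts: [], excluded_hosts: [],
                 allowed_paths: [], excluded_paths: [])
    @allowed_hosts = allowed_hosts
    @excluded_hosts = excluded_hosts
    @allowed_paths = allowed_paths
    @excluded_paths = excluded_paths
  end

  # A URL is crawled only if both its host and its path are permitted.
  # URLs that cannot be parsed are rejected.
  def crawl?(url)
    uri = URI.parse(url)
    permitted?(uri.host.to_s, @allowed_hosts, @excluded_hosts) &&
      permitted?(uri.path.to_s, @allowed_paths, @excluded_paths)
  rescue URI::InvalidURIError
    false
  end

  private

  # Assumed precedence: any exclude match rejects the value; otherwise
  # an empty allowed list accepts everything, and a non-empty one
  # requires at least one match.
  def permitted?(value, allowed, excluded)
    return false if excluded.any? { |re| value.match?(re) }

    allowed.empty? || allowed.any? { |re| value.match?(re) }
  end
end

filter = UrlFilter.new(
  allowed_hosts: [/(\A|\.)example\.com\z/],
  excluded_paths: [%r{\A/private/}]
)

puts filter.crawl?('https://docs.example.com/guide')
puts filter.crawl?('https://docs.example.com/private/x')
puts filter.crawl?('https://other.test/guide')
```

With these example rules, only the first URL passes: the second is rejected by the path exclude rule and the third by the host allowed list. Keeping the host and path rules separate means a pattern meant for one is never accidentally tested against the other.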

digininja commented 2 years ago

I'll have a look as soon as I get chance.
