digininja / CeWL

CeWL is a Custom Word List Generator

Tel: Protocol Mishandled #98

Open landoncrabtree opened 2 years ago

landoncrabtree commented 2 years ago

Hi,

I'm trying to generate a wordlist, and the page being crawled contains a tel: hyperlink. From the error, it looks like the link is being parsed as a generic URI and then handled like an HTTP URL, which fails:

Error: #<NoMethodError: undefined method `request_uri' for #<URI::Generic tel:+1[redacted]>>
Error: ["./cewl.rb:810:in `block (3 levels) in <main>'", "/Library/Ruby/Gems/2.6.0/gems/spider-0.5.4/lib/spider/spider_instance.rb:207:in `block in allowable_url?'", "/Library/Ruby/Gems/2.6.0/gems/spider-0.5.4/lib/spider/spider_instance.rb:207:in `map'", "/Library/Ruby/Gems/2.6.0/gems/spider-0.5.4/lib/spider/spider_instance.rb:207:in `allowable_url?'", "./cewl.rb:172:in `block (2 levels) in start!'", "./cewl.rb:171:in `select'", "./cewl.rb:171:in `block in start!'", "./cewl.rb:163:in `each'", "./cewl.rb:163:in `start!'", "./cewl.rb:115:in `start_at'", "./cewl.rb:776:in `block in <main>'", "./cewl.rb:766:in `catch'", "./cewl.rb:766:in `<main>'"]
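For reference, the NoMethodError itself can be reproduced outside CeWL with Ruby's standard uri library. This is only a minimal sketch: the phone number is a made-up placeholder (the real one is redacted above), and it does not show what cewl.rb does at line 810, only why a tel: link has no request_uri.

```ruby
require "uri"

# Placeholder number (assumption); the number in the report is redacted.
link = URI.parse("tel:+15555550100")

# No tel: handler is registered in the standard library, so the result
# is a plain URI::Generic rather than a URI::HTTP.
puts link.class
puts link.respond_to?(:request_uri)

# request_uri is defined on URI::HTTP (and inherited by URI::HTTPS),
# so calling it on a URI::Generic raises NoMethodError.
begin
  link.request_uri
rescue NoMethodError => e
  puts "#{e.class}: #{e.message}"
end

# For comparison, an http URL does respond to request_uri.
http_link = URI.parse("http://example.com/contact?x=1")
puts http_link.request_uri
```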

I'm using the latest version of CeWL; I built it yesterday.

I haven't been able to test it, but this might also be reproducible with other schemes: ftp://, mailto:, sms:, etc. I think the easiest way to avoid this is to simply not attempt to crawl links that aren't http or https.
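The idea of skipping non-HTTP(S) links could look roughly like the sketch below. The crawlable? helper, the CRAWLABLE_SCHEMES constant, and the sample links are all hypothetical names made up for illustration; they are not part of CeWL or the spider gem, and this is not necessarily how the eventual fix works.

```ruby
require "uri"

# Hypothetical allow-list (assumption): only these schemes get crawled.
CRAWLABLE_SCHEMES = %w[http https].freeze

# Hypothetical helper: true only for absolute http/https links.
# Links that fail to parse are treated as not crawlable.
def crawlable?(href)
  uri = URI.parse(href)
  CRAWLABLE_SCHEMES.include?(uri.scheme&.downcase)
rescue URI::InvalidURIError
  false
end

# Made-up sample links covering the schemes mentioned above.
links = [
  "https://example.com/about",
  "tel:+15555550100",
  "mailto:someone@example.com",
  "ftp://example.com/file.txt",
  "sms:+15555550100"
]

crawlable = links.select { |href| crawlable?(href) }
puts crawlable
```

Note that a relative link such as /about has no scheme and would be rejected by this check, so in a real crawler it would need to be resolved against the page URL before filtering.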

Thanks in advance!

digininja commented 2 years ago

This commit should fix it.

https://github.com/digininja/CeWL/commit/e95cfaa4c8e4050999a54508fed4e0fc242fe106