bee-san / pyWhat

🐸 Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️
MIT License
6.52k stars 350 forks

fix: improved URL regex #230

Closed amadejpapez closed 2 years ago

amadejpapez commented 2 years ago

⚠ Pull Requests not made with this template will be automatically closed 🔥

Prerequisites

Why do we need this pull request?

This should fix a few issues we were seeing with URLs. I have gone through the regex and modified some parts. There may still be some cases that fail, but with these changes I saw much better results.

I also added more Examples, and https://www.google.com now matches fully.

I have written an explanation of the regex, from the start of the URL to the end, to make it easier and quicker to review. Feedback is welcome, so it can get even better. :)

(?i)(?:(?:https?|ftp):\/\/)?(?:\S+:\S+@)?(?:[a-z0-9-_~]+\.)*[a-z0-9-]{1,62}\.(?:COM|IO|BLOG|ORG|TECH)(?::\d{2,5})?(?:\/[a-z0-9-_~.]+)*(?:[?#]\S*)*\/?
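For anyone who wants to try the pattern quickly, here is a minimal sketch using Python's standard `re` module. The names `URL_PATTERN` and `is_full_url` are illustrative assumptions and are not part of pyWhat. The pattern is copied as written above, with only the five TLDs shown there, so treat this as a rough check rather than how pyWhat itself loads or applies the regex.

```python
import re

# Pattern copied verbatim from the description above, split across lines
# only for readability. The TLD alternation is limited to the five TLDs
# shown there.
URL_PATTERN = re.compile(
    r"(?i)(?:(?:https?|ftp):\/\/)?(?:\S+:\S+@)?(?:[a-z0-9-_~]+\.)*"
    r"[a-z0-9-]{1,62}\.(?:COM|IO|BLOG|ORG|TECH)(?::\d{2,5})?"
    r"(?:\/[a-z0-9-_~.]+)*(?:[?#]\S*)*\/?"
)


def is_full_url(text: str) -> bool:
    """Return True if the whole string matches the pattern (hypothetical helper)."""
    return URL_PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    samples = [
        "https://www.google.com",
        "ftp://user:pass@files.example.org:2121/pub/readme.txt",
        "HTTPS://Example.IO/",
        "example.xyz",   # TLD is not in the short list above
        "not a url",
    ]
    for sample in samples:
        print(f"{sample!r}: {is_full_url(sample)}")
```

Using `fullmatch` instead of `search` makes partial matches visible: a URL that is only matched up to some point is reported as a failure, which is the behaviour the https://www.google.com case above is about.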

What GitHub issues does this fix?

Copy / paste of output

Please copy and paste the output of PyWhat with your new addition using an example that tests this addition below:

codecov-commenter commented 2 years ago

Codecov Report

Merging #230 (42491e8) into main (071a962) will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #230   +/-   ##
=======================================
  Coverage   92.60%   92.60%           
=======================================
  Files          15       15           
  Lines        1217     1217           
=======================================
  Hits         1127     1127           
  Misses         90       90           

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 071a962...42491e8. Read the comment docs.

amadejpapez commented 2 years ago

> Please change url generation script

What change is needed? Regex is no longer hard-coded in there.

ghost commented 2 years ago

> Please change url generation script

> What change is needed? Regex is no longer hard-coded in there.

Oh, that is great!