bee-san / pyWhat

🐸 Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️
MIT License
6.58k stars 349 forks

Add support for list of regex in regex.json #244

Open GauriBodke opened 2 years ago

GauriBodke commented 2 years ago

⚠ Pull Requests not made with this template will be automatically closed 🔥

Prerequisites

Why do we need this pull request?

What GitHub issues does this fix?

Copy / paste of output

(pyWhat) C:\Users\WIN10\OneDrive\Documents\GitHub\pyWhat>python pywhat\what.py 1234567
Matched on: 1234567
Name: Phone Number
(pyWhat) C:\Users\WIN10\OneDrive\Documents\GitHub\pyWhat>python pywhat\what.py "654 2345"
Matched on: 654 2345
Name: Phone Number
(pyWhat) C:\Users\WIN10\OneDrive\Documents\GitHub\pyWhat>python pywhat\what.py "+13528880000"
Matched on: +13528880000
Name: Phone Number
Description: Location(s): United States

Matched on: 135288800
Name: American Social Security Number
Description: An American Identification Number

Matched on: 13528880000
Name: Turkish Identification Number

nodtem66 commented 2 years ago

@GauriBodke Good work. The author has gone quiet, though, so please wait for some reviews.

My review

The author proposed a new way to improve the readability of regex.json.

This PR merges entries in regex.json that share the same name. Existing regexes whose subpatterns are joined with '|' then need to be split into multiple entries.

Old regex.json

[ 
{
      "Name": "TryHackMe Flag Format",
      "Regex": "(?i)^thm{.*}|tryhackme{.*}$",
      "plural_name": false,
      "Description": "Used for Capture The Flags at https://tryhackme.com",
      "Rarity": 1,
      "URL": null,
      "Tags": [
         "CTF Flag"
      ],
      "Examples": {
         "Valid": [
            "thm{hello}"
         ],
         "Invalid": []
      }
   }
]
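
One thing to check when splitting the old pattern: in plain Python `re`, `|` has the lowest precedence, so in the old regex `^` anchors only the `thm` branch and `$` anchors only the `tryhackme` branch. Fully anchored split patterns are therefore stricter. A minimal sketch of the difference, using only the standard `re` module (pyWhat may handle anchors in its own way, so treat this as an illustration rather than a description of its behaviour):

```python
import re

# Old single pattern: "^" binds to the first branch, "$" to the second.
old = re.compile(r"(?i)^thm{.*}|tryhackme{.*}$")

# Fully anchored patterns, one per flag format.
split = [
    re.compile(r"(?i)^thm{.*}$"),
    re.compile(r"(?i)^tryhackme{.*}$"),
]


def split_matches(text):
    """Return True if any of the split patterns matches the text."""
    return any(p.search(text) for p in split)


for text in ["thm{hello}", "thm{hello} trailing", "prefix tryhackme{hello}"]:
    print(f"{text!r}: old={bool(old.search(text))} split={split_matches(text)}")
```

Only the first input matches under both forms; the other two match the old pattern but not the split ones.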

New regex.json

[ 
{
    "Name": "TryHackMe Flag Format",
    "Regex": "(?i)^thm{.*}$",
    "plural_name": false,
    "Description": "Used for Capture The Flags at https://tryhackme.com",
    "Rarity": 1,
    "URL": null,
    "Tags": [
       "CTF Flag"
    ],
    "Examples": {
       "Valid": [
          "thm{hello}"
       ],
       "Invalid": []
    }
 },
{
    "Name": "TryHackMe Flag Format",
    "Regex": "(?i)^tryhackme{.*}$",
    "plural_name": false,
    "Description": "Used for Capture The Flags at https://tryhackme.com",
    "Rarity": 1,
    "URL": null,
    "Tags": [
       "CTF Flag"
    ],
    "Examples": {
       "Valid": [
          "tryhackme{hello}"
       ],
       "Invalid": []
    }
 }
]
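
To make the "merge entries with the same name" idea concrete, here is a rough sketch of how a loader could group such entries. The helper names `group_by_name` and `matches` are hypothetical and not taken from the PR, and the entries are trimmed to the two keys that matter here:

```python
import re
from collections import defaultdict

# Hypothetical, trimmed-down entries; real regex.json entries carry more keys.
entries = [
    {"Name": "TryHackMe Flag Format", "Regex": "(?i)^thm{.*}$"},
    {"Name": "TryHackMe Flag Format", "Regex": "(?i)^tryhackme{.*}$"},
]


def group_by_name(entries):
    """Collect the patterns of all entries that share a Name."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["Name"]].append(entry["Regex"])
    return dict(grouped)


def matches(name, text, grouped):
    """Return True if any pattern registered under `name` matches `text`."""
    # Each pattern is searched on its own instead of being re-joined with "|",
    # because an inline flag such as (?i) is only valid at the start of a pattern.
    return any(re.search(pattern, text) for pattern in grouped[name])


grouped = group_by_name(entries)
print(matches("TryHackMe Flag Format", "thm{hello}", grouped))
```

The grouping works, but every shared field (Description, Rarity, Tags, ...) still has to be repeated in each entry of the file.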

This differs from the method proposed by @amadejpapez, which uses a list as the value of the "Regex" key:

   {
      "Name": "TryHackMe Flag Format",
      "Regex": ["(?i)^thm{.*}$", "(?i)^tryhackme{.*}$"],
      "plural_name": false,
      "Description": "Used for Capture The Flags at https://tryhackme.com",
      "Rarity": 1,
      "URL": null,
      "Tags": [
         "CTF Flag"
      ],
      "Examples": {
         "Valid": [
            "thm{hello}"
         ],
         "Invalid": []
      }
   }
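
A minimal sketch of how a loader might support this format while still accepting the existing single-string entries. The functions `compile_entry` and `first_match` are hypothetical names for illustration and are not part of pyWhat:

```python
import json
import re


def compile_entry(entry):
    """Compile the entry's "Regex" value, which may be a string or a list of strings."""
    raw = entry["Regex"]
    patterns = [raw] if isinstance(raw, str) else list(raw)
    return [re.compile(pattern) for pattern in patterns]


def first_match(entry, text):
    """Return the text matched by the first pattern that hits, or None."""
    for pattern in compile_entry(entry):
        found = pattern.search(text)
        if found:
            return found.group(0)
    return None


# Trimmed-down version of the entry above; other keys are omitted for brevity.
entry = json.loads(
    '{"Name": "TryHackMe Flag Format",'
    ' "Regex": ["(?i)^thm{.*}$", "(?i)^tryhackme{.*}$"]}'
)

print(first_match(entry, "thm{hello}"))
print(first_match(entry, "not a flag"))
```

Accepting both a string and a list would let the database be migrated one entry at a time.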

My opinion

This PR does fix the readability issue the author raised. The main drawback is that regex.json ends up with many duplicated entries that share a name and differ only in their Regex, which also inflates the file unnecessarily. So I prefer the method proposed by @amadejpapez.

amadejpapez commented 2 years ago

I agree. Having the Regex key as a list, with each regex format as a separate list element, is what we want.

@GauriBodke Sorry if this wasn't clear enough from the start and if my comment with the example was too late. If you could change your PR to that, it would be amazing :)

You can take the THM regex list from my example and update PyWhat to work with it. Then I can go through the database and separate more regexes into lists.