Extract domain names without URI scheme

deadbits commented 5 years ago

I was trying to pull a list of domains out of a text file (sample input and expected output below), but I think iocextract doesn't recognize anything without a URI scheme.

Would it be possible to add an --extract-domains option, or to have --extract-urls optionally ignore the scheme? Just random thoughts; I'm not sure of the best way to handle this given how complicated the regex is.

If it's any help, this pattern ([a-zA-Z0-9-_]+(\.)+)?([a-z0-9-_]+)*\.+[a-z]{2,63} should match pretty much any domain name up to the TLD.

matches:

google.com
foo.mywebsite.io
hack-the-planet.com
asdf-fdsa.foo-bar.com
foo-bar.domain.name.com

Sample Input

GLOBAL
Pool    Location    Total Fee/Donations Hashrate    Miners  Link
supportXMR.com
PPLNS exchange payout custom threshold workerIDs email monitoring SSL Android APP   DE,FR,US,CA,SG  0.6 %   86.79 MH/s  7228
xmrpool.net
PPS PPLNS SOLO exchange payout custom threshold workerIDs email monitoring SSL  USA/EU/Asia 0.4-0.6 %   642.32 KH/s 179
xmr.nanopool.org
PPLNS exchange payout workerIDs email monitoring SSL    USA/EU/Asia 1 % 105.52 MH   6155
 minergate.com
possible share skimming! People complaining about poor hashrate.
RBPPS PPLNS USA/EU  1-1.5 % 26.50 MH/s  37467
viaxmr.com
PPLNS exchange payout custom threshold workerIDs email monitoring SSL   US/UK/AU/JP 0.4 %    API problem     API problem
monero.hashvault.pro

Was hoping to get output of:

supportXMR.com
xmrpool.net
monero.hashvault.pro
minergate.com

deadbits commented 5 years ago

Well, my regex above came close to working in Python, though it doesn't behave quite how I expected.
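
For anyone following along, here is a minimal sketch of how the pattern from the first comment behaves under Python's standard `re` module. The helper name `first_match` and the choice of three test strings are illustrative assumptions, not part of iocextract.

```python
import re

# Pattern copied verbatim from the first comment in this thread.
PATTERN = re.compile(r"([a-zA-Z0-9-_]+(\.)+)?([a-z0-9-_]+)*\.+[a-z]{2,63}")

samples = [
    "google.com",
    "foo-bar.domain.name.com",
    "supportXMR.com",
]


def first_match(text):
    """Return the first substring the pattern matches, or None (hypothetical helper)."""
    m = PATTERN.search(text)
    return m.group(0) if m else None


for s in samples:
    print(s, "->", first_match(s))
```

Run as written, this prints `google.com` for the first string, `foo-bar.domain.name` for the second, and `.com` for the third. The pattern allows at most three dot-separated parts, and its middle group accepts only lowercase letters, so names with more labels or with uppercase characters are only partially matched.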

rshipp commented 5 years ago

Unfortunately, there are no plans to add domain support at the moment. We explicitly left out domain extraction because the false positive rate is extremely high.

My recommendation here would be to use the custom regex support to add any regexes you'd like to use. This is supported by both the CLI and the library. Let me know if you have any questions getting that set up.
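
As a rough illustration of the custom-regex route, the sketch below uses only Python's standard `re` module rather than iocextract itself. The pattern `DOMAIN_RE` and the helper `extract_domains` are hypothetical names introduced here, and the pattern is one possible approach, not something taken from the library. How it would be wired into iocextract's custom regex support is not shown and would need to be checked against the project's documentation.

```python
import re

# Hypothetical pattern (not from iocextract): dot-separated labels followed by
# an alphabetic TLD of 2-63 letters, with no URI scheme required.
DOMAIN_RE = re.compile(
    r"(?<![A-Za-z0-9_.@/-])"                                 # not mid-token, not after '@' or '/'
    r"(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+"  # one or more labels, each ending in '.'
    r"[A-Za-z]{2,63}"                                        # alphabetic TLD
    r"(?![A-Za-z0-9_-])"                                     # TLD must end the token
)

# Lines taken from the sample input posted in this thread.
SAMPLE = """\
supportXMR.com
PPLNS exchange payout custom threshold workerIDs email monitoring SSL Android APP   DE,FR,US,CA,SG  0.6 %   86.79 MH/s  7228
xmrpool.net
PPS PPLNS SOLO exchange payout custom threshold workerIDs email monitoring SSL  USA/EU/Asia 0.4-0.6 %   642.32 KH/s 179
xmr.nanopool.org
PPLNS exchange payout workerIDs email monitoring SSL    USA/EU/Asia 1 % 105.52 MH   6155
 minergate.com
possible share skimming! People complaining about poor hashrate.
RBPPS PPLNS USA/EU  1-1.5 % 26.50 MH/s  37467
viaxmr.com
PPLNS exchange payout custom threshold workerIDs email monitoring SSL   US/UK/AU/JP 0.4 %    API problem     API problem
monero.hashvault.pro
"""


def extract_domains(text):
    """Return unique domain-like strings in order of first appearance."""
    return list(dict.fromkeys(m.group(0) for m in DOMAIN_RE.finditer(text)))


for domain in extract_domains(SAMPLE):
    print(domain)
```

On this sample the sketch prints all six pool names, including xmr.nanopool.org and viaxmr.com, and skips the numeric fields such as `0.6` and `86.79` because the final label must be alphabetic. It still illustrates the false positive problem mentioned above: a file name like `report.pdf` would match just as readily as a real domain.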

deadbits commented 5 years ago

Sounds good to me. I'll probably rely on the custom regex for the time being. It's trickier than I thought.

Thanks for the quick response!

On Dec 6, 2018, 3:35 PM -0500, Ryan Shipp notifications@github.com, wrote:

> Unfortunately there are no plans to add domain support at the moment. We explicitly left out domain extraction because the false positives are extremely high.
>
> My recommendation here would be to use the custom regex support to add any regexes you'd like to use. This is supported by both the CLI and the library. Let me know if you have any questions getting that set up.

InQuest / iocextract

Extract domain names without URI scheme #25