Whoisdoma / WhoisParser

Tip on dealing with whois.nic.<gtld> and captchas #3

Closed mzpqnxow closed 1 year ago

mzpqnxow commented 2 years ago

Hello, I'm working on a project that does extensive WHOIS parsing at quite a large scale, so I can relate to your work. I noticed that you're documenting which gTLDs your project supports and even made a table of them. Some entries have comments about the captchas that many of the HTTP-based ones use. FWIW, I found it's relatively easy to "break" these captchas using just ImageMagick's convert and tesseract. YMMV, but it breaks about 9 out of 10 for me. It's very hacky and unsophisticated, as I'm not a computer-vision or image-analysis guy, but here goes:

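#!/bin/bash
# Usage: pass the captcha PNG as the first argument
INFILE="$1"
OUTFILE="$(basename "$INFILE" .png)-clean.png"
# Convert to grayscale and stretch the levels so the glyphs stand out
convert "$INFILE" -colorspace Gray -blur 0 -level 0,60% "$OUTFILE"
# --psm 8 treats the image as a single word; the whitelist restricts output to alphanumerics
tesseract "$OUTFILE" stdout --psm 8 --oem 1 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 --dpi 70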
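
If you need to run this over a whole directory of saved captchas, the same two commands can be wrapped in a few functions. This is only a rough sketch: clean_name, solve_captcha and solve_dir are placeholder names I made up for illustration, and it assumes convert and tesseract are on your PATH and that the captchas are PNG files.

```shell
#!/bin/bash
# Sketch only: the function names below are placeholders, not part of any tool.
# Assumes ImageMagick's convert and tesseract are installed and on PATH.

WHITELIST=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789

# Derive the name of the cleaned image from the input path.
clean_name() {
    printf '%s-clean.png\n' "$(basename "$1" .png)"
}

# Clean one captcha image and print tesseract's guess on stdout.
solve_captcha() {
    local infile="$1" outfile
    outfile="$(clean_name "$1")"
    convert "$infile" -colorspace Gray -blur 0 -level 0,60% "$outfile" || return 1
    tesseract "$outfile" stdout --psm 8 --oem 1 \
        -c tessedit_char_whitelist="$WHITELIST" --dpi 70
}

# Print "file<TAB>guess" for every PNG in the given directory.
solve_dir() {
    local f
    for f in "$1"/*.png; do
        [ -e "$f" ] || continue                      # nothing matched the glob
        case "$f" in *-clean.png) continue ;; esac   # skip images cleaned earlier
        printf '%s\t%s\n' "$f" "$(solve_captcha "$f" | head -n 1)"
    done
}
```

After sourcing the file, calling solve_dir with a directory path prints one line per captcha. The cleaned images are written to the current working directory, same as in the one-shot script.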

Please close this issue when you see it

Hope it's helpful