A homoglyph is a set of one or more graphemes that looks identical or very similar to some other set of graphemes.
For example:
- 6 (DIGIT SIX) and б (CYRILLIC SMALL LETTER BE)
- w (LATIN SMALL LETTER W) and ω (GREEK SMALL LETTER OMEGA)
- oo (2 x LATIN SMALL LETTER O) and က (MYANMAR LETTER KA)
- E (LATIN CAPITAL LETTER E), Ε (GREEK CAPITAL LETTER EPSILON) and Е (CYRILLIC CAPITAL LETTER IE)
- V (LATIN CAPITAL LETTER V) and \/ (REVERSE SOLIDUS + SOLIDUS)
Homoglyphs are:
- Font-dependent: т in cursive in some fonts looks like m.
- Subjective: are a and а homoglyphs? Sure! How about ź and ž? Probably yes. What will you say about R and Я? Er.... You see the point?
- Fun: replace ; (SEMICOLON) with ; (GREEK QUESTION MARK) in someone's code and watch them trying to debug code that looks perfectly fine :)
use HomoGlypher;
my %cyrillic = (
'6' => [ 'б' ],
'a' => [ 'а' ],
'b' => [ 'б', 'ь' ],
'r' => [ 'г' ]
);
my %greek = (
'a' => [ 'α' ],
'o' => [ 'ο' ]
);
my %myanmar = (
'oo' => [ 'က' ]
);
my $hg = HomoGlypher.new;
$hg.add-mapping( %cyrillic );
$hg.add-mapping( %greek );
$hg.add-mapping( %myanmar );
my @unwinded = $hg.unwind( 'foo' ); # [ 'foο', 'fοo', 'fοο', 'fက' ]
my @collapsed = $hg.collapse( 'бαг' ); # [ 'bar', '6ar' ]
my $randomized = $hg.randomize( 'bar', level => 80 ); # for example 'bαr'
my &tokenized = $hg.tokenize( );
say so 'bαг' ~~ / <&tokenized: 'bar'> /; # True
When dealing with homoglyphs, the easiest way to debug them is the built-in uniname / uninames method:
$ raku -e '.say for "fοο".uninames'
LATIN SMALL LETTER F
GREEK SMALL LETTER OMICRON
GREEK SMALL LETTER OMICRON
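The same idea can be used to compare a suspicious string against the text you expected. The snippet below is only a sketch: it uses plain Raku built-ins (comb, Z, uniname), the variable names are made up for illustration, and it assumes both strings have the same number of graphemes.

```raku
my $expected = 'foo';
# spelled with \c[...] so the Greek letters are explicit in the source
my $suspect  = "f\c[GREEK SMALL LETTER OMICRON]\c[GREEK SMALL LETTER OMICRON]";

my @differences;
# walk both strings in parallel, one grapheme at a time
for $expected.comb Z $suspect.comb -> ( $e, $s ) {
    next if $e eq $s;
    @differences.push: "{ $e.uniname } -> { $s.uniname }";
}

.say for @differences;
# LATIN SMALL LETTER O -> GREEK SMALL LETTER OMICRON
# LATIN SMALL LETTER O -> GREEK SMALL LETTER OMICRON
```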
add-mapping merges the given mapping (a Hash of Arrays) with the existing mappings.
Keys are typically composed of ASCII characters. Duplicates are filtered out automatically. Multi-character glyphs can be used in both keys and values:
my %mapping = (
'IO' => [ 'Ю' ],
'P' => [ '|Ͻ']
);
You can inspect the merged mappings under $hg.mappings, just do not modify them directly.
If you want to fine-tune them, fetch the merged result, tweak it and add it to a new HomoGlypher object.
A few ready-to-use mappings are provided in HomoGlypher::Mappings:
@basic
- ASCII letters and digits that are faked by completely different characters: ΤꜦꜪ QՍΙᴄк вᚱՕꓪɴ ꓝᏅХ
jսოр𐑈 օ𐐷еᎱ tᏥе ιαzႸ Ժօց
ОᛐշʒᏎƼỼ7ꝸᏭ
Consists of:
%armenian
%cherokee
%cyrillic
%deseret
%greek
%greek-mathematical-typeface
%georgian
%latin
%lisu
%myanmar
%roman-numerals
%runic
%math-symbols
@typeface
- ASCII letters and digits that have typeface styles applied, base characters are not changed: 𝗧𝕳𝓔 𝒬𝕌𝕀𝙲𝔎 𝔹𝗥OW𝓝 𝘍𝕆𝗫
𝒿𝓾𝗺𝚙𝕤 𝔬𝘃𝘦𝓇 𝔱𝘩𝘦 𝖑𝖆𝕫𝔂 𝗱𝓸𝔤
𝟘𝟙2𝟹4𝟻𝟼𝟽𝟠𝟡
Consists of:
%ballot
%ballot-bold-script
%ballot-script
%bold
%bold-fraktur
%bold-italic
%bold-script
%doublestruck
%doublestruck-italic
%fraktur
%fullwidth
%heavy-ballot
%italic
%monospace
%sansserif
%sansserif-bold
%sansserif-bold-italic
%sansserif-italic
%script
%accented
- ASCII letters that have accents applied, base characters are not changed: ȚȞȆ ꝖṲÏÇꝂ ḂŔǾⱲṆ ḞṌẌ
ĵữṁꝕṩ ǭⱱëȑ ʈẖḕ ļǟʐȳ ɗȫǵ
Try to read it out loud... Correctly :)
%control
- ASCII printable representations of non printable characters: P␆ ␎ME ␖THE␏SE␞
These have perfect similarity, but the letters are very cramped and such acronyms are unlikely to be found in regular language.
%flipped
- ASCII letters, digits and symbols that are faked by some completely different characters in various rotations and mirroring: ꓕH⧢ Ꝺ⋂I𐐣ꓘ ꓭꓤOW𐐥 ꓞOX
jᴝᴟpƨ ᴑ⋏ǝɹ ʇɥɘ ꞁɐzʎ dᴑᵷ
0ᛚ2Ƹ4567∞9
use HomoGlypher;
use HomoGlypher::Mappings;
my $hg = HomoGlypher.new;
$hg.add-mapping( $_ ) for @HomoGlypher::Mappings::basic; # load all basic mappings
$hg.add-mapping( %HomoGlypher::Mappings::accented ); # load single, specific mapping
I won't tell you where to get a perfect, complete, ultimate mapping, because homoglyphs are font-dependent and similarity is subjective. Good starting points for creating your own mappings are the *_alphabet and *_numeral pages on Wikipedia. Or you can borrow mappings from other projects such as Codebox homoglyphs, IronGeek Homoglyph Attack Generator and many others.
unwind generates every possible mapping combination for your ASCII text. Beware: this works only for short inputs, and the list grows really, really fast.
my %cyrillic = (
'6' => [ 'б' ],
'a' => [ 'а' ],
'b' => [ 'б', 'ь' ],
'e' => [ 'е', 'ё' ],
'm' => [ 'м' ],
'p' => [ 'р' ],
'r' => [ 'г' ],
'x' => [ 'х' ]
);
my $hg = HomoGlypher.new;
$hg.add-mapping( %cyrillic );
.say for $hg.unwind( 'example' );
Output list:
examplё
examрle
examрlе
examрlё
exaмple
exaмplе
exaмplё
exaмрle
...
(total 143 combinations)
The main purpose of homoglyph unwinding is to check whether someone is spoofing your domain. See the ready-to-use IDN Checker script.
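The total of 143 for 'example' can be checked by hand: every position contributes its original character plus each mapped homoglyph, the choices multiply, and the unmodified input is left out. The sketch below does that arithmetic with plain Raku. It does not call HomoGlypher, and it assumes only single-character keys take part, which is true for the %cyrillic mapping above.

```raku
my %cyrillic = (
    '6' => [ 'б' ],
    'a' => [ 'а' ],
    'b' => [ 'б', 'ь' ],
    'e' => [ 'е', 'ё' ],
    'm' => [ 'м' ],
    'p' => [ 'р' ],
    'r' => [ 'г' ],
    'x' => [ 'х' ]
);

# choices per position: the original character plus each mapped homoglyph
my @choices = 'example'.comb.map( { 1 + ( %cyrillic{ $_ } // [] ).elems } );
say @choices;    # [3 2 2 2 2 1 3]

# multiply the choices and drop the one combination that is the input itself
my $total = ( [*] @choices ) - 1;
say $total;      # 143
```

Adding a single extra homoglyph for one letter multiplies the whole product, which is why the list grows so fast for longer inputs.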
collapse is the opposite of unwind. If you have suspicious, homoglyphed text, you can check which ASCII texts it might be derived from. Beware: this works only for short inputs.
my %ascii-art = (
'O' => [ '()' ],
'V' => [ '\/' ],
'W' => [ '\/\/' ]
);
my $hg = HomoGlypher.new;
$hg.add-mapping( %ascii-art );
.say for $hg.collapse( '\/()\/\/EL' );
VOVVEL
VOWEL
(as you can see, it may sometimes return more than one possible ASCII text)
The main purpose of homoglyph collapsing is to check whether someone is using your forums, hosting, or other services for phishing or false advertising. See also the tokenize method.
The Unicode::Security module does a similar thing.
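To see why '\/()\/\/EL' yields two candidates, here is a rough sketch of one way collapsing could work. It is not HomoGlypher's actual implementation: the %reverse hash and the collapse-sketch sub are hypothetical names, and the rule that a character is kept unchanged only when no known homoglyph starts at that position is an assumption chosen to reproduce the output above.

```raku
my %ascii-art = 'O' => [ '()' ], 'V' => [ '\/' ], 'W' => [ '\/\/' ];

# invert the mapping: homoglyph => ASCII texts it may stand for
my %reverse;
for %ascii-art.kv -> $ascii, @glyphs {
    %reverse{ $_ }.push( $ascii ) for @glyphs;
}

sub collapse-sketch ( Str $text ) {
    return ( '', ) if $text eq '';

    my @results;
    # try every known homoglyph that the text starts with
    for %reverse.kv -> $glyph, @asciis {
        next unless $text.starts-with( $glyph );
        for collapse-sketch( $text.substr( $glyph.chars ) ) -> $rest {
            @results.push( $_ ~ $rest ) for @asciis;
        }
    }

    # assumption: no known homoglyph starts here, so keep the character as-is
    unless @results {
        @results = collapse-sketch( $text.substr( 1 ) ).map( $text.substr( 0, 1 ) ~ * );
    }

    return @results;
}

my @candidates = collapse-sketch( '\/()\/\/EL' ).sort;
.say for @candidates;
# VOVVEL
# VOWEL
```

The ambiguity comes from '\/\/', which can be read either as one W or as two V glyphs. Every such overlap doubles the number of branches, which is why collapsing is practical only for short inputs.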
tokenize constructs a token that can be used to match homoglyphed text in grammars.
my %greek = (
'a' => [ 'α' ],
'r' => [ 'Γ' ],
);
my $hg = HomoGlypher.new;
$hg.add-mapping( %greek );
my &homoglyphy = $hg.tokenize( );
'foobαΓbaz' ~~ / $<result>=<&homoglyphy: 'bar'> /;
say $/{ 'result' };
「bαΓ」
Beware: the token uses the mappings present at match time.
You can create a token without any mappings added, define a grammar that uses this token, and then add mappings before text is actually matched against the grammar.
If you need tokens with different sets of mappings in one grammar, you can create and tokenize several HomoGlypher instances.
The Regex::FuzzyToken module can be used to catch misspelled phrases. HomoGlypher and FuzzyToken can coexist in a single grammar:
say 'Suspicious!' if $email-text ~~ / [ <fuzzy: 'paypal'> | <&homoglyphy: 'paypal'> ] /;
This will catch both papyal (misspelled) and pαypαl (homoglyphed). And yes, you can nuke phishers and catch misspellings and homoglyphs at the same time:
say 'Suspicious!' if $email-text ~~ / <fuzzy: $hg.unwind('paypal')> /;
This will catch such sneaky phrases as pαpyαl.
randomize replaces characters in text with homoglyphs with a given probability.
my $hg = HomoGlypher.new;
$hg.add-mapping( %HomoGlypher::Mappings::flipped );
say $hg.randomize( 'DIRECTIONS & CAKE ARE A LIE', level => 100 );
⫏Iя∃C⟘IOИƧ ⅋ C∀K⧢ ∀Я∃ ∀ LI∃
Level is a percentage value from 1 to 100 (default 50). It decides whether a possible mapping is used at a given position. Do not confuse it with the share of replaced characters. For example, with the mapping 'a' => [ 'α' ] and level set to 50%, transforming barrrr results in unmodified barrrr with 50% probability (at the second position a transformation was possible but not used) and in modified bαrrrr with 50% probability (at the second position a transformation was possible and used). Each position is rolled individually against level. Each possible replacement glyph has an equal chance of being picked.
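The per-position roll described above can be sketched in a few lines of plain Raku. This is an illustration of the described behavior, not the module's code: randomize-sketch and %mapping are hypothetical names, only single-character keys are handled, and treating a roll of 1..100 that is less than or equal to level as a hit is an assumption.

```raku
my %mapping = 'a' => [ 'α' ];

sub randomize-sketch ( Str $text, Int :$level = 50 ) {
    $text.comb.map( -> $char {
        # homoglyphs available for this character, possibly none
        my @candidates = ( %mapping{ $char } // [] ).list;

        # roll this position individually; every candidate is equally likely
        @candidates && ( 1 .. 100 ).pick <= $level
            ?? @candidates.pick
            !! $char
    } ).join
}

my $always = randomize-sketch( 'barrrr', level => 100 );
say $always;                                  # bαrrrr
say randomize-sketch( 'barrrr' );             # barrrr or bαrrrr, each with 50% probability
```

Only the second position can change here, because the other characters have no mapping. That is why level describes the chance of using an available replacement, not the fraction of the text that ends up replaced.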
The Text::Homoglyph module does a similar thing.