bbkr / HomoGlypher

Homoglyph toolset for Raku language.
Artistic License 2.0


A homoglyph is a set of one or more graphemes that looks identical or very similar to some other set of graphemes.

For example, Latin a and Cyrillic а are homoglyphs, and so are Latin oo and Myanmar က.

SYNOPSIS

use HomoGlypher;

my %cyrillic = (
    '6' => [ 'б' ],
    'a' => [ 'а' ],
    'b' => [ 'б', 'ь' ],
    'r' => [ 'г' ]
);

my %greek = (
    'a' => [ 'α' ],
    'o' => [ 'ο' ]
);

my %myanmar = (
    'oo' => [ 'က' ]
);

my $hg = HomoGlypher.new;

$hg.add-mapping( %cyrillic );
$hg.add-mapping( %greek );
$hg.add-mapping( %myanmar );

my @unwinded = $hg.unwind( 'foo' );    # [ 'foο', 'fοo', 'fοο', 'fက' ]

my @collapsed = $hg.collapse( 'бαг' ); # [ 'bar', '6ar' ]

my $randomized = $hg.randomize( 'bar', level => 80 ); # for example 'bαr'

my &tokenized = $hg.tokenize( );
say so 'bαг' ~~ / <&tokenized: 'bar'> /; # True

HINT

When dealing with homoglyphs, the easiest way to debug them is the uninames method:

$ raku -e '.say for "fοο".uninames'

LATIN SMALL LETTER F
GREEK SMALL LETTER OMICRON
GREEK SMALL LETTER OMICRON
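
When a longer string hides only a few odd characters, it can help to print just the non-ASCII ones. Below is a minimal sketch in core Raku. The name non-ascii-report is made up for this example and is not part of HomoGlypher:

```raku
# Hypothetical helper, not part of HomoGlypher.
# Lists every non-ASCII character of a text with its position and Unicode name.
sub non-ascii-report( Str $text ) {
    my @chars = $text.comb;
    my @report;
    for @chars.kv -> $position, $char {
        next if $char.ord < 128;    # plain ASCII, nothing suspicious
        @report.push: "$position: " ~ $char.uniname;
    }
    return @report;
}

.say for non-ascii-report( 'fοο' );
```

Positions are zero-based, so for the string above both omicrons are reported at positions 1 and 2.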

METHODS

add-mapping

Merges the given mapping (a Hash of Arrays) with the existing mappings.

Keys are typically composed of ASCII characters. Duplicates are filtered out automatically. Multi-character glyphs can be used in both keys and values:

my %mapping = (
    'IO' => [ 'Ю' ],
    'P' => [ '|Ͻ']
);

You can inspect the merged mappings under $hg.mappings, just do not modify them directly. If you want to fine-tune them, fetch the merged result, tweak it and add it to a new HomoGlypher object.
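
To make the merge semantics concrete, here is a rough sketch of merging two Hash-of-Arrays mappings with duplicates dropped. It is plain Raku written for illustration (merge-mappings is a made-up name) and only assumes the behaviour described above. The module's actual implementation may differ:

```raku
# Hypothetical helper illustrating the merge; HomoGlypher's internals may differ.
sub merge-mappings( %existing, %new ) {
    my %collected;
    for flat( %existing.pairs, %new.pairs ) -> $pair {
        # Collect glyphs from both mappings under the same key.
        %collected{ $pair.key }.append: $pair.value.list;
    }
    # Drop duplicated glyphs, keeping first-seen order.
    my %merged = %collected.map: -> $pair { $pair.key => $pair.value.unique.Array };
    return %merged;
}

my %merged = merge-mappings(
    { 'b' => [ 'б' ] },
    { 'b' => [ 'б', 'ь' ], 'a' => [ 'а' ] }
);
say %merged{ 'b' };    # [б ь]
```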

A few ready-to-use mappings are provided in HomoGlypher::Mappings:

use HomoGlypher;
use HomoGlypher::Mappings;

my $hg = HomoGlypher.new;

$hg.add-mapping( $_ ) for @HomoGlypher::Mappings::basic;    # load all basic mappings
$hg.add-mapping( %HomoGlypher::Mappings::accented );        # load single, specific mapping

I won't tell you where to get a perfect, complete, ultimate mapping, because homoglyphs are font-dependent and similarity is subjective. A good starting point for creating your own mappings are the *_alphabet and *_numeral pages on Wikipedia. Or you can borrow mappings from other projects such as Codebox homoglyphs, IronGeek Homoglyph Attack Generator and many others.

unwind

Generates every possible mapping combination for your ASCII text. Beware, this works only for short inputs, because the list grows really, really fast: the number of combinations multiplies with every replaceable character.

my %cyrillic = (
    '6' => [ 'б' ],
    'a' => [ 'а' ],
    'b' => [ 'б', 'ь' ],
    'e' => [ 'е', 'ё' ],
    'm' => [ 'м' ],
    'p' => [ 'р' ],
    'r' => [ 'г' ],
    'x' => [ 'х' ]
);

my $hg = HomoGlypher.new;
$hg.add-mapping( %cyrillic );

.say for $hg.unwind( 'example' );
examplё
examрle
examрlе
examрlё
exaмple
exaмplе
exaмplё
exaмрle
...

(143 combinations in total)
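
The total can be checked by hand before unwinding anything. Each character has one choice for itself plus one per homoglyph, the choices multiply, and the unmodified text is subtracted at the end. Below is a small sketch of that arithmetic, assuming every key is a single character (combination-count is a made-up helper, not part of the module):

```raku
# Hypothetical helper; assumes every mapping key is a single character.
sub combination-count( Str $text, %mapping ) {
    # Each character has 1 (itself) + N (its homoglyphs) choices.
    my @choices = $text.comb.map( -> $char { 1 + ( %mapping{ $char } // [] ).elems } );
    # Multiply the choices, then subtract the unmodified original text.
    return ( [*] @choices ) - 1;
}

my %cyrillic = (
    '6' => [ 'б' ],
    'a' => [ 'а' ],
    'b' => [ 'б', 'ь' ],
    'e' => [ 'е', 'ё' ],
    'm' => [ 'м' ],
    'p' => [ 'р' ],
    'r' => [ 'г' ],
    'x' => [ 'х' ]
);

say combination-count( 'example', %cyrillic );    # 143
```

For 'example' that is 3 * 2 * 2 * 2 * 2 * 1 * 3 - 1 = 143. Every extra replaceable character at least doubles the list, so estimate the size before unwinding longer inputs.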

The main purpose of homoglyph unwinding is to check whether someone is spoofing your domain. See the ready-to-use IDN Checker script.

collapse

The opposite of unwind. If you have suspicious, homoglyphed text, you can check which ASCII texts it might be derived from. Beware, this works only for short inputs.

my %ascii-art = (
    'O' => [ '()' ],
    'V' => [ '\/' ],
    'W' => [ '\/\/' ]
);

my $hg = HomoGlypher.new;
$hg.add-mapping( %ascii-art );

.say for $hg.collapse( '\/()\/\/EL' );
VOVVEL
VOWEL

(as you can see, it may sometimes return more than one possible ASCII text)

The main purpose of homoglyph collapsing is to check whether someone is using your forums, hosting, or other services for phishing or false advertising. See also the tokenize method.

The Unicode::Security module does a similar thing.

tokenize

Constructs a token that can be used to match homoglyphed text in grammars.

my %greek = (
    'a' => [ 'α' ],
    'r' => [ 'Γ' ],
);

my $hg = HomoGlypher.new;
$hg.add-mapping( %greek );

my &homoglyphy = $hg.tokenize( );

'foobαΓbaz' ~~ / $<result>=<&homoglyphy: 'bar'> /;
say $/{ 'result' };
「bαΓ」

Beware, the token uses the mappings present at match time. You can create a token without any mappings added, define a grammar that uses this token and then add mappings before text is actually matched against the grammar. If you need tokens with different sets of mappings in one grammar, you can create and tokenize many HomoGlypher instances.
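
To get a feel for what such a token matches, its effect can be imitated for one fixed phrase with a plain array of candidates interpolated into a regex. This is only a rough sketch in core Raku, with the candidate list written out by hand and a made-up helper name. It covers just the one phrase it was written for and says nothing about how the real token is built:

```raku
# Hand-written candidates for 'bar' with 'a' => [ 'α' ] and 'r' => [ 'Γ' ].
my @candidates = 'bar', 'bαr', 'baΓ', 'bαΓ';

# Hypothetical helper: returns the matched fragment, or Nil when nothing matches.
sub find-homoglyphed( Str $text ) {
    # An array inside a regex is treated as an alternation of literal strings.
    my $match = $text ~~ / @candidates /;
    return $match ?? $match.Str !! Nil;
}

say find-homoglyphed( 'foobαΓbaz' );    # bαΓ
```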

The Regex::FuzzyToken module can be used to catch misspelled phrases. HomoGlypher and FuzzyToken can coexist in a single grammar:

say 'Suspicious!' if $email-text ~~ / [ <fuzzy: 'paypal'> | <&homoglyphy: 'paypal'> ] /;

This will catch both papyal (misspelled) and pαypαl (homoglyphed). And yes, you can nuke phishers and catch misspellings and homoglyphs at the same time:

say 'Suspicious!' if $email-text ~~ / <fuzzy: $hg.unwind('paypal')> /;

This will catch such sneaky phrases as pαpyαl.

randomize

Replaces characters in text with homoglyphs with a given probability.

my $hg = HomoGlypher.new;
$hg.add-mapping( %HomoGlypher::Mappings::flipped );

say $hg.randomize( 'DIRECTIONS & CAKE ARE A LIE', level => 100 );
⫏Iя∃C⟘IOИƧ ⅋ C∀K⧢ ∀Я∃ ∀ LI∃

The level is given as a percentage from 1 to 100 (default 50). It decides whether a possible mapping should be used at a given position. Do not confuse it with the share of replaced characters. For example, say you have the mapping 'a' => [ 'α' ] and the level set to 50%. Transforming barrrr will return the unmodified barrrr with 50% probability (at the second position a transformation was possible but not used) and the modified bαrrrr with 50% probability (at the second position a transformation was possible and used). Each position is rolled individually against the level. Each possible replacement glyph has an equal chance of being picked.
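
The per-position roll can be sketched in a few lines of core Raku. This is an illustration of the rule described above, assuming single-character keys. The name randomize-sketch is made up, and the module's own code may work differently:

```raku
# Hypothetical sketch; assumes single-character keys. Not HomoGlypher's code.
sub randomize-sketch( Str $text, %mapping, Int :$level = 50 ) {
    my @result = $text.comb.map( -> $char {
        my @glyphs = ( %mapping{ $char } // [] ).list;
        # Roll only where a replacement exists. A roll of 1..100 at or below
        # the level means "use a homoglyph here", picked with equal chance.
        @glyphs && ( 1..100 ).pick <= $level ?? @glyphs.pick !! $char;
    } );
    return @result.join;
}

say randomize-sketch( 'barrrr', { 'a' => [ 'α' ] }, level => 50 );
```

Because every position is rolled independently, a text with n replaceable positions comes back unmodified with probability (1 - level/100)^n. For banana with the same mapping and level 50 that is 0.5^3, or 12.5%.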

The Text::Homoglyph module does a similar thing.