NCSC-NL / taranis3

Taranis
Other
59 stars 17 forks source link

Wordlists not working as expected #41

Closed kmizeta closed 3 years ago

kmizeta commented 3 years ago

Hi, if you put only one wordlist works as expected, that's: the item matches the word and it's marked as unreaded. This use case is working with two different wordlists (matches both). But, if you put the same two wordlist (with AND), it does't matter how this is configured and the same item can't be matched and gets the "readed" state. I'm using v3.7 of Taranis. Could you please help me with this? thanks.

markov2 commented 3 years ago

I probably see the bug: pm/Taranis/Collector.pm line 467

- $also{lc $_}   for $text =~ /$and_wl/g;
+ $also{lc $_}++ for $text =~ /$and_wl/g;

Do you know how to patch this to test it? -- Success, MarkOv


drs Mark A.C.J. Overmeer MARKOV Solutions Mark@Overmeer.net solutions@overmeer.net http://Mark.Overmeer.net http://solutions.overmeer.net

kmizeta commented 3 years ago
  • kmizeta (notifications@github.com) [201016 21:53]: if you put only one wordlist works as expected, that's: the item matches the word and it's marked as unreaded. This use case is working with two different wordlists (matches both). But, if you put the same two wordlist (with AND), it does't matter how this is configured and the same item can't be matched and gets the "readed" state. I'm using v3.7 of Taranis. Could you please help me with this? thanks. I probably see the bug: pm/Taranis/Collector.pm line 467 - $also{lc $_} for $text =~ /$andwl/g; + $also{lc $}++ for $text =~ /$and_wl/g; Do you know how to patch this to test it? -- Success, MarkOv

Not really, I don't have much knowledge on Perl. Do I have to re-deploy something?, or just modify this file and restart Apache? /home/taranis/sources/taranis-3.7.4/pm/Taranis/Collector.pm allready modified it, restarted apache and I'm having the same results. Also, the same file it's on another path. Both files modified, apache restarted and I'm getting the same results. /home/taranis/taranis-3.7.4/perl5/Taranis/Collector.pm

really appreciate your help!

------------------------------------------------------------------------ drs Mark A.C.J. Overmeer MARKOV Solutions Mark@Overmeer.net solutions@overmeer.net http://Mark.Overmeer.net http://solutions.overmeer.net

kmizeta commented 3 years ago

learning coding in perl, solved by adding some code, not so efficient, but it works!

foreach my $wl (@$wordlists) { my $has_wl = $wl->{match}; my $and_wl = $wl->{and_match}; my %matchA; my %matchB;

            foreach my $text (@text) {
                    my %found;
                    $found{lc $_}++ for $text =~ /$has_wl/g;
                    $matchA{$_}++ for keys %found;
                    if($and_wl) {
                            my %also;
                            $also{lc $_}++ for $text =~ /$and_wl/g;
                            $matchB{$_}++ for keys %also;
                    }

            }
            if($and_wl){
                    if(%matchB){
                            $matches{$_}++ for keys %matchA;
                            $matches{$_}++ for keys %matchB;
                    }
            } else {
                    $matches{$_}++ for keys %matchA;
            }

    }
markov2 commented 3 years ago

Hi,

Bending my head around this code was not easy.

I think the real mistake is:

       -    $also{$_} or delete $found{$_} for keys %found;
       +    delete $also{$_} and delete $found{$_} for keys %found;

This handles a weird bump in the ancient code, where it only returns hits when they are not in both lists at the same time. That means that the second list can be used as "exclude" filter on the first... but that's not documented anywhere.

But maybe in use somewhere: have to ask around. Although I would expect that they had stumbled on the bug you found.

... read on below ...

This

                    $found{lc $_}++ for $text =~ /$has_wl/g;
                    $matchA{$_}++ for keys %found;

is equivalent to

                    $matchA{$_}++ for $text =~ /$has_wl/g;

And this

                            my %also;
                            $also{lc $_}++ for $text =~ /$and_wl/g;
                            $matchB{$_}++ for keys %also;

is equivalent to

                            $matchB{$_}++ for $text =~ /$and_wl/g;

                    }

            }

This whole block ...

            if($and_wl){
                    if(%matchB){
                            $matches{$_}++ for keys %matchA;
                            $matches{$_}++ for keys %matchB;
                    }
            } else {
                    $matches{$_}++ for keys %matchA;
            }

... is the same as

$matches{$_}++ for keys %matchA, keys %matchB;

The main functional change by your code, is that it will have the match on both list covering all text fragments (title, description and link are passed as separate texts) together, where it does match them separately in the current Taranis code.

I do agree that your expectations are more sane than the current implementation. I think I can simplify the code to this:

sub _matchingKeywords($@) {
    my ($self, $source) = (shift, shift);
    my $text = join ' ', @_;

    my $wordlists = $source->{wordlists}
        or return ();

    my %matches;
    foreach my $wl (@$wordlists) {
        my $has_wl = $wl->{match};

        my (%found, %also);
        $found{lc $_}++ for $text =~ /$has_wl/g;
        keys %found or next;

        if(my $and_wl = $wl->{and_match})
           $also{lc $_}++ for $text =~ /$and_wl/g;

           # Undocumented: use second list as exclude on the first:
           # overlapping words will be ignored.
           delete $also{$_} && delete $found{$_} for keys %found;
        }

        $matches{$_}++ for keys %found, keys %also;
    }

    [ keys %matches ];
}

-- greetz,

           MarkOv

   Mark Overmeer MSc                                MARKOV Solutions
   Mark@Overmeer.net                          solutions@overmeer.net

http://Mark.Overmeer.net http://solutions.overmeer.net

markov2 commented 3 years ago

Fixed released today in 3.7.5