cikl / threatinator

Handling Unicode in a feed #2

Open pierre427 opened 10 years ago

pierre427 commented 10 years ago

Feed config:

provider "dragon"
name "ssh_ip_reputation"
fetch_http('http://www.dragonresearchgroup.org/insight/sshpwauth.txt')

feed_re = /(?<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/

filter_whitespace
filter_comments

parse_eachline(:separator => "\n") do |event_generator, record|
  m = feed_re.match(record.data)
  next if m.nil?

  event_generator.call() do |event|
    event.type = :scanning
    event.add_ipv4(m[:ip]) do |ipv4_event|
    end
  end
end

Feed data where it breaks:

21826        |  Corporación Telemic C.A.,VE    |   200.75.106.101  |  2014-06-26 09:55:04  |  sshpwauth

Error message:

dragon,ssh_ip_reputation,scanning,66.181.8.250,,,,,,,
/root/threatinator/lib/threatinator/filters/whitespace.rb:14:in `match': invalid byte sequence in US-ASCII (ArgumentError)
        from /root/threatinator/lib/threatinator/filters/whitespace.rb:14:in `filter?'
        from /root/threatinator/lib/threatinator/feed_runner.rb:66:in `block in parse_record'
        from /root/threatinator/lib/threatinator/feed_runner.rb:66:in `each'
        from /root/threatinator/lib/threatinator/feed_runner.rb:66:in `any?'
        from /root/threatinator/lib/threatinator/feed_runner.rb:66:in `parse_record'
        from /root/threatinator/lib/threatinator/feed_runner.rb:54:in `block in run'
        from /root/threatinator/lib/threatinator/parsers/getline.rb:89:in `block in each'
        from /root/threatinator/lib/threatinator/parsers/getline.rb:83:in `loop'
        from /root/threatinator/lib/threatinator/parsers/getline.rb:83:in `each'
        from /root/threatinator/lib/threatinator/feed_runner.rb:53:in `run'
        from /root/threatinator/lib/threatinator/runner.rb:41:in `run'
        from /root/threatinator/lib/threatinator/cli.rb:51:in `do_run_command'
        from /root/threatinator/lib/threatinator/cli.rb:113:in `block (3 levels) in process!'
        from /usr/local/lib/ruby/gems/1.9/gems/slop-3.5.0/lib/slop.rb:260:in `call'
        from /usr/local/lib/ruby/gems/1.9/gems/slop-3.5.0/lib/slop.rb:260:in `parse!'
        from /usr/local/lib/ruby/gems/1.9/gems/slop-3.5.0/lib/slop.rb:235:in `parse!'
        from /usr/local/lib/ruby/gems/1.9/gems/slop-3.5.0/lib/slop.rb:65:in `parse!'
        from /root/threatinator/lib/threatinator/cli.rb:85:in `process!'
        from bin/threatinator:5:in `<main>'
root@threatinator:~/threatinator # 

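It looks like the non-ASCII character in the feed line (the "ó" in "Corporación") breaks parsing: the whitespace filter's regex match raises because the line is being treated as US-ASCII.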
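
For anyone trying to reproduce this outside threatinator, here is a minimal sketch. It assumes the feed bytes are UTF-8 and that the string ends up tagged as US-ASCII (probably because Ruby's default external encoding came from a C/POSIX locale). The regexes and the helper names (`ascii_tagged_line`, `match_error`, `extract_ip`) are hypothetical stand-ins, not code from threatinator.

```ruby
# Hypothetical reproduction; nothing here is copied from threatinator.

# The offending feed line (shortened). Assumption: the bytes are UTF-8
# but the string is tagged US-ASCII.
def ascii_tagged_line
  "21826 | Corporaci\xC3\xB3n Telemic C.A.,VE | 200.75.106.101"
    .dup.force_encoding(Encoding::US_ASCII)
end

# Runs a whitespace-style regex against the line and returns the error
# message if matching raises, or nil if it does not.
def match_error(line)
  /^\s*$/.match(line)
  nil
rescue ArgumentError => e
  e.message
end

# Pulls the first dotted-quad out of a line, mirroring the feed's regex.
def extract_ip(line)
  line[/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/]
end

line = ascii_tagged_line
puts line.valid_encoding?        # false: 0xC3 0xB3 are not valid US-ASCII bytes
puts match_error(line).inspect   # the message from the failed match, if any

# Re-tagging the same bytes as UTF-8 (no transcoding) makes the line usable.
utf8 = line.dup.force_encoding(Encoding::UTF_8)
puts utf8.valid_encoding?
puts extract_ip(utf8)
```

The point is that the bytes themselves are fine; only the encoding label on the string is wrong, so any regex match against it can fail.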

justfalter commented 10 years ago

Yup, that's undeniable. I remember thinking "maybe I should add encoding handling..." when I was starting on this.

We'll likely have to add a per-feed option to specify the source encoding.
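
A rough sketch of what such a per-feed encoding option could do to each record before the filters run. `normalize_record` and its `source_encoding` parameter are hypothetical; threatinator has no such helper as far as this thread shows. It also relies on `String#scrub`, which needs Ruby 2.1 or later, newer than the 1.9 install in the trace above.

```ruby
# Hypothetical helper, not part of threatinator's API.
# Tags the raw bytes with the feed's declared encoding, transcodes to UTF-8,
# and replaces anything that still cannot be represented with "?".
def normalize_record(raw, source_encoding = Encoding::UTF_8)
  tagged = raw.dup.force_encoding(source_encoding)
  utf8 = tagged.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
  # Transcoding UTF-8 to UTF-8 may leave invalid bytes untouched, so scrub too.
  utf8.scrub("?")
end

# A feed declared as ISO-8859-1: the single byte 0xF3 is "ó".
latin1_bytes = "Corporaci\xF3n Telemic C.A.,VE | 200.75.106.101".b
record = normalize_record(latin1_bytes, Encoding::ISO_8859_1)

puts record.valid_encoding?
puts record[/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/]
```

Normalizing everything to UTF-8 up front would mean the filters and parsers never have to care what encoding a given feed was published in.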