Closed hitenkoku closed 2 years ago
As of v1.4.2, I confirmed an issue where running the same binary with the same rules against the same directory multiple times produces different detection counts. The Unique detection counts match, but the cause is still unclear. One possibility is that the ordering of records within the same time is not deterministic at run time. Below are the v1.4.2 results; v1.4.1 will be checked separately.
1st run (executed as hayabusa.exe -d hayabusa-sample-evtx -o main1.csv)

Results Summary:
Total events: 62168
Data reduction: 43137 events (69.39%)
Total detections: 31206
Total critical detections: 48
Total high detections: 6132
Total medium detections: 1283
Total low detections: 6666
Total informational detections: 17077
Unique detections: 519
Unique critical detections: 16
Unique high detections: 234
Unique medium detections: 141
Unique low detections: 75
Unique informational detections: 53
Date with most total critical detections: 2019-07-19 (14)
Date with most total high detections: 2016-09-20 (3653)
Date with most total medium detections: 2021-04-22 (146)
Date with most total low detections: 2016-09-20 (3780)
Date with most total informational detections: 2016-08-19 (2105)
2nd run (executed as hayabusa.exe -d hayabusa-sample-evtx -o main2.csv)

Results Summary:
Total events: 62168
Data reduction: 43139 events (69.39%)
Total detections: 31200
Total critical detections: 48
Total high detections: 6132
Total medium detections: 1282
Total low detections: 6662
Total informational detections: 17076
Unique detections: 519
Unique critical detections: 16
Unique high detections: 234
Unique medium detections: 141
Unique low detections: 75
Unique informational detections: 53
Date with most total critical detections: 2019-07-19 (14)
Date with most total high detections: 2016-09-20 (3653)
Date with most total medium detections: 2021-04-22 (146)
Date with most total low detections: 2016-09-20 (3780)
Date with most total informational detections: 2016-08-19 (2105)
The v1.4.1 results are as follows. The detection counts did not change across three runs, so we know the changes introduced in v1.4.2 are the cause.
Results Summary:
Total events: 62168
Data reduction: 43137 events (69.39%)
Total detections: 31212
Total critical detections: 48
Total high detections: 6132
Total medium detections: 1284
Total low detections: 6667
Total informational detections: 17081
Unique detections: 519
Unique critical detections: 16
Unique high detections: 234
Unique medium detections: 141
Unique low detections: 75
Unique informational detections: 53
Date with most total critical detections: 2019-07-19 (14)
Date with most total high detections: 2016-09-20 (3653)
Date with most total medium detections: 2021-04-22 (146)
Date with most total low detections: 2016-09-20 (3780)
Date with most total informational detections: 2016-08-19 (2105)
- 2nd run
Results Summary:
Total events: 62168
Data reduction: 43137 events (69.39%)
Total detections: 31212
Total critical detections: 48
Total high detections: 6132
Total medium detections: 1284
Total low detections: 6667
Total informational detections: 17081
Unique detections: 519
Unique critical detections: 16
Unique high detections: 234
Unique medium detections: 141
Unique low detections: 75
Unique informational detections: 53
Date with most total critical detections: 2019-07-19 (14)
Date with most total high detections: 2016-09-20 (3653)
Date with most total medium detections: 2021-04-22 (146)
Date with most total low detections: 2016-09-20 (3780)
Date with most total informational detections: 2016-08-19 (2105)
- 3rd run
Total events: 62168
Data reduction: 43137 events (69.39%)
Total detections: 31212
Total critical detections: 48
Total high detections: 6132
Total medium detections: 1284
Total low detections: 6667
Total informational detections: 17081
Unique detections: 519
Unique critical detections: 16
Unique high detections: 234
Unique medium detections: 141
Unique low detections: 75
Unique informational detections: 53
Date with most total critical detections: 2019-07-19 (14)
Date with most total high detections: 2016-09-20 (3653)
Date with most total medium detections: 2021-04-22 (146)
Date with most total low detections: 2016-09-20 (3780)
Date with most total informational detections: 2016-08-19 (2105)
@YamatoSecurity I checked. Looking at the changes since v1.4.1, the only one related to detection counts is #479. Most likely, the order in which data was inserted into the DashMap for the same second varied between runs, so aggregation-related rules sometimes fired and sometimes did not, causing the detection counts to fluctuate.
Improvements:
- When using the --update-rules option, the repository to update can now be specified with the --rules option. (https://github.com/Yamato-Security/hayabusa/issues/615) (@hitenkoku)
- Speed improvements through better parallel processing. (https://github.com/Yamato-Security/hayabusa/issues/479) (@kazuminn)
- Changed RulePath to RuleFile when the --output option is used. RuleFile now outputs only the file name in order to reduce the size of the output file. (https://github.com/Yamato-Security/hayabusa/issues/623) (@hitenkoku)
🐛 Bug fixes:
- Fixed a config folder loading error that occurred when running hayabusa with the cargo run command. (https://github.com/Yamato-Security/hayabusa/issues/618) (@hitenkoku)
I will review the implementation first and then think about countermeasures.
I see. Thank you for investigating the cause of the bug!
I was able to confirm this on my side as well. When I ran it three times in a row, the third run had 2 fewer detections. The following events were not detected:
2016-08-18 23:49:40.000 +09:00,IE10Win7,Sec,4624,info,,4764,Logon (Type 2 Interactive) *Credentials stored in memory*,User: IEUser | Comp: IE10WIN7 | IP-Addr: 127.0.0.1 | LID: 0x1b2fd,4624_LogonType-2-Interactive.yml,../hayabusa-sample-evtx/DeepBlueCLI/many-events-security.evtx
2016-08-19 00:29:29.859 +09:00,IE10Win7,Sec,4672,info,,4806,Admin Logon,User: IEUser | PrivList: SeSecurityPrivilege SeTakeOwnershipPrivilege SeLoadDriverPrivilege SeBackupPrivilege SeRestorePrivilege SeDebugPrivilege SeSystemEnvironmentPrivilege SeImpersonatePrivilege | LID: 0x1aae1,4672_AdminLogon.yml,../hayabusa-sample-evtx/DeepBlueCLI/many-events-security.evtx
Comparing against the 1.4.1 results, roughly every run misses a few events. I apologize for not noticing this during verification.
Thank you. I had not looked into which rules failed to detect, so this is very helpful. Looking at the rules you pointed out, they do not seem to use count or any other aggregation condition... The only condition operators used in those rules are and and not. In that case a single record is enough to make the judgment, so I do not understand why they were missed. I will look into it further.
Right. Some runs miss as many as 10 events, so count does not seem to be related. It may be some race condition that is hard to pin down.
It seems the processing I added is the cause of this issue. My apologies.
I found the cause. For some reason, the timeline sort processing was not merged into the v1.4.1 main branch. ↓ The sort processing: https://github.com/Yamato-Security/hayabusa/pull/622/files#diff-e6c361eb7b576e57173f4e9bce9f7b5c7d68031a3335154fe046417a4cae0fc9R246
I have not been able to figure out why it was not properly merged into the main branch.
@kazuminn Thanks for identifying the cause! No worries! So, if we fix the timeline sort, should everything be detected correctly?
@YamatoSecurity In theory, I believe it should work.
@YamatoSecurity Ah, it is 1.4.2. I was looking at the wrong version; the sort is actually included.
Please disregard what I said.
https://github.com/Yamato-Security/hayabusa/issues/639#issuecomment-1205064571
Regarding the place you pointed out: the timeline sort processing itself does appear to be included. (The code you pointed to is in v1.4.2; it is not in v1.4.1 at all.) In any case, that processing runs after the detection results have already been collected, so it seems unrelated to the detection counts differing between runs.
Yes, up to 1.4.1 it is OK, and in 1.4.2 and later events are randomly missed.
@YamatoSecurity This is a suggestion from my limited development experience, but since this is a parallelization bug and the fix looks like it will take time, how about reverting to the previous behavior for now and releasing a fixed version? That seems better for maintaining quality.
@kazuminn OK, I agree. Sorry for the extra work of doing the revert!
@kazuminn I tried to revert from the pull request, but other fixes are mixed in, so it cannot be done from the GitHub pull request page. Please create a branch and revert it by hand.
I have the same issue too.
IMHO, because the operation ↓ is not atomic, it seems to cause a race condition. https://github.com/Yamato-Security/hayabusa/blob/v1.4.3/src/detections/message.rs#L115-L123
For example, if the threads are executed in the following order, couldn't detect_info be lost?
I created a unit test as follows: https://github.com/fukusuket/hayabusa/commit/09a146666c12ee91a1de07cb0d155ae2f80f9be4 The test failed 1-2 times out of 10 on my Win10 (8-core/16GB RAM) and M1 Mac (8-core/8GB RAM).
If you have already investigated this (or I am misunderstanding), please ignore this comment🚀
@fukusuket Thank you so much for looking into this for us and for the unit test! I thought it was due to a race condition from the parallel processing, but we didn't know for sure. Do you have any ideas on how to fix it? If we can't fix it soon, we plan on reverting to how we did things before for the next 1.5.0 release so that we don't miss any events, and then adding the parallel processing back once it is stable.
Thank you for the reply😀
> Do you have any ideas on how to fix it?
I tried to fix it as follows: by using a Mutex lock (similar to version 1.4.1) to prevent the race condition.
And I just tested it with the following and it seems to work fine.
./hayabusa -d hayabusa-sample-evtx -o out.csv
I've only briefly tested it so far, so I'd like to send a pull request when testing is complete.
Thank you so much! That helps us out a bunch! Please send us a pull request when you are ready. I will be happy to help test it out.
@fukusuket Thank you for your comment and test case. I will be happy to help test it.
@YamatoSecurity The problem in this case came from removing the Mutex to speed up the processing. I would suggest checking whether the run time has changed significantly since v1.4.1, just to be sure.
Regarding fukusuket's fix: in this state there may be little reason to keep using DashMap. (Since we lock with a Mutex, the concurrent access DashMap is designed for no longer happens.)
The test fukusuket wrote is very helpful, so how about proceeding along the following lines?
@hitenkoku Thank you for review!
> Regarding fukusuket's fix: in this state there may be little reason to keep using DashMap.

That's true.
If I change the implementation (DashMap to hashbrown::HashMap) and there seems to be no problem, shall I send a pull request with the hashbrown::HashMap version?
Which implementation (DashMap or HashMap) is better for fixing this issue?
@fukusuket
Thank you for your comment.
I think hashbrown::HashMap is better than DashMap. DashMap is slow in this case because it does unnecessary locking for parallel processing.
👍 I will try the implementation with hashbrown::HashMap.
I think a BTreeMap is better than a HashMap. With a BTreeMap, keeping the data sorted also comes along with insertion. Even with a HashMap it is possible if you sort on every insert, but you will not know which one is faster until you experiment with it.
I did not understand the root cause of the DashMap issue, so I asked about it on Rust-jp. Now that I understand it, DashMap looks feasible after all, so I will try implementing it and check. https://rust-lang-jp.zulipchat.com/#narrow/stream/124300-questions/topic/DashMap.E3.81.AE.E6.80.AA.E3.81.97.E3.81.84.E6.8C.99.E5.8B.95/near/293088218
Thank you for the comment ^^
I tested the `entry(event_time).or_default()` implementation above, and the race-condition reproduction unit test never failed (even when running it more than 30 times).
@fukusuket-san, that sounds promising. If you like, would you write the PR? We'll support you.
👍 I will try the implementation with `entry(event_time).or_default()`, and after testing I'll gladly send a PR.
Sorry, the `entry(event_time).or_default()` implementation still causes a race condition😵 (In my dev environment, the race-condition reproduction unit test never fails when run from the IDE, but when I run `cargo test` from the terminal, the test fails randomly.)
```rust
pub fn insert_message(detect_info: DetectInfo, event_time: DateTime<Utc>) {
    let mut v = MESSAGES.entry(event_time).or_default();
    let (_, info) = v.pair_mut();
    info.push(detect_info);
}
```
I think the vector initialization is thread safe, but is `info.push(detect_info)` still not thread safe? Or maybe it is a race condition with another test that uses `insert_message`.
Could you please share the logs from `cargo test` when the test fails?
@fukusuket

> is `info.push(detect_info)` still not thread safe?

That part should be okay. `Entry::or_default` returns a `RefMut`, which contains a `RwLockWriteGuard<'a, HashMap<K, V, S>>` (source code). This means that the entry will be write-locked while the `RefMut` is alive. `info.push(detect_info)` is called before the `RefMut` is dropped, so the entry is write-locked, allowing only one thread to modify the entry.
Try running `cargo test` like the following:

```
cargo test -- --test-threads=1
```

Your test case modifies the `DashMap` instance bound to the static variable `MESSAGES`. By default, Cargo runs multiple tests in parallel, so if other tests also modify the same `DashMap` instance, these test cases will fail. Adding `--test-threads=1` (doc) forces Cargo to run only one test case at a time, so only one test case will modify the `DashMap` instance at a time. Note that it only controls how many test cases run in parallel, so your test case for race condition reproduction should still spawn multiple threads.
Thank you for the support. I didn't know cargo runs tests in parallel 😅
`cargo test -- --test-threads=1` failed (maybe another problem)...
```
fukusuke@fukusukenoAir hayabusa % cargo test -- --test-threads=1
    Finished test [unoptimized + debuginfo] target(s) in 0.17s
     Running unittests src/lib.rs (target/debug/deps/hayabusa-1fb759f713b8a86e)

running 205 tests
test afterfact::tests::test_emit_csv_display ... error: Found argument '--test-threads' which wasn't expected, or isn't valid in this context

        If you tried to supply `--test-threads` as a value rather than a flag, use `-- --test-threads`

USAGE:
    hayabusa.exe <INPUT> [OTHER-ACTIONS] [OPTIONS]

For more information try --help
error: test failed, to rerun pass '--lib'
```
But by running only the target test, the race condition no longer occurred😁
```
fukusuke@fukusukenoAir hayabusa % cargo test --package hayabusa --lib detections::message::tests::test_insert_message_race_condition -- --exact
    Finished test [unoptimized + debuginfo] target(s) in 0.30s
     Running unittests src/lib.rs (target/debug/deps/hayabusa-1fb759f713b8a86e)

running 1 test
test detections::message::tests::test_insert_message_race_condition ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 204 filtered out; finished in 0.05s
```
I ran the following command 5 times and confirmed that the number of detections is the same each time, so I will create a PR.
hayabusa -d ../hayabusa-sample-evtx -o ...
Attached are the command execution results. test-results.zip
Thank you everyone for your kind support🙇
I created a PR. I would appreciate it if you could test and review🙏
@fukusuket Thanks so much! I tested it and everything is working stable and the same results as 1.4.1.
PR:
Results Summary:
Total events: 5,232,442
Data reduction: 4,141,338 events (79.15%)
Total detections: 2,273,599
Total critical detections: 66 (0.00%)
Total high detections: 431,242 (18.97%)
Total medium detections: 184,057 (8.10%)
Total low detections: 128,959 (5.67%)
Total informational detections: 1,529,275 (67.26%)
Unique detections: 349
Unique critical detections: 10 (2.87%)
Unique high detections: 91 (26.07%)
Unique medium detections: 112 (32.09%)
Unique low detections: 77 (22.06%)
Unique informational detections: 59 (16.91%)
Elapsed Time: 00:16:25.306
Hayabusa 1.4.1:
Results Summary:
Total events: 5232442
Data reduction: 4140245 events (79.13%)
Total detections: 2273599
Total critical detections: 66
Total high detections: 431242
Total medium detections: 184057
Total low detections: 128959
Total informational detections: 1529275
Unique detections: 349
Unique critical detections: 10
Unique high detections: 91
Unique medium detections: 112
Unique low detections: 77
Unique informational detections: 59
Elapsed Time: 00:17:10.145
Note: the Data Reduction events total is different because of updated calculation.
@kazuminn @hitenkoku If the PR looks good to you two as well, let's merge it.
@YamatoSecurity I checked. LGTM.
Describe the bug
Different detection results on each run with the same rules and the same directory.
Steps to Reproduce:
hayabusa-1.4.2-win-x64.exe -d ../hayabusa-sample-evtx -o 1.csv
hayabusa-1.4.2-win-x64.exe -d ../hayabusa-sample-evtx -o 2.csv
Expected behavior
The same detection count on every run with the same rules and the same directory.