VirusTotal / yara-python

The Python interface for YARA
http://virustotal.github.io/yara/
Apache License 2.0
646 stars 179 forks source link

Fix #170 #172

Closed ruppde closed 3 years ago

ruppde commented 3 years ago

Speedup of *2-4 in 1000+ rules with metadata.

This PR isn't intented for merging, just to show the speedup of not running this foreach-block for every scanned file but only when needed.

Someone with c skills should do it properly ;)

google-cla[bot] commented 3 years ago

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

:memo: Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

wxsBSD commented 3 years ago

I think you have found a good opportunity to optimize. As I see it, we are creating all the parts that go into a match object even if the rule doesn't match AND the non-matching callback flag is not set. Basically, we go through all that work for non-matching rules only to never do anything with them. This is, indeed, a lot of unnecessary work for what is likely to be a common case. Excellent find!

I'm taking a closer look at this code now and hope to have either a comment on your PR or my own version of it up soon.

wxsBSD commented 3 years ago

Your PR is not quite right.

In the case of a non-match and no-callback the code is effectively doing nothing (this is correct). When a callback is provided the tag, meta and string lists are populated and the dictionary is created, then the callback is called (this is also correct).

In the case of a match we populate the tag, meta and strings list, then create the match object (this is also correct). But if a callback is provided you are double adding to the lists, creating the dictionary for the callback and then calling the callback.

It seems to me like a cleaner way to achieve the same thing is by checking if the message is a non-match and no callback is provided or if the user does not want to see non-matches, and returning early if that is the case. I put up a commit (https://github.com/wxsBSD/yara-python/commit/38063a87b4b5bbecefabc4e4a5fa3887c15fca90) which does this and I appreciate anyone's opinion on it. The goal is to only create the lists if we 1) have a match or 2) have a non-match and the user wants to see it via their callback.

Lastly, I'm having trouble replicating such large performance gains that you're claiming.

ruppde commented 3 years ago

yes, didn't think about the double adding.

I tested with the 2481 rules (10 meta: lines each) from https://valhalla.nextron-systems.com/ (click "demo" and "get rules") but you'll see the effect with any .yar with 1000+ rules and multiple lines of metadata. but your patch doesn't show the speedup.

the most elegant way of fixing it is imho to put these 3 foreach loops in a function and call it right before the data is needed. then the logic is only once in the code.

wxsBSD commented 3 years ago

Other than making the code a little cleaner I don't see how a function is going to do anything relevant. I'm having a hard time replicating your results and even understanding how your suggestion will work.

The code, as it is in master right now, only makes the tag, strings and meta lists once (https://github.com/VirusTotal/yara-python/blob/master/yara-python.c#L792). It does it in the case of both a matching rule and a non-matching rule. Then, if the rule is matching it makes a new match object and appends it into the list of matches (https://github.com/VirusTotal/yara-python/blob/master/yara-python.c#L851). Lastly, if there is a callback and the user specifies the appropriate which flag a new dictionary is populated before being sent to the user provided callback (https://github.com/VirusTotal/yara-python/blob/master/yara-python.c#L878). The metadata, strings and tags are populated for both matching AND non-matching rules, even if we are never going to call the callback for the non-matching ones (this is the part my patch is trying to target).

My patch is attempting to only make the tag, strings and meta lists if the rule is matching OR if it is non-matching and the user has specified a callback and that they want to see non-matching rules. In either case it still only populates the lists once, adds it to the list of matches if appropriate and then if there is a callback and the user wants to see this match (or non-match) a new dictionary is created and passed to the callback. Essentially, my diff is attempting to optimize the case where we have a non-match and the user wants to see ONLY matches in the callback, in which case we can avoid populating the lists only to do nothing with them.

Even if we solve the double add problem I am having a hard time seeing how your diff makes any difference here. We still need to spend time creating the tag, strings and meta lists for matches and non-matches (because we need to pass them to the callback). I think you might be seeing such a significant performance improvement because the majority of the rules are not matching on your samples (it isn't clear how many files you are scanning in total) so you're just not creating the lists and you're also not providing a callback so the double add isn't happening either. I think, in the case of a non-matching rule all the code in yara_callback is a never triggering any of the branches?

I still think there is an opportunity for a small optimization when there is a non-matching rule and the user provides a callback and only wants to see matches in the callback. In this case I still think we can return early and save a small bit of time creating those lists only to do nothing with them.

ruppde commented 3 years ago

See https://github.com/VirusTotal/yara-python/issues/170

google-cla[bot] commented 3 years ago

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

:memo: Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

ruppde commented 3 years ago

Now here's a better fix: If basically moved the 2 if blocks forward and store the results in

runblock_match = true;
runblock_callback = true;

Then the expensive foreach loops run only if needed:

if ( runblock_match || runblock_callback )

It's easy to check, that it does exactly the same as the old code, just lots faster (on 1000+ rules with meta: like e.g. https://valhalla.nextron-systems.com/)

wxsBSD commented 3 years ago

I believe your latest diff and my diff are accomplishing the same result, but I believe mine is a bit cleaner and easier to understand. Would you be willing to test mine? https://github.com/wxsBSD/yara-python/tree/non_match_perf

ruppde commented 3 years ago

yes, I've checked it and it doesn't help. You can easily check on your own: Download the 2400 valhalla rules, use whatever python scanner and compare outputs of

time ./my_favorite_yara_scanner.py valhalla-rules.yar ~/samples
wxsBSD commented 3 years ago

Please test the latest yara-python as my fix is now merged. I think this PR can be closed.

ruppde commented 3 years ago

This doesn't fix yara.Rules.match()

wxsBSD commented 3 years ago

This doesn't fix yara.Rules.match()

Can you provide more context on what you think is wrong?

ruppde commented 3 years ago

Dear author of https://github.com/VirusTotal/yara-python/pull/175,

two weeks ago I already got the gut feeling: "This guy is going to claim credit for finding this bug" (which has been lying there for years). When I thought about it, it were these reasons:

  1. You claimed multiple times, that could couldn't reproduce the problem, without even trying the demo code I provided.
  2. You neither gave the slightest hint how you tried to reproduce it (because it would have resulted in me helping you).
  3. So you've artificially created a problem ("this is very hard to reproduce")
  4. At that point it was a no brainer to check if you have a twitter account and tatata: https://twitter.com/wxs/status/1370217850212847618

@wxs: "The report was not easy to reproduce but I did find the bug quickly."

slow clap

@wxs: "This is why when you are reporting bugs be as explicit as you can and provide EXACT commands and input files."

Now this is over the border. You publicly blame me, who provided demo code, named rules plus the hint for "small files" (which is superfluously if someone has really understood the bug) for not reporting EXACT, while at the same time not using my information or giving a single word on how you tried to reproduce it.

Anyways, on with the thread. Your solution for your artificial problem: "increased the number of meta lines from 9999 to 99999", which is of course needed if you use files of 1MB size instead of using "small files" as proposed by me. (Which is by the way the total natural way of reproducing this bug because it's clear that the overhead happens per file. So the more files you scan per time, the clearer it shows. You could say: The bug scales with the number of files (and meta + tags in the rules).

If you would have really found the bug, you would have fixed it properly. If you'd run the code I've provided 20 days ago, you would see what's wrong.

Thanks for giving me credit the author of https://github.com/VirusTotal/yara-python/issues/170

P.S. Be excellent to each other.

ruppde commented 3 years ago

@plusvic The problem with https://github.com/VirusTotal/yara-python/pull/175/commits/38063a87b4b5bbecefabc4e4a5fa3887c15fca90 is, that it doesn't fix the problem if yara.Rules.match() is used without callback (See e.g. example code in https://github.com/VirusTotal/yara-python/issues/170). It's fixed for callback and without in https://github.com/2d4d/yara-python/commit/785e11c20177b777964376de0e538f5d22835fe8 but that's a bit slower with callback. The best fix performance wise would be to catch it in the beginning like in PR#175 but then you would have the logic redundant in 2 places. So I would propose even more revamping of the code to have the logic only once.

wxsBSD commented 3 years ago

I won’t be engaging in this discussion anymore beyond this reply. Your initial report was difficult for me to reproduce, and I did initially try your demo code. It was only when I couldn’t get it to work that I asked for more information, and even then I couldn’t get it to work. It was at this point that I read the code on the hunch that you probably did find something and maybe I could find it in the code. That is when I found the bug in the code and fixed it. However, I still could not reproduce it with your example so I then built my own.

As for my tweets: I gave you credit (not by name because at the time I didn’t know your name). I have always given credit where it is due. You found the problem but your solutions were not what ended up in the tree, it was my code that fixed it (again, with your help in bringing up the issue and pointing to the rough area in the code). Your initial reports were hard to process because they were lacking exact commands you ran and on which files, which is why I mention that bug reports need to be as thorough as possible to make it easier for people to work with you.

I am happy you brought up this issue. I wish it was easier for me to reproduce with your initial report, but the fact is it wasn’t easy for me to follow along with what you were doing in order to reproduce it. I do appreciate the bug report and the hints as to how to fix it.

I will no longer be responding to this thread. It isn’t productive to do so anymore. I’m sorry you took my words the wrong way.

ruppde commented 3 years ago

@wxsBSD So after I made a detailed bug description in https://github.com/VirusTotal/yara-python/issues/170#issuecomment-783980892 including the 3 loops slowing down the code, you "found the bug"???

Your initial reports were hard to process because they were lacking exact commands you ran and on which files, which is why I mention that bug reports need to be as thorough as possible to make it easier for people to work with you.

How dare you to repeat that I didn't give enough details??? There's 50 lines of demo code in https://github.com/VirusTotal/yara-python/issues/170 and you still don't tell me what's the problem with it? I named the used rules, any random directory of samples shows the effect.

I agree, there's no need for you to answer.

plusvic commented 3 years ago

@2d4d Both @wxsBSD and myself always give credits to people who contribute to this project, rest assured that in this case it won't be different. I don't think @wxsBSD's intention was to get credit from your work, I think he was trying to make your fix simpler.

However, you're right about #175 not fixing the issue completely for the non-callback case, which is the most common one. So I'm introducing 8b7890b489e4e0d63f6b8655341070f3b6c07cd8, which should solve it. Please try with this latest commit and tell me if everything goes fine.

ruppde commented 3 years ago

@plusvic Still doesn't speedup the case without callback, callback == NULL was missing:

  case CALLBACK_MSG_RULE_NOT_MATCHING:
    // If the rule doesn't match and the user has not specified that they want
    // to see non-matches then there's nothing more to do and we return
    // CALLBACK_CONTINUE, keep executing the function if otherwise.

    if (((which & CALLBACK_NON_MATCHES) != CALLBACK_NON_MATCHES) || callback == NULL )
      return CALLBACK_CONTINUE;
  }

Still don't like that way of fixing it because the logic is there twice. Would be better the other way around: If something was found, then do all the work.

Not giving credit was a side problem of @wxsBSD, how can he blame me for not giving enough information (please see https://github.com/VirusTotal/yara-python/issues/170, there's demo code, the rules are named, I name the exact lines of code that are the problem, ...) then claim that he "found the bug" despite my lousy bug report? This was "textbook claiming other peoples merits", I can look up a good book on "soft skills" about it if you like. deny the problem, delay, offer to use his tool, delay, claiming to have found the problem, hurray.

wxsBSD commented 3 years ago

I realize I said I wouldn't be responding anymore but hopefully this can be kept polite between us and I can give you more context to help us work better together in the future...

how can he blame me for not giving enough information

As I re-read #170 I believe there is missing context around exactly how you triggered it, primarily around how many files and the size of the files, from the report that made it difficult to reproduce. When submitting issues it is helpful to give as much information as possible, and especially so for performance issues where duplicating the issue can be difficult. It was hard for me to trigger this because I didn't know how many files and what size files to use when running your demo code (if that information even mattered, I suspected it didn't but without that information I have to do more work to rule it out). I believe in my tests I was using the rules you mention (and your exact code) with 200 files. I even went so far as to get 50 files that did match some of the rules on the off chance it was related to a mix of matching and non-matching files. These are all examples of things that were left out of your initial report that made it difficult for me to reproduce this initially.

It is also helpful to reduce demos down to the minimum possible. Testing with a single rule with lots of metadata (as you mention later in the thread) was much easier, especially when you shared how many files should be able to trigger it. Thank you for being patient with me and providing that smaller reproducer.

Reproduction was difficult because of the missing context, but you did pinpoint what was going on (creating lots of python objects unnecessarily). I credited you with finding the issue (bug, if you want to call it that) but your fixes were not right, even by your own admission, so I can't credit you with the final fix that went in the tree. I did want to give credit to you for finding the problem (and I did that). I would never claim credit for finding the underlying performance problem (you did that in #170) but I can give myself credit for the simpler fix, even if it was based upon your hints. It seems like you're stuck on the word "bug" and I apologize for a poor word choice on my part. As I said before, my intention was to give you credit for opening the issue and finding the problem but to also talk about the work I did in trying to reproduce it and eventually landing a simpler fix than the ones you proposed, even if my fix missed what is arguably the most important case (no callback provided). Sorry for missing that but it does bring up another point...

This doesn't fix yara.Rules.match()

This is an example of not very helpful statements. It sounds like you had more context as to WHY this wasn't a full fix and instead of giving helpful context you just hint at a problem and leave it up to someone else to try and find it. Even just describing the problem as "this doesn't fix calling yara.Rules.match() with no callback" would have been more helpful. As someone who tries to triage various problems when I have time at night I urge you to provide as much context up front when opening issues or making comments like the above. It will help reduce the time and effort required to transfer the context you have into whoever ends up working with you to address your issue.

As I said earlier, I am very happy that you 1) found this issue and 2) brought it up. Hopefully we can put this behind us and work together to improve YARA and our interactions here.

ruppde commented 3 years ago

@wxsBSD

ok, if you insist on continuing to blame me, I'm more than happy to continue this.

These are all examples of things that were left out of your initial report that made it difficult for me to reproduce this initially.

  1. with "all examples of things" you mean exactly 1 thing: not naming the used sample files.
  2. just used my demo code from #170 with the rules named in there on 50 files with 1 MB size (the size you used when you "finally found the bug" and above average) and the results are:
    One more round to rule out caching effects:
    scan took: 0.31 - matches: 0 with "rules stripped_meta" 
    scan took: 0.51 - matches: 0 with "rules original" 

    wow, it's 80% faster. point made: naming the samples doesn't make a difference here.

  3. apart from that, in my demo code the path to the samples is a command line argument and it's pretty easy to run it on 3 different directories. "hmm, not much difference on ~/Pictures but on /bin it's 3 times faster, wow."
  4. now why didn't you post the results you got?

I would never claim credit for finding the underlying performance problem. but I can give myself credit for the simpler fix ...

here's your tweet again:

Credit where due: the author of https://github.com/VirusTotal/yara-python/issues/170 spotted this. The report was not easy to reproduce but I did find the bug quickly. This is why when you are reporting bugs be as explicit as you can and provide EXACT commands and input files.

https://twitter.com/wxs/status/1370217850212847618

plusvic commented 3 years ago

@2d4d

@plusvic Still doesn't speedup the case without callback, callback == NULL was missing.

Oh my! I need some coffee. Added in 5c0aec3f65c660aa3e0dd7075f45551e722e5cd3.