StringPropertyFilter.exactMultiple helper for varargs queries

maltek commented 2 years ago

We discussed the bad performance of the varargs queries, so here's my attempt to fix that. But I haven't actually witnessed performance problems so this is mostly based on assumptions.

use index if possible
use a mutable HashSet instead of immutable
Lazily create the HashSet only on the second iteration. Definitely a win when there's zero iterations. Since single-iteration is relatively common with .where and friends it seemed worth it to do delay set creation a bit more and take the potential hit of a single sequential scan.
I thought about not using sets at all when we only have very few values to search for but some small bit of microbenchmarking convinced me that's a bad idea.

mpollmeier commented 2 years ago

I think it's a step forward. Would be interesting to get a rough idea what difference it makes for some typical traversals. It's making the code a bit harder to read, but if you copy part of the PR descriptions into the code that'd help. Especially the part about why we're initially using a Seq and only converting it into a Set eventually.

Separate random thought: how about we replace (or rather extend) the part convert to set if iteration >= 2 with convert to set if needles.size > X?

maltek commented 2 years ago

The index lookup feature excluded, i think the rest of the PR only exists to optimize cases like: cpg.call.where(_.nameExact("a,"b","c")). In this case we never build up the HashSet but rather do the linear probing in needles over and over again. We can only avoid this by creating the set externally and than pass it to a nameExact(set: HashSet). So yeah this make the suboptimal case better but i am not really convinced.

The thing is - code like that exists and I'd rather change a few lines here than a lot of them elsewhere.

(Not creating a set in case of an empty input traversal and using a mutable Set should also be general improvements. It's only really the >= 2 that's specific to this pattern.)

Would be interesting to get a rough idea what difference it makes for some typical traversals.

I don't have benchmarks for that yet. For full sptests runs on a busy machine it's in the noise.

The story behind this PR is that Markus told me not to use the varargs methods because they are slow and bad. That made me look at the implementation and come up with a bunch of improvements... So I don't have a motivating pathological example immediately at hand.

But I can come up with some measurements if you want me to.

It's making the code a bit harder to read, but if you copy part of the PR descriptions into the code that'd help. Especially the part about why we're initially using a Seq and only converting it into a Set eventually.

There is a comment about that in lines 93/94. Should I move that further down, or is it missing something crucial?

Separate random thought: how about we replace (or rather extend) the part convert to set if iteration >= 2 with convert to set if needles.size > X?

That was what I tried to say in the last bullet point - in my microbenchmarking hashset.contains is always faster than seq.contains if you call it often enough irregardless of the number of items they contain. That was also surprising to me - I was benchmarking with the intention to find a reasonable cut-off point. (Should I add a comment saying that this is not worth it?)

mpollmeier commented 2 years ago

Markus told me not to use the varargs methods because they are slow and bad

I would have been surprised if this has any significant impact on the overall sptests performance, but it's good to have anyway :)

Should I add a comment saying that this is not worth it?

Yes please, I think adding that info from the last bullet point would be good to have as a comment. Other than that, I'm 👍🏻 for this.

ml86 commented 2 years ago

Than lets merge this. After all it is in a very encapsulated place and it wont hurt.

ShiftLeftSecurity / overflowdb

StringPropertyFilter.exactMultiple helper for varargs queries #316