Designing a correct and efficient search predicate requires too much knowledge of how result and attribute caching works. We should improve this, or at least document it.
Filters that do expensive preprocessing should receive only fixed or infrequently-changed arguments. Otherwise, subsequent searches will frequently need to rerun the filter. Drop decisions should be made by separate, inexpensive filters.
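The split can be sketched as follows. This is an illustrative Python sketch, not the real Diamond filter API; the function names, the attribute dictionary, and the attribute key are all made up for illustration:

```python
# Hypothetical sketch: split an expensive, fixed-argument filter from a
# cheap, argument-driven drop filter. Names and interfaces are illustrative.

def expensive_feature_filter(obj, attrs):
    """Takes only fixed arguments, so its cached output stays valid
    across searches. Never drops; it only emits an attribute."""
    # Stand-in for costly preprocessing (e.g. computing a descriptor).
    attrs["feature.score"] = sum(obj["pixels"]) / len(obj["pixels"])
    return True

def threshold_drop_filter(obj, attrs, threshold):
    """Cheap filter holding the frequently-changed argument (the
    threshold); only this filter needs rerunning when it changes."""
    return attrs["feature.score"] >= threshold

obj = {"pixels": [10, 20, 30]}
attrs = {}
expensive_feature_filter(obj, attrs)
passed = threshold_drop_filter(obj, attrs, threshold=15)  # → True
```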
The obvious workaround is to emit intermediate results from expensive filters, then perform the argument-driven comparisons and the drop decision in a cheap final filter. However, executing any filter in a searchlet causes the object to be fetched and the RGB filter to be rerun (#16), so this is inefficient.
The most efficient approach is to reduce each value of interest to a score emitted by a filter, then make drop decisions using score thresholds in the searchlet definition. This allows drops to be computed entirely in the result cache.
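To illustrate why thresholded scores cache so well, here is a toy model of a result cache keyed by (filter name, object id). It is a sketch of the idea only, not Diamond's actual cache implementation: once a score is cached, changing the threshold in the searchlet definition never forces the filter to rerun.

```python
# Toy result cache: the score computation is cached per
# (filter_name, object_id); the drop decision is just a comparison of
# the cached score against the searchlet's threshold.

result_cache = {}

def cached_score(filter_name, object_id, compute, calls):
    key = (filter_name, object_id)
    if key not in result_cache:
        calls.append(key)            # record actual filter executions
        result_cache[key] = compute()
    return result_cache[key]

calls = []
# First search, threshold 0.5: filter runs once.
score = cached_score("edge", "obj1", lambda: 0.7, calls)
drop1 = score < 0.5
# Second search with a different threshold: the cached score is reused,
# so the drop decision is computed without rerunning the filter.
score = cached_score("edge", "obj1", lambda: 0.7, calls)
drop2 = score < 0.9
```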
Any filter that might meaningfully be run more than once in a searchlet (with different arguments) must embed its "filter name" argument in the names of its output attributes, to prevent the attributes from being clobbered. To find these attributes, a downstream filter must receive the upstream filter's name as an argument.
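A sketch of the naming convention, again with illustrative (not real) filter interfaces: the same filter code is instantiated twice under different names, and the downstream filter is told which instance's output to read.

```python
# Sketch: a filter instantiated twice with different arguments embeds
# its instance name in its output attribute names, so the two instances
# do not clobber each other's attributes.

def scale_filter(instance_name, obj, attrs, factor):
    attrs[f"{instance_name}.scaled"] = obj["value"] * factor
    return True

def compare_filter(obj, attrs, upstream_name, limit):
    # The downstream filter locates the upstream output via the
    # upstream instance name it received as an argument.
    return attrs[f"{upstream_name}.scaled"] <= limit

obj = {"value": 5}
attrs = {}
scale_filter("scale2", obj, attrs, factor=2)    # writes scale2.scaled
scale_filter("scale10", obj, attrs, factor=10)  # writes scale10.scaled
ok = compare_filter(obj, attrs, upstream_name="scale2", limit=20)
```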
The predicate infrastructure encourages example-based filters to receive their example in the blob argument. This violates the first rule above. In addition, the example evaluation will run once per core per search, and is uncachable.
At a minimum, the example comparison should be done in a separate filter from the object evaluation, so that the latter is cachable.
If the Diamond application builds searchlets by hand (rather than using a predicate) and the example evaluation is expensive, the application should run a search in two steps: evaluate the example to obtain attribute values, then use these as filter arguments (or score thresholds) during the real search.
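The two-step pattern can be sketched like this (hypothetical function names; in a real application, step 1 would run in the client or a setup phase and step 2 would be the actual Diamond search):

```python
# Sketch of the two-step approach: step 1 evaluates the example once;
# step 2 runs the real search with the extracted values as plain filter
# arguments, so no per-object work depends on the example blob.

def evaluate_example(example):
    """Expensive; performed once, before the real search starts."""
    return {"mean": sum(example) / len(example)}

def object_filter(obj, example_mean, tolerance):
    """Cheap per-object filter parameterized by the precomputed value."""
    obj_mean = sum(obj) / len(obj)
    return abs(obj_mean - example_mean) <= tolerance

# Step 1: evaluate the example once.
params = evaluate_example([1, 2, 3])          # mean = 2.0
# Step 2: use the result as an ordinary filter argument.
results = [obj for obj in ([1, 3, 2], [10, 10, 10])
           if object_filter(obj, params["mean"], tolerance=0.5)]
```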
If predicates must be used, the example-evaluation filter can store properties of the example as output attributes of every object it evaluates; these are then read by a comparison filter. These attributes are cachable between searches, at least for objects we have already seen. However, the example evaluation still runs once per core in searches that encounter new objects. It also pollutes the object with attributes not really related to that object at all; these can be masked from the client by marking them "omit".
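The predicate-compatible variant can be sketched as follows. The function names are illustrative, and the "omit." prefix is only a stand-in convention for Diamond's "omit" marking, not the real mechanism:

```python
# Sketch: the example-evaluation filter attaches properties of the
# *example* as output attributes on every object, so a separate cheap
# comparison filter can read them from the attribute cache.

def example_eval_filter(obj, attrs, example):
    # Expensive example evaluation; writing its result as an object
    # attribute makes it cachable per object, even though it actually
    # describes the example rather than the object.
    attrs["omit.example.mean"] = sum(example) / len(example)
    attrs["obj.mean"] = sum(obj) / len(obj)
    return True

def comparison_filter(obj, attrs, tolerance):
    return abs(attrs["obj.mean"] - attrs["omit.example.mean"]) <= tolerance

attrs = {}
example_eval_filter([4, 6], attrs, example=[5, 5])
keep = comparison_filter([4, 6], attrs, tolerance=1.0)
# Attributes marked "omit" would be hidden from the client.
visible = {k: v for k, v in attrs.items() if not k.startswith("omit.")}
```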