CybOXProject / mixbox

A library of common code leveraged by python-cybox, python-maec, and python-stix
BSD 3-Clause "New" or "Revised" License
8 stars 15 forks source link

Added cache module #11

Closed bworrell closed 9 years ago

bworrell commented 9 years ago

This PR provides mixin classes to use for object caching and faster object dereferencing.

Added:

gtback commented 9 years ago

I have mixed feelings about these changes. On one hand, it's definitely something that's useful. On the other, it's not something that you can currently configure (AFAICT, items will automatically get added to the cache regardless of whether it's needed or not). In Python-cybox, the caching behavior allows use of different backends for looking up cached data (such as a database); it seems this is solving a slightly different problem.

The timestamp attribute causes all kinds of problems. I would prefer if the cache had a simple get(key) interface, where key can be either just the ID or a tuple of ID/timestamp. How to deal with mismatched timestamps is another issue I haven't fully thought through. Perhaps a two-level cache?

bworrell commented 9 years ago

@gtback, sorry for the long comment :-\ Caching isn't easy...who knew? ;)

First off, thanks for looking at the PR!

I'm not sure I understand the concerns about adding items to the cache whether they are needed or not. Are you concerned with memory usage? Lookup costs? It's important to note that I'm not caching objects, rather I am caching weak references to objects (which Python will remove when the gc is run), so this shouldn't suffer from the same memory growth issues that the python-cybox cache has.

As for timestamps, I agree--they totally mess up everything. I played around with the two-tier caching approach and really didn't like it. For me it became a pain to manage updating the cache and invalidating the previous cache keys. It also requires more manipulation of the classes that leverage the mixin, because they'll need to trigger cache updates (invalidate previous keys, add new keys) in their timestamp and id property setters (unless we use some __setattr__ magic) which also gets kinda weird in their __init__() methods if they set id and timestamp in them. All of that is why I landed on the simple(ish), "iterate over the cache of objects and inspect their properties" approach, rather than managing dictionaries that are keyed off of id and/or timestamp values.

I also played around with the standardized get() interface that took an id or tuple(id, timestamp) but ultimately didn't really like interacting with it. I ended up feeling confused about the behavior of not specifying a timestamp vs. specifying a None timestamp (should you return all objects with that id vs. only one object if it has a matching id and a timestamp value of None?). That's why I went with a cache_get_all() method and made cache_get() only yield results on matching id/timestamp pairs. I also preferred passing in kwargs over passing in a tuple argument, though I can't really explain why, heh. I guess the behavior just seemed clearer to me with kwargs.

Being able to swap out the backend for a database seems tricky to me, though I think it'd still be possible with the mixin approach (with some refactoring). Do we know of anyone using that feature in python-cybox?

Blargh, caching is a pain! ...but I want it so bad in python-stix :) This PR was a first shot, so (despite what this incredibly long post might seem to suggest) I'm totally psyched to iterate over different approaches and come up with a solid solution!

gtback commented 9 years ago

Ahhh, so I missed a few things... The use case in python-cybox (I forget who was asking about it) was almost exclusively for resolving references and building a more complete object graph during parsing. It essentially assumed IDs were immutable once set (which I think should be a safe assumption that we don't codify anywhere at the moment). While possible for external code to use the cache, that wasn't really the intent. For STIX (particularly with markings), I'm sure there are use cases the solution in python-cybox doesn't cover.

I also didn't realize there were per-class Caches in this code. I'm concerned (perhaps unnecessarily) that this might cause problems with subclasses. An ID (or ID/Timestamp combo) should be globally unique, so a simple get/set interface shouldn't have to be class-specific. The database backend is much easier with this approach than.

I definitely like the weakref approach more than a standard dictionary. I didn't know about that module at the time. I'm not overly concerned about performance. I'm just imagining use case where you want to use a cache for a specific period of time (when parsing a single, large document) and then throwing it away. I suppose the garbage collection will take care of this; I was just imagining a bit more fine-grained control.

You've got much more explicit use cases than I ever did, so I'd like to work off of those (rather than me just making them up). :grinning:

bworrell commented 9 years ago

Sorta kinda off-topic, but I was talking with @ikiril01 and pitched an idea for the get() method that might allow us to get away from the id/tuple and cache_get() vs. cache_get_all() stuff.

Something like...

@classmethod
def get(cls, id, **kwargs):
    values = lookup_values(cls, id)  # some id-lookup method

    if 'timestamp' not in kwargs:
        return values

    ts = kwargs['timestamp']
    for value in values:
        if value.timestamp == ts:
            return value

    return None  # or raise a KeyError or ValueError or CacheMiss or something.

Ignore the classmethod stuff for now, since we might not go with a per-class object cache :)

If a user provides a timestamp (even if it's None), there is a clear intention to retrieve an item with that id/timestamp pair. If no timestamp is provided, pull back everything with that ID. Maybe if there is one, just return the one rather than a list. That might be kinda weird...ANYWAY...just thought this might be neato.

gtback commented 9 years ago

I've been thinking more about this. Instead of having class-specific types for what fields get cached, why not just have a Cacheable class with a class attribute cache_fields as a list of fields to append into a Tuple, or as a _get_cache_value(self) function that returns a value (likely a tuple) to use as a cache key?

Just a couple other implementation thoughts.

landscape-bot commented 9 years ago

Code Health Repository health increased by 8% when pulling 141a211 on cache into 41d533c on master.

bworrell commented 9 years ago

@gtback, hmm I'm not sure I follow :) Any chance you could write up some code that demonstrates it?

bworrell commented 9 years ago

I was thinking about this caching business a bit while stuck in traffic on the way home from work tonight and threw together a new branch: https://github.com/CybOXProject/mixbox/tree/dictcache.

Basically, the cache isn't keyed off of classes anymore and is instead keyed off of ID values and can be futher filtered down by object-specific properties. I was thinking about the Django QuerySet API and how they do get() and filter() and kinda mimicked those methods here.

I dunno. Give it a look and lemme know what you think. I feel like if anything it might be a step in a good direction.

landscape-bot commented 9 years ago

Code Health Repository health increased by 8% when pulling bf0af99 on dictcache into 60782c8 on master.

bworrell commented 9 years ago

@gtback, I just decided to merge the dictcache branch (mentioned above) into cache since I like this better anyway :) It'd be awesome if you could take a look when you get a chance.

landscape-bot commented 9 years ago

Code Health Repository health increased by 8% when pulling de42eb5 on cache into 60782c8 on master.

landscape-bot commented 9 years ago

Code Health Repository health increased by 8% when pulling 51d4e1d on cache into 60782c8 on master.

landscape-bot commented 9 years ago

Code Health Repository health increased by 6% when pulling 51d4e1d on cache into 0f2ec1d on events.