mfzl opened this issue 6 years ago
No, if you mean proper aliasing.
You can generate a hash collision and find a key that collides with another.
Closing as stale.
I'd also be interested in this
@hobbs If you are interested in this feature we need to think about a solution. Can you propose something?
@janisz I did a small investigation this morning. I was interested in whether it could be done via the hash function. A small impl is here: https://gist.github.com/cristaloleg/fad647add6226659876fc716796f2a3c
but it would be much simpler to add a new map to each shard to store aliases, plus an additional exported SetAlias method (on both the cache and the shard).
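For illustration, a minimal sketch of the alias-map idea written as a wrapper around bigcache's public API rather than the shard internals; AliasedCache, SetAlias, and the alias map are hypothetical names, not part of bigcache:

```go
package aliascache

import (
	"sync"
	"time"

	"github.com/allegro/bigcache"
)

// AliasedCache resolves aliases to canonical keys before delegating to bigcache.
type AliasedCache struct {
	cache   *bigcache.BigCache
	mu      sync.RWMutex
	aliases map[string]string // alias -> canonical key
}

func New(eviction time.Duration) (*AliasedCache, error) {
	c, err := bigcache.NewBigCache(bigcache.DefaultConfig(eviction))
	if err != nil {
		return nil, err
	}
	return &AliasedCache{cache: c, aliases: map[string]string{}}, nil
}

// Set stores an entry under its canonical key.
func (a *AliasedCache) Set(key string, entry []byte) error {
	return a.cache.Set(key, entry)
}

// SetAlias registers an alternative name for an existing canonical key.
func (a *AliasedCache) SetAlias(alias, key string) {
	a.mu.Lock()
	a.aliases[alias] = key
	a.mu.Unlock()
}

// Get resolves the name through the alias map first, then falls back to
// treating it as a canonical key.
func (a *AliasedCache) Get(name string) ([]byte, error) {
	a.mu.RLock()
	key, ok := a.aliases[name]
	a.mu.RUnlock()
	if !ok {
		key = name
	}
	return a.cache.Get(key)
}
```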
I'd like to start with defining the API. How will the user add an alias? What is the use case for this feature and what are the requirements? Then we can think about whether this could be done by extending the hash function with an alias map, or whether we need to add an alias map as an independent component.
I'm realizing that my requirements are more complicated than I originally thought. I'm actually looking for an alias key that points to an array of values, essentially introducing the concept of a many to many relationship. It's more of an index for specific fields in my value set.
Anyway, I think my requirements are out of scope; sorry for waking up this ticket :)
@hobbs would you be able to expand on your requirements either in this issue or another? I'd be curious to see where you see value in many-many and what you think that would add to this project.
Not my exact scenario, but an example:
So it could look something like:
// Third argument takes in additional tags
cache.Set(url, pageResponse, []string{"article-layout", "article-id-123"})
cache.Set(url2, pageResponse2, []string{"article-layout", "article-id-456"})
...
cache.RemoveByTag("article-layout")
It's similar to what some CDNs do with surrogate keys, cache tags, etc.:
Fastly, part 1: https://www.fastly.com/blog/surrogate-keys-part-1
Fastly, part 2: https://www.fastly.com/blog/surrogate-keys-part-2
Cloudflare: https://support.cloudflare.com/hc/en-us/articles/206596608-How-to-Purge-Cache-Using-Cache-Tags-Enterprise-only-
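As a rough sketch (not bigcache API), here is one way that tag interface could be layered on top of bigcache without touching its internals, using a mutex-guarded tag -> keys index; TaggedCache, SetWithTags, and RemoveByTag are hypothetical names:

```go
package taggedcache

import (
	"sync"

	"github.com/allegro/bigcache"
)

// TaggedCache keeps a tag -> keys index next to the cache so entries can be
// purged by tag.
type TaggedCache struct {
	cache *bigcache.BigCache
	mu    sync.Mutex
	byTag map[string]map[string]struct{} // tag -> set of keys
}

func New(cache *bigcache.BigCache) *TaggedCache {
	return &TaggedCache{cache: cache, byTag: map[string]map[string]struct{}{}}
}

// SetWithTags stores the entry and records every tag that refers to it.
func (t *TaggedCache) SetWithTags(key string, entry []byte, tags []string) error {
	if err := t.cache.Set(key, entry); err != nil {
		return err
	}
	t.mu.Lock()
	for _, tag := range tags {
		if t.byTag[tag] == nil {
			t.byTag[tag] = map[string]struct{}{}
		}
		t.byTag[tag][key] = struct{}{}
	}
	t.mu.Unlock()
	return nil
}

// RemoveByTag deletes every key that has been associated with the tag.
func (t *TaggedCache) RemoveByTag(tag string) {
	t.mu.Lock()
	keys := t.byTag[tag]
	delete(t.byTag, tag)
	t.mu.Unlock()
	for key := range keys {
		_ = t.cache.Delete(key) // an "Entry not found" error just means it already expired
	}
}
```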
The use case I have is to allow cache lookups with two different unique identifiers. A common example is looking up cached user information by both ID and username/email.
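As a concrete usage of the hypothetical AliasedCache wrapper sketched earlier (assuming that sketch), the same user entry could be reachable by ID and by email:

```go
// Hypothetical usage of the AliasedCache sketch: store the user record under
// its ID and make the email address an alias for the same entry.
func cacheUser(users *AliasedCache) ([]byte, error) {
	userJSON := []byte(`{"id":"42","email":"jane@example.com"}`)
	if err := users.Set("user:42", userJSON); err != nil {
		return nil, err
	}
	users.SetAlias("email:jane@example.com", "user:42")
	// Both "user:42" and "email:jane@example.com" now resolve to the same entry.
	return users.Get("email:jane@example.com")
}
```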
@hobbs Do you need to query by tag? @faxal Is the second ID added after or when the value is inserted?
@janisz the latter; keys will be set when the cache entry is inserted.
@hobbs is the use case just removal? How do you see updates being handled? i.e., if you updated an existing record with cache.Set(url, updatedPageResponse, []string{"article-layout", ...}), what happens if there is now an orphaned tag? Would you require the tags to match when updating an existing record? What about changing tags? How would you see updating the tags on an item without updating the item? Is that a use case?
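One possible answer to the orphaned-tag question, sketched as a standalone index independent of bigcache: keep the reverse key -> tags mapping as well, so replacing an entry's tags drops the stale associations and removes tags that no longer point at any key. All names here are hypothetical:

```go
package tagindex

import "sync"

// tagIndex keeps both directions of the tag<->key relationship so that
// replacing an entry's tags cannot leave a tag pointing at keys that no
// longer carry it.
type tagIndex struct {
	mu    sync.Mutex
	byTag map[string]map[string]struct{} // tag -> keys
	byKey map[string]map[string]struct{} // key -> tags
}

func newTagIndex() *tagIndex {
	return &tagIndex{
		byTag: map[string]map[string]struct{}{},
		byKey: map[string]map[string]struct{}{},
	}
}

// Retag replaces the tags for a key. Tags that lose their last key are
// deleted outright, which is one way of handling the orphaned-tag case.
func (t *tagIndex) Retag(key string, tags []string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	// Drop the old associations for this key.
	for tag := range t.byKey[key] {
		delete(t.byTag[tag], key)
		if len(t.byTag[tag]) == 0 {
			delete(t.byTag, tag)
		}
	}
	// Record the new associations in both directions.
	t.byKey[key] = map[string]struct{}{}
	for _, tag := range tags {
		if t.byTag[tag] == nil {
			t.byTag[tag] = map[string]struct{}{}
		}
		t.byTag[tag][key] = struct{}{}
		t.byKey[key][tag] = struct{}{}
	}
}
```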
Currently we only block with mutexes when necessary and abstract that away from users. Updating tags and references, in my mind, is a blocking update as we need to traverse a structure to resolve relationships on-the-fly before we can take any action.
My primary concern here is performance. Managing tags isn't terribly hard, and there are several ways to do it - graphs, slices, trees, structs, etc. The problem is that traversing the various structures required for sane tag management ranges from O(1) to O(n log(n)) or even O(n^x) at worst. It's really hard to be performant across millions of items when you have to traverse a graph with O(n^x) complexity - possible, just very difficult.
Having thought about this all morning (I have OCD), if we were to focus on this use case, which I believe could be important, I think a graph would likely be the easiest way to manage many<->many relationships that you would see with tags. I am also open to better ideas. Realistically, as long as we don't make an attempt to access any cache data when resolving tags, we could optimise certain operations such as loading, storing, and traversing a node. Depending on the current state of compilers (gccgo or something else), we also might be able to heavily optimise at compilation time so long as we are careful about loop implementations and provide this library as a plugin.
With that in mind, the well-known graph search algorithms would be way too slow to traverse 100k objects to resolve tag relationships. Here are some algorithms I've been looking at that could perform better when traversing graphs, along with my findings.
Algorithm: BC-GA: A Graph Partitioning Algorithm for Parallel Simulation of Internet Applications
Abstract: Static task mapping of parallel simulation is studied as a graph partitioning problem (GPP). However, the existing algorithms are primarily designed for partitioning finite element meshes or random graphs which are essentially different from the Internet-like topologies used in the field of large scale network simulation. In this paper, we present a new genetic algorithm, BC-GA, for effectively partitioning Internet-like topologies based on boundary crossing, a quite different principle inspired by the analysis of characteristic of the Internet topology and its related solutions. All operations of this algorithm are novel, such as pizza-cutting and autogamy propagation. We test this algorithm on a large extent of graphs, including the snapshots of the real AS-level Internet, the topologies produced by the Internet model generator and many traditional benchmark graphs. The experiment shows that our algorithm can outperform traditional approaches on partitioning Internet-like topologies and it is also better than or comparable to those on traditional GPP. The experiment also shows that a genetic algorithm can converge quickly if the domain knowledge is fitly combined.
Why: This algorithm is focused on concurrent partition searching. If we are trying to search for a partition (tag), being able to search and resolve all partitions (tags) concurrently would be helpful. This algorithm is also designed for scale, so performance shouldn't be too bad.
Algorithm: An Adaptive Parallel Algorithm for Computing Connected Components
Abstract: We present an efficient distributed memory parallel algorithm for computing connected components in undirected graphs based on Shiloach-Vishkin's PRAM approach. We discuss multiple optimization techniques that reduce communication volume as well as load-balance the algorithm. We also note that the efficiency of the parallel graph connectivity algorithm depends on the underlying graph topology. Particularly for short diameter graph components, we observe that parallel Breadth First Search (BFS) method offers better performance. However, running parallel BFS is not efficient for computing large diameter components or large number of small components. To address this challenge, we employ a heuristic that allows the algorithm to quickly predict the type of the network by computing the degree distribution and follow the optimal hybrid route. Using large graphs with diverse topologies from domains including metagenomics, web crawl, social graph and road networks, we show that our hybrid implementation is efficient and scalable for each of the graph types. Our approach achieves a runtime of 215 seconds using 32K cores of Cray XC30 for a metagenomic graph with over 50 billion edges. When compared against the previous state-of-the-art method, we see performance improvements up to 24x.
Why: This is likely a more simplistic representation of a concurrent/PRAM approach at a breadth-first search of a graph. If we focus on betweenness (tag correlation), this could be a good algorithm that would let us traverse all the nodes quickly, using a set as a correlation of visited nodes. The performance on this should be okay.
Maybe this should be a proposal...
Sorry for not getting back to you on your questions sooner. I think I would only be using the tags for deletion. And yes, orphaned tags are definitely a concern. I may try to solve this in a naive way in the meantime, using a separate data structure to let me look up BigCache hashes by tag.
@hobbs yeah, a naive way is best for now if you need something sooner. Not sure when I'll get around to PRAM stuff.
This has been on my mind for a while, though I haven't made any progress on it. I still think a graph structure is the right way to go; I'm just thinking through the details of that structure.
Is it possible to have two keys point to the same cache entry?