sulir opened this issue 12 years ago
This reminds me of a prefix tree (trie) used in text searching. I imagine the decoder should work like this:
A trie node has the following:

- explore(input)
- annotation = decoded instruction

On an input, the decoding algorithm would match the longest prefix in the trie. If a prefix is matched and the node contains an annotation, we have the instruction. If the node does not have an annotation, we're expecting more input (cannot decide). If the input is longer than the longest prefix in the trie (we're at the end of it), we call the function explore with the unmatched input. This function expands the trie further from the node where we called it. If that's not possible, we throw a runtime exception (unknown instruction).
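The trie lookup described above could be sketched roughly like this (a minimal illustration only: the node fields follow the description, but all class and method names, and the use of strings for decoded instructions, are assumptions):

```java
// Hypothetical sketch of the trie-based decoder: nodes carry an optional
// annotation (the decoded instruction); decode() matches the longest
// annotated prefix of the input.
import java.util.HashMap;
import java.util.Map;

class TrieDecoder {
    static class Node {
        Map<Byte, Node> children = new HashMap<>();
        String annotation; // decoded instruction; null = cannot decide yet
    }

    private final Node root = new Node();

    // Inserts a byte pattern with its decoded form (stands in for explore()).
    void explore(byte[] pattern, String decoded) {
        Node node = root;
        for (byte b : pattern) {
            node = node.children.computeIfAbsent(b, k -> new Node());
        }
        node.annotation = decoded;
    }

    // Matches the longest annotated prefix of the input.
    String decode(byte[] input) {
        Node node = root;
        String lastAnnotation = null;
        for (byte b : input) {
            node = node.children.get(b);
            if (node == null) break; // end of trie: explore() would be called here
            if (node.annotation != null) lastAnnotation = node.annotation;
        }
        if (lastAnnotation == null)
            throw new RuntimeException("unknown instruction");
        return lastAnnotation;
    }

    public static void main(String[] args) {
        TrieDecoder t = new TrieDecoder();
        t.explore(new byte[]{(byte) 0x90}, "NOP");
        t.explore(new byte[]{(byte) 0x0F, (byte) 0x05}, "SYSCALL");
        System.out.println(t.decode(new byte[]{(byte) 0x90, 0x12}));        // NOP
        System.out.println(t.decode(new byte[]{(byte) 0x0F, (byte) 0x05})); // SYSCALL
    }
}
```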
But I think we don't need to tie ourselves to the caching approach (I mean dynamically expanding the trie). Instead, we can build the whole trie in advance from the specification file and use it for searching during decoding. Decoding would then mean just searching for the longest prefix of the input and emitting instructions.
Actually, maybe even better would be to use a DFA (deterministic finite automaton) as used in parsing, or, instead of generating Java code, to generate a parser grammar for some parser generator (JavaCC or ANTLR) which can do the job.
Hm, all the approaches I described above seem too complicated, both in implementation and in complexity. I think the caching advantage should mean much faster decoding, which means we need to know the input size in advance. I'm returning to the idea originally suggested by @sulir.
Can we assume an instruction has a size limit? I mean, can we always guarantee that the byte sequence of an instruction is of bounded length? Let's assume yes.
Then we can have several caches based on expected instruction size: Map[Integer, Map[Array[Byte], DecodedInstruction]].
By the assumption above, we know all possible instruction sizes. The number of instruction sizes is also finite (simply because the specification file doesn't allow an infinite number of rules). With these assumptions, the cache can be prepared in advance. If we want to save memory, we can also limit the cache to only certain instruction sizes.
The algorithm can work as follows:
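A minimal sketch of such a size-bucketed cache, under the assumptions above (all names are hypothetical; note that in Java a raw byte[] cannot serve as a HashMap key, since arrays don't have value-based equals/hashCode, so the bytes are wrapped in a ByteBuffer):

```java
// Hypothetical sketch: one byte-keyed map per known instruction size.
// Lookup tries each known size in ascending order: O(k) lookups, each
// O(1) amortized. A real decoder must ensure a shorter match cannot
// shadow a longer valid instruction.
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SizedDecoderCache {
    // instruction size -> (instruction bytes -> decoded instruction)
    private final Map<Integer, Map<ByteBuffer, String>> caches = new HashMap<>();
    private final List<Integer> knownSizes; // from the spec file, ascending

    SizedDecoderCache(List<Integer> knownSizes) {
        this.knownSizes = knownSizes;
    }

    void put(byte[] bytes, String decoded) {
        caches.computeIfAbsent(bytes.length, k -> new HashMap<>())
              .put(ByteBuffer.wrap(bytes.clone()), decoded);
    }

    String lookup(byte[] input) {
        for (int size : knownSizes) {
            if (size > input.length) break;
            Map<ByteBuffer, String> bySize = caches.get(size);
            if (bySize == null) continue;
            String hit = bySize.get(ByteBuffer.wrap(Arrays.copyOf(input, size)));
            if (hit != null) return hit;
        }
        return null; // miss: fall back to the full decoding process
    }

    public static void main(String[] args) {
        SizedDecoderCache c = new SizedDecoderCache(List.of(1, 2));
        c.put(new byte[]{(byte) 0x90}, "NOP");
        System.out.println(c.lookup(new byte[]{(byte) 0x90, 0x33})); // NOP
        System.out.println(c.lookup(new byte[]{0x01}));              // null (miss)
    }
}
```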
When considering algorithmic complexity, the cache lookup is O(k), where k is the number of instruction sizes (assuming that a lookup of Array[Byte] in a hash map is O(1)).
Decoding itself has complexity O(n * h * m), where n is the number of rules, h is the number of masks in a rule, and m is the number of patterns in a mask. So the final worst-case complexity of the cache solution can be up to O(k) + O(n * h * m).
I think we're getting into a problem the JVM HotSpot needs to deal with. HotSpot interprets instructions until the hit count of one instruction (or a block) reaches a threshold. When this threshold is reached, the block is compiled to native code and cached. From then on, every time this address is hit, the cached value is used.
If we take inspiration from this, we can improve our decoding algorithm based on the memory location of an instruction. Then we don't need to know instruction sizes at all. The only thing we need to take care of is to monitor whether any memory location backing a cached instruction has changed.
The cache can be in the form Map[Integer, DecodedInstruction]. The algorithm can work as follows:
The requirement is that we implement a memory listener and invalidate cache entries whose locations changed. In the implementation of the map we can use an LRU eviction algorithm (e.g. https://github.com/ben-manes/caffeine).
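A stdlib-only sketch of such an address-keyed cache with LRU eviction and write invalidation (a LinkedHashMap in access order stands in for a production library like Caffeine; all names and the string-valued instructions are assumptions):

```java
// Hypothetical sketch: decoded instructions cached by memory address.
// LinkedHashMap with accessOrder=true plus removeEldestEntry gives a
// simple LRU policy; invalidate() is what the memory listener would
// call when a write hits a cached address.
import java.util.LinkedHashMap;
import java.util.Map;

class AddressDecoderCache {
    private final Map<Integer, String> cache;

    AddressDecoderCache(int capacity) {
        this.cache = new LinkedHashMap<Integer, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String> e) {
                return size() > capacity; // evict least recently used
            }
        };
    }

    String get(int address)               { return cache.get(address); }
    void put(int address, String decoded) { cache.put(address, decoded); }

    // Called by the memory listener when a write hits a cached address.
    void invalidate(int address)          { cache.remove(address); }

    public static void main(String[] args) {
        AddressDecoderCache c = new AddressDecoderCache(2);
        c.put(0x100, "MOV A,B");
        c.put(0x101, "NOP");
        c.get(0x100);          // touch 0x100, so 0x101 becomes the eldest
        c.put(0x102, "HALT");  // exceeds capacity: evicts 0x101
        System.out.println(c.get(0x101)); // null (evicted)
        c.invalidate(0x100);   // memory at 0x100 was overwritten
        System.out.println(c.get(0x100)); // null (invalidated)
        System.out.println(c.get(0x102)); // HALT
    }
}
```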
The algorithmic complexity of the above is, in the worst case, no worse than the current decoding algorithm. In the best case, however, it is O(1).
The cache should work like this: every time an instruction is decoded, it is placed into a cache of limited capacity. The next time, if the supplied bytes are exactly the same, the decoded instruction is returned directly from the cache, without performing the decoding process.
There is one problem: the size of the instruction is unknown before decoding. So the cache should be a combination of a tree and hash maps.
Something similar could be also done for the disassembler.
If insertion and lookup were fast enough, this would significantly improve the emulator's speed.