emustudio / edigen

Emulator Disassembler Generator
GNU General Public License v3.0
4 stars 0 forks source link

Allow to force definition of subrule pattern manually #14

Closed vbmacher closed 12 years ago

vbmacher commented 12 years ago

Following our last email discussion from 7.9. I was still facing to several problems with rules ambiguity, good definition of disassembler formats and it looked horrible unclear when reading the specification file - due various "workarounds" needed for edigen satisfaction.

Therefore, I got an idea how it can be improved, at least temporarily - you can possibly change it later. If some rules are going to be ambiguous, you can explicitly define so called "pre-pattern" - manual forward definition of subrule pattern, such as this:

instr =
  "rlc": 000111  |  "rrc": 001111  | "ral": 010111  |  "rar": 011111  | "nop": 000000                |
  "ldax": 0 ldax_bc_de[01](2) 010 | "ldax" : 0 ldax_bc_de[11](2) 010 ;

ldax_bc_de =
  "BC": 01 | "DE": 11;

The [] denotes pattern definition for a subrule. Edigen will "trust" to the pattern ultimately and this is a threat, because the "pre-pattern" does not have to correspond to real pattern of the subrule. Therefore, it is not very good solution and it should be solved later.

It is up to you either to accept it, or not. Or, better we can discuss this and I can potentionally fix this according to some consensus...

sulir commented 12 years ago

I don't see any reason why the example shouldn't be written like this:

instr =
  "rlc": 000111  |  "rrc": 001111  | "ral": 010111  |  "rar": 011111  | "nop": 000000 |
  "ldax": 0 ldax_reg(1) 1010;

ldax_reg =
  "BC": 01 | "DE": 11;

It's clear and it directly copies the manual.

If this was only a figurative example and you have a problem with other instructions, please send me the whole input file to see how can we improve that (either by modifying Edigen or the input file).

vbmacher commented 12 years ago

OK, I think the following example would be enough. In 8080, hlt and mov instructions are very similar, because hlt is in fact mov M,M - which is illegal in 8080.

In edigen specification, I would be happy to write:

root = 
  "hlt": 01110110 |
  "mov": 01 src_reg_mless(3) dst_reg | "mov": 01 src_reg(3) dst_reg_mless;

src_reg, dst_reg =
 "B": 000 | "C": 001 | "D": 010 | "E": 011 | "H": 100 | "L": 101 | "M": 110 | "A": 111;

src_reg_mless, dst_reg_mless =
 "B": 000 | "C": 001 | "D": 010 | "E": 011 | "H": 100 | "L": 101 | "A": 111;

but I can't, because there is ambiguity. I can fix this by saying:

root = 
  "hlt": 01110110 |
  "mov": 01 src_reg[0](3) dst_reg | "mov": 01 src_reg[10](3) dst_reg | "mov": 01 src_reg[111](3) dst_reg;

src_reg, dst_reg =
 "B": 000 | "C": 001 | "D": 010 | "E": 011 | "H": 100 | "L": 101 | "M": 110 | "A": 111;

I know this is not very "happy" solution, but it was simple to implement. How would you solve it? - either in the file or in the edigen?

sulir commented 12 years ago

Maybe we could implement a "high priority tag" (exclamation mark), like this:

instr = 
 ! "hlt": 01110110 |
 "mov": 01 src_reg(3) dst_reg;

src_reg, dst_reg =
 "B": 000 | "C": 001 | "D": 010 | "E": 011 | "H": 100 | "L": 101 | "M": 110 | "A": 111;

The marked variant would be guaranteed to be evaluated before all other variants and excluded from ambiguity check.

A more systematic approach - defining the priority manually and disabling the ambiguity check completely - would completely ruin the optimization (GroupVisitor), so it is not an option.

Would the "exclamation mark" suffice?

vbmacher commented 12 years ago

We can use the exclamation mark and it will solve my problem. But I am not sure if the problem would be solved generally. What if you would have something like this:

"hlt" : 01110110
"hls": 01111011
"mov": 01 dst_reg(3) src_reg

Ofcourse this grammar is artifical, but would exclamation mark help for hlt? Or would we be able to use more of them? Like this:

instr =
  ! "hlt" : 01110110 |
  ! "hls": 01111011 |
  "mov": 01 dst_reg(3) src_reg |
  ... ;

This should guarantee that hlt and hls will be evaluated before other variants, but in undefined order. But - would it have better or worse impact on optimization when compared to my previous solution? In either case, this "new" solution is less error-prone.

I think it would be good to have some theoretical proof that all "binary grammars" belong to some known language set, for example LL(k) and if yes, make sure it is possible to write such language grammar in edigen. I don't know if it is possible to prove it or whether such proof exists at all. But definitely it would be the most general and precise approach.

sulir commented 12 years ago

One high-priority variant in a rule would be quite easy and it wouldn't affect the speed significantly. The variants are grouped by masks. If we moved one variant to the beginning of a subtree, its mask(s) and pattern(s) would become the first one, so it (they) would be evaluated at the beginning.

Multiple exclamation marks would be more difficult to implement and less efficient, because they can have different masks.

So, I will merge your solution.