Closed whusym closed 6 years ago
This happens because the regex uses a look-behind to find the opcode, but the look-behind is too complex for Java's regex engine.
The solution might be to make the regex a list of opcodes separated by |.
The best way to do this would be to scan over a handful of asm files, collect the opcodes into a set, and generate the regex. This can be done offline, locally, in pure Python. Then we just replace the regex in elizabeth.preprocess with the new one.
What do you think of this one?
(?![0-9A-F][0-9A-F]\s+)[a-z]+[^\s|\,|\]]
This extracts all lower-case non-numeric words (opcodes, as I understand it). It will still extract db, dw, and dd, which we take to be about databases and not to matter? To exclude them, we can make a stopword list with these three elements and use the stopword transformer.
I will give it a try.
I'm pretty sure this won't work.
There are a lot more lower-case non-numeric words than just the opcodes. We have the segment identifier at the beginning of each line, comments inserted by the decompiler, and arguments. Plus I don't think this matches the opcodes anyway. It will pull in the last byte preceding the opcode, thus duplicating opcodes which should be considered the same.
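A quick check in Python illustrates the over-matching. The sample line below is made up in the style of the asm listings (an assumption, not taken from the dataset), and Python's `re` stands in for the Java engine, so treat it as a rough sketch only:

```python
import re

# The pattern proposed above, copied verbatim.
proposed = re.compile(r'(?![0-9A-F][0-9A-F]\s+)[a-z]+[^\s|\,|\]]')

# Hypothetical listing line: segment, address, one byte, opcode, operand.
sample_line = '.text:00401000 56 push esi'

tokens = proposed.findall(sample_line)
print(tokens)  # ['text:', 'push', 'esi']
```

On this line the segment name and the operand come back alongside the opcode, so the tokens would not be opcodes only.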
Regex for all opcodes in the small set
sti|pmulhw|cmpsb|dec|setnle|paddusw|ins|psadbw|rdtsc|shld|xchg|daa|psubsb|fldln|unk|cmovle|fyl|out|movdq|fcos|cmpxchg|loope|setnb|setz|iret|das|ror|f|shrd|prefetcht|fist|fbld|fisubr|mulpd|psubusw|movd|pushf|jl|psrlq|jnz|movlps|pcmpgtb|stosb|pmullw|tbyte|cmova|pop|jge|movlpd|psrlw|fiadd|fsubp|cpuid|fxch|jmp|jnp|cy|movdqa|pavgusb|rcl|mov|hlt|inc|pandn|bsf|movdqu|stmxcsr|frndint|fucompp|fnstenv|wrmsr|jp|cli|lodsw|riid|mul|int|sar|setl|psrld|cmovb|pmulhuw|clc|psrldq|pmaddwd|scasb|movapd|outsw|movq|setbe|rcr|aad|bswap|fidivr|fisttp|xor|fcom|movaps|pusha|frstor|pshufhw|packuswb|outsd|fst|psubsw|byte|scasd|movntdq|andpd|rep|fsub|stc|fbstp|setnz|prefetchnta|jle|fsubrp|fndisi|fnclex|cmc|fmulp|psrad|vmovdqu|aam|stru|fcmovnbe|movntq|unpckhpd|paddb|psllw|div|fmul|fnstcw|mulsd|pcmpeqw|fxsave|femms|fcomip|fld|adc|pavgb|punpckhbw|fldz|ldmxcsr|jbe|bound|in|cld|psubw|a|pminsw|fldlg|paddsb|pxor|seto|paddsw|punpckhdq|lea|ja|icebp|cmpps|fistp|sfence|fsin|xbegin|fcomi|punpckhwd|cmps|shr|lodsb|wait|emms|setb|setns|fucomip|movzx|fxam|orps|jo|ht|std|h|sahf|fsubr|fucomp|cwde|jns|fnstsw|pslld|rc|ficomp|pextrw|insb|packssdw|cmovg|retn|cmovl|popf|ficom|cbw|faddp|fldl|fimul|connect|push|pshufd|cmovnz|movsx|psubd|cmovnb|movsw|cmovns|dd|lahf|punpcklqdq|fscale|dw|cmovbe|rol|psz|aas|fstcw|pcmpeqd|lods|paddusb|cmpsd|pshuflw|packsswb|paddw|lodsd|lock|cmovge|sbb|xlat|rclsid|pmaxub|enter|les|pminub|btc|sets|bt|off|pslldq|punpckhqdq|fucom|pshufw|arpl|vpunpckhqdq|extrn|fcmovnu|shl|into|pand|paddd|fabs|psraw|fidiv|bsr|fneni|dbl|popa|outsb|movntps|fucomi|leave|scas|fadd|jecxz|movs|lds|fild|fstsw|fcmovne|align|recv|fcomp|bts|subps|stosw|imul|jz|punpckldq|asc|cmpsw|fdiv|movsb|setnbe|psubb|pcmpgtd|word|add|fcmovbe|lp|jb|sal|jno|subsd|cmovz|psubusb|movsd|js|test|fcompp|fldcw|fstp|paddq|fldenv|neg|flt|outs|fpatan|idiv|and|call|orpd|fdivp|insd|por|aaa|prefetch|psllq|cmp|hnt|setalc|dword|pcmpeqb|fcmove|pcmpgtw|sldt|stosd|addsd|fdivr|db|cvttsd|addpd|ffreep|cdq|pavgw|pmaxsw|accept|punpcklwd|nop|movups|loop|sub|loopne|not|fsqrt|sz|retf|cmovs|fnsave|cmpneqpd|fchs|fprem|unicode|setnl|repe|jnb|repne|fdivrp|fisub|setle|sysexit|fninit|jg|punpcklbw|or
Generated with:
>>> import re
>>> from pathlib import Path
>>> data = Path('./data/asm')
>>> # segment name, address, one or more hex bytes, then the opcode (group 4)
>>> pat = re.compile(r'\.([a-z]+):([0-9A-F]+)(\s[0-9A-F]{2})+\s+([a-z]+)')
>>> opcodes = set()
>>> for p in data.glob('*.asm'):
...     with p.open() as f:
...         try:
...             for line in f:
...                 m = pat.match(line)
...                 if m: opcodes.add(m[4])
...         except UnicodeDecodeError:
...             # skip files that cannot be decoded as text
...             continue
...
>>> '|'.join(opcodes)
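One caveat with a plain join: set order is arbitrary, and in an alternation a short opcode such as loop can match the front of a longer one such as loopne. A possible refinement, sketched here with a hypothetical helper name and a small illustrative subset of opcodes, is to sort longest-first and wrap the alternation in word boundaries. This assumes the target engine supports `\b` and non-capturing groups, which both Python's `re` and `java.util.regex` do:

```python
import re

def build_opcode_pattern(opcodes):
    """Join opcodes into one alternation, longest first, inside word boundaries."""
    # Sorting makes the output deterministic (set order is not) and puts
    # longer opcodes ahead of their prefixes, e.g. 'loopne' before 'loop'.
    ordered = sorted(opcodes, key=lambda op: (-len(op), op))
    return r'\b(?:' + '|'.join(re.escape(op) for op in ordered) + r')\b'

# Hypothetical subset of the collected opcodes, for illustration only.
sample = {'loop', 'loopne', 'mov', 'movzx', 'push'}
pattern = build_opcode_pattern(sample)

# Operands such as 'eax' are not in the set, so they are not returned.
tokens = re.findall(pattern, 'push eax mov ebx loopne movzx')
```

Here `pattern` comes out as `\b(?:loopne|movzx|loop|push|mov)\b`, and `tokens` contains only the four opcodes in the sample string.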
I guess I didn't quite understand what an opcode is... I was trying to get anything other than bytes. But now I have a better idea. Thanks!
I think we can close this now that we have a working regex for the section titles and the opcodes.
We should not close until the fix actually lands in master.
The new regex landed in #27
I encountered this error when reading regex. We should change the regex so we can tokenize the opcode.