Regex error when reading asm files

whusym commented 6 years ago

I encountered this error when reading the asm files. We should change the regex so that we can tokenize the opcodes.

Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$createTransformFunc$2: (string) => array<string>)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more
Caused by: java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 53
(?<=\.([a-z]+):([0-9A-F]+)((?:\s[0-9A-F]{2})+)\s+)([a-z]+)
                                                ^
    at java.util.regex.Pattern.error(Pattern.java:1955)
    at java.util.regex.Pattern.group0(Pattern.java:2863)
    at java.util.regex.Pattern.sequence(Pattern.java:2051)
    at java.util.regex.Pattern.expr(Pattern.java:1996)
    at java.util.regex.Pattern.compile(Pattern.java:1696)
    at java.util.regex.Pattern.<init>(Pattern.java:1351)
    at java.util.regex.Pattern.compile(Pattern.java:1028)
    at scala.util.matching.Regex.<init>(Regex.scala:191)
    at scala.collection.immutable.StringLike$class.r(StringLike.scala:255)
    at scala.collection.immutable.StringOps.r(StringOps.scala:29)
    at scala.collection.immutable.StringLike$class.r(StringLike.scala:244)
    at scala.collection.immutable.StringOps.r(StringOps.scala:29)
    at org.apache.spark.ml.feature.RegexTokenizer$$anonfun$createTransformFunc$2.apply(Tokenizer.scala:141)
    at org.apache.spark.ml.feature.RegexTokenizer$$anonfun$createTransformFunc$2.apply(Tokenizer.scala:140)
    ... 12 more

cbarrick commented 6 years ago

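This happens because the regex uses a look-behind to find the opcode, and Java's regex engine only accepts a look-behind whose maximum length it can determine. The + quantifiers inside this one make its length unbounded, hence the "does not have an obvious maximum length" error.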
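Python's re module has a similar (stricter) restriction, so the problem can be reproduced offline. A minimal sketch, assuming a line layout like `.text:00401000 56 push esi` (a made-up sample, not taken from the dataset): the look-behind is rejected at compile time, while the same prefix matched normally exposes the opcode as capture group 4.

```python
import re

# The look-behind from the stack trace; the `+` quantifiers make its width unbounded.
LOOKBEHIND = r'(?<=\.([a-z]+):([0-9A-F]+)((?:\s[0-9A-F]{2})+)\s+)([a-z]+)'

try:
    re.compile(LOOKBEHIND)
    lookbehind_ok = True
except re.error:
    # Python's re only accepts fixed-width look-behind patterns.
    lookbehind_ok = False

# Same prefix, matched normally; the opcode is capture group 4.
CAPTURE = re.compile(r'\.([a-z]+):([0-9A-F]+)((?:\s[0-9A-F]{2})+)\s+([a-z]+)')

sample = '.text:00401000 56 push esi'  # hypothetical line for illustration
m = CAPTURE.match(sample)
opcode = m.group(4) if m else None
print(lookbehind_ok, opcode)
```

This only covers offline extraction. Whether a capture group can be used directly depends on how the tokenizer consumes the pattern.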

The solution might be to make the regex the list of opcodes separated by |.

The best way to do this would be to scan over a handful of asm files, collect the opcodes into a set, and generate the regex. This can be done offline, locally, in pure Python. Then we just replace the regex in elizabeth.preprocess with the new one.

whusym commented 6 years ago

What do you think of this one?

(?![0-9A-F][0-9A-F]\s+)[a-z]+[^\s|\,|\]]

This extracts all lower-case non-numeric words (opcodes, as I understand it). It will still extract db, dw, and dd, which we take to be data definitions (define byte/word/doubleword) rather than instructions, so they should not matter? To exclude them, we can make a stop-word list containing these three elements and use the stop-words transformer.

I will give it a try.

cbarrick commented 6 years ago

I'm pretty sure this won't work.

There are a lot more lower-case non-numeric words than just the opcodes. We have the segment identifier at the beginning of each line, comments inserted by the decompiler, and arguments. Plus I don't think this matches the opcodes anyway. It will pull in the last byte preceding the opcode, thus duplicating opcodes which should be considered the same.
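A quick way to check the first objection is to run the proposed pattern over a single line in Python. This is only a sketch: the sample line is made up for illustration and real asm files may be laid out differently.

```python
import re

# Pattern proposed earlier in the thread, unchanged.
proposed = re.compile(r'(?![0-9A-F][0-9A-F]\s+)[a-z]+[^\s|\,|\]]')

sample = '.text:00401000 56 push esi'  # hypothetical line for illustration
tokens = proposed.findall(sample)
print(tokens)  # ['text:', 'push', 'esi']
```

On this sample the segment name and the operand come back alongside the opcode, so the pattern does not isolate opcodes on its own.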

cbarrick commented 6 years ago

Regex for all opcodes in the small set

sti|pmulhw|cmpsb|dec|setnle|paddusw|ins|psadbw|rdtsc|shld|xchg|daa|psubsb|fldln|unk|cmovle|fyl|out|movdq|fcos|cmpxchg|loope|setnb|setz|iret|das|ror|f|shrd|prefetcht|fist|fbld|fisubr|mulpd|psubusw|movd|pushf|jl|psrlq|jnz|movlps|pcmpgtb|stosb|pmullw|tbyte|cmova|pop|jge|movlpd|psrlw|fiadd|fsubp|cpuid|fxch|jmp|jnp|cy|movdqa|pavgusb|rcl|mov|hlt|inc|pandn|bsf|movdqu|stmxcsr|frndint|fucompp|fnstenv|wrmsr|jp|cli|lodsw|riid|mul|int|sar|setl|psrld|cmovb|pmulhuw|clc|psrldq|pmaddwd|scasb|movapd|outsw|movq|setbe|rcr|aad|bswap|fidivr|fisttp|xor|fcom|movaps|pusha|frstor|pshufhw|packuswb|outsd|fst|psubsw|byte|scasd|movntdq|andpd|rep|fsub|stc|fbstp|setnz|prefetchnta|jle|fsubrp|fndisi|fnclex|cmc|fmulp|psrad|vmovdqu|aam|stru|fcmovnbe|movntq|unpckhpd|paddb|psllw|div|fmul|fnstcw|mulsd|pcmpeqw|fxsave|femms|fcomip|fld|adc|pavgb|punpckhbw|fldz|ldmxcsr|jbe|bound|in|cld|psubw|a|pminsw|fldlg|paddsb|pxor|seto|paddsw|punpckhdq|lea|ja|icebp|cmpps|fistp|sfence|fsin|xbegin|fcomi|punpckhwd|cmps|shr|lodsb|wait|emms|setb|setns|fucomip|movzx|fxam|orps|jo|ht|std|h|sahf|fsubr|fucomp|cwde|jns|fnstsw|pslld|rc|ficomp|pextrw|insb|packssdw|cmovg|retn|cmovl|popf|ficom|cbw|faddp|fldl|fimul|connect|push|pshufd|cmovnz|movsx|psubd|cmovnb|movsw|cmovns|dd|lahf|punpcklqdq|fscale|dw|cmovbe|rol|psz|aas|fstcw|pcmpeqd|lods|paddusb|cmpsd|pshuflw|packsswb|paddw|lodsd|lock|cmovge|sbb|xlat|rclsid|pmaxub|enter|les|pminub|btc|sets|bt|off|pslldq|punpckhqdq|fucom|pshufw|arpl|vpunpckhqdq|extrn|fcmovnu|shl|into|pand|paddd|fabs|psraw|fidiv|bsr|fneni|dbl|popa|outsb|movntps|fucomi|leave|scas|fadd|jecxz|movs|lds|fild|fstsw|fcmovne|align|recv|fcomp|bts|subps|stosw|imul|jz|punpckldq|asc|cmpsw|fdiv|movsb|setnbe|psubb|pcmpgtd|word|add|fcmovbe|lp|jb|sal|jno|subsd|cmovz|psubusb|movsd|js|test|fcompp|fldcw|fstp|paddq|fldenv|neg|flt|outs|fpatan|idiv|and|call|orpd|fdivp|insd|por|aaa|prefetch|psllq|cmp|hnt|setalc|dword|pcmpeqb|fcmove|pcmpgtw|sldt|stosd|addsd|fdivr|db|cvttsd|addpd|ffreep|cdq|pavgw|pmaxsw|accept|punpcklwd|nop|movups|loop|sub|loopne|not|fsqrt|sz|retf|cmovs|fnsave|cmpneqpd|fchs|fprem|unicode|setnl|repe|jnb|repne|fdivrp|fisub|setle|sysexit|fninit|jg|punpcklbw|or

Generated with:

>>> import re
>>> from pathlib import Path
>>> data = Path('./data/asm')
>>> # Raw string so the regex escapes (\., \s) are not read as string escapes.
>>> pat = re.compile(r'\.([a-z]+):([0-9A-F]+)(\s[0-9A-F]{2})+\s+([a-z]+)')
>>> opcodes = set()
>>> for p in data.glob('*.asm'):
...   with p.open() as f:
...     try:
...       for line in f:
...         m = pat.match(line)
...         if m: opcodes.add(m[4])  # group 4 is the opcode
...     except UnicodeDecodeError:
...       continue  # skip the rest of a file that cannot be decoded
...
>>> '|'.join(opcodes)
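A plain join leaves the alternatives in arbitrary set order and lets a short opcode match inside a longer word. One possible hardening, sketched here with a hypothetical helper `build_opcode_regex` (not part of elizabeth.preprocess), is to sort the alternatives and wrap them in word boundaries:

```python
import re

def build_opcode_regex(opcodes):
    """Join opcodes into one alternation wrapped in word boundaries."""
    # Sets are unordered, so sort to get the same regex on every run.
    # Longest-first is a safeguard in case the \b anchors are ever dropped.
    ordered = sorted(set(opcodes), key=lambda op: (-len(op), op))
    return r'\b(?:' + '|'.join(re.escape(op) for op in ordered) + r')\b'

pattern = build_opcode_regex({'loop', 'loopne', 'mov', 'movzx'})
print(pattern)  # \b(?:loopne|movzx|loop|mov)\b

# Without boundaries, 'loop' shadows 'loopne'.
naive_hits = re.findall('loop|loopne', 'loopne')
safe_hits = re.findall(pattern, 'loopne short loc_401000')
```

Both `\b` and non-capturing groups are also supported by java.util.regex, so a pattern built this way should carry over to the Spark side, though that has not been checked against the actual tokenizer here.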

whusym commented 6 years ago

I guess I didn't quite understand what an opcode is... I was trying to get anything other than the bytes. But now I have a better idea. Thanks!

zachdj commented 6 years ago

I think we can close this now that we have a working regex for the section titles and the opcodes.

cbarrick commented 6 years ago

We should not close until the fix actually lands in master

cbarrick commented 6 years ago

The new regex landed in #27

dsp-uga / elizabeth

Regex error when reading asm files #23