BioJulia / Automa.jl

A julia code generator for regular expressions
Other
188 stars 15 forks source link

Add support for multi-byte transitions #110

Open jakobnissen opened 2 years ago

jakobnissen commented 2 years ago

Automa produces bad code to match string literals. The following machine: compile(re"abracadabra") produces this code:

Machine code ``` .text subq $24, %rsp movq (%rdi), %rdx leaq 8(%rdi), %rcx movq %rcx, 8(%rsp) movq %rdx, 16(%rsp) testq %rdx, %rdx je L267 movb (%rcx), %al cmpb $97, %al jne L277 cmpq $2, %rdx jb L287 movb 1(%rcx), %al cmpb $98, %al jne L297 cmpq $3, %rdx jb L307 movb 2(%rcx), %al cmpb $114, %al jne L314 cmpq $4, %rdx jb L324 movb 3(%rcx), %al cmpb $97, %al jne L331 cmpq $5, %rdx jb L341 movb 4(%rcx), %al cmpb $99, %al jne L348 cmpq $6, %rdx jb L355 movb 5(%rcx), %al cmpb $97, %al jne L362 cmpq $7, %rdx jb L369 movb 6(%rcx), %al cmpb $100, %al jne L376 cmpq $8, %rdx jb L383 movb 7(%rcx), %al cmpb $97, %al jne L390 cmpq $9, %rdx jb L397 movb 16(%rdi), %al cmpb $98, %al jne L404 cmpq $10, %rdx jb L411 movb 9(%rcx), %al cmpb $114, %al jne L418 cmpq $11, %rdx jb L425 movb 10(%rcx), %al cmpb $97, %al jne L462 movb $1, %al addq $24, %rsp retq L267: movl $1, %ecx jmp L430 L277: movl $1, %esi jmp L477 L287: movl $2, %ecx jmp L430 L297: movl $2, %esi jmp L477 L307: movl $3, %ecx jmp L430 L314: movl $3, %esi jmp L477 L324: movl $4, %ecx jmp L430 L331: movl $4, %esi jmp L477 L341: movl $5, %ecx jmp L430 L348: movl $5, %esi jmp L477 L355: movl $6, %ecx jmp L430 L362: movl $6, %esi jmp L477 L369: movl $7, %ecx jmp L430 L376: movl $7, %esi jmp L477 L383: movl $8, %ecx jmp L430 L390: movl $8, %esi jmp L477 L397: movl $9, %ecx jmp L430 L404: movl $9, %esi jmp L477 L425: movl $11, %ecx L430: movabsq $throw_input_error, %rax movabsq $139695384118736, %rdi # imm = 0x7F0D5DBF45D0 leaq 8(%rsp), %rdx movq %rcx, %rsi callq *%rax ud2 L462: movl $11, %esi jmp L477 L469: movb 11(%rcx), %al movl $12, %esi L477: movzbl %al, %edx movabsq $throw_input_error, %rax movabsq $139695384118736, %rdi # imm = 0x7F0D5DBF45D0 leaq 8(%rsp), %rcx movq %rsi, %r8 callq *%rax ud2 ```

Which is almost funny in how bad it is. It's not a big deal for FASTA/SAM inputs, because it has no long strings to match. But we can do better, literal string matching is not a difficult algorithm.

The hard part is: What about when you read from an IO? Then the buffer could run out in the middle of the string, in which case the machine should be able to match e.g. "abra", then reload the buffer, then match "cadabra". I'm not sure how to handle that.

Putting that issue aside, I propose implementing this deeply, perhaps at every level of Automa, from the RE objects where catting string/char literals should result in just a literal string, to NFAs/DFAs, where literal strings should be a single edge, to Machines.

It's a larger undertaking but in a sense straightforward, and it could be fun.

kescobo commented 2 years ago

and it could be fun.

Famous last words :laughing:

I don't have any idea what most of this means, but it sounds awesome :-P. My main question is: would this substantially improve speed, correctness, usability, or something else that justifies the effort? I mean, if it's fun, by all means do it even if it's just an intellectual exercise to improve "elegance" without other concrete benefit.

jakobnissen commented 2 years ago

It would improve compilation speed and speed of generated code (for string literals), and also make compound regex, nfas dfas and machines have fewer states and so be easier to plot and debug

kescobo commented 2 years ago

All of that sounds great! Go forth :stuck_out_tongue: