ShinobuAmasaki / forgex

A Regular Expression Engine written entirely in Fortran
https://shinobuamasaki.github.io/forgex/
MIT License
21 stars 3 forks source link
fortran fortran-package-manager regex regular-expression

Fortran Regular Expression

Forgex—Fortran Regular Expression—is a regular expression engine written entirely in Fortran.

This project is managed by Fortran Package Manager (FPM), providing basic processing of regular expression, and as a freely available under the MIT license. The engine's core algorithm uses a deterministic finite automaton (DFA) approach. This choice have been focused on runtime performance.

Features

Metacharacter

Character class

Note that inverted character class does not match the control characters.

Range of repetition

Anchor

Shorthand

Documentation

The documentation is available in English and Japanese at https://shinobuamasaki.github.io/forgex.

Usage

Build

Operation has been confirmed with the following compilers:

It is assumed that you will use the Fortran Package Manager(fpm).

First of all, add the following to your project's fpm.toml:

[dependencies]
forgex = {git = "https://github.com/shinobuamasaki/forgex"}

APIs

When you write use forgex at the header on your program, .in. and .match. operators, regex subroutine, and regex_f function are introduced.

program main
   use :: forgex
   implicit none

The .in. operator returns true if the pattern is contained in the string.

block
   character(:), allocatable :: pattern, str

   pattern = 'foo(bar|baz)'
   str = "foobarbaz"
   print *, pattern .in. str  ! T

   str = "foofoo"
   print *, pattern .in. str  ! F
end block

The .match. operator returns true if the pattern exactly matches the string.

block
   character(:), allocatable :: pattern, str
   pattern = '\d{3}-\d{4}'
   str = '100-0001'
   print *, pattern .match. str  ! T

   str = '1234567'
   print *, pattern .match. str  ! F
end block

The regex is a subroutine that returns the substring of a string that matches a pattern as its arguments.

block
   character(:), allocatable :: pattern, str, res
   integer :: length

   pattern = 'foo(bar|baz)'
   str = 'foobarbaz'

   call regex(pattern, str, res)
   print *, res                              ! foobar

   ! call regex(pattern, str, res, length)
      ! the value 6 stored in optional `length` variable.

end block

By using the from/to arugments, you can extract substrings from the given string.

block
   character(:), allocatable :: pattern, str, res
   integer :: from, to

   pattern = '[d-f]{3}'
   str = 'abcdefghi'

   call regex(pattern, str, res, from=from, to=to)
   print *, res                   ! def

   ! The `from` and `to` variables store the indices of the start and end points
   ! of the matched part of the string `str`, respectively.

   ! Cut out before the matched part.
   print *, str(1:from-1)        ! abc

   ! Cut out the matched part that equivalent to the result of the `regex` function.
   print *, str(from:to)         ! def

   ! Cut out after the matched part.
   print *, str(to+1:len(str))   ! ghi

end block

The interface of regex subroutine is following:

interface regex
   module procedure :: subroutine__regex
end interface

pure subroutine subroutine__regex(pattern, text, res, length, from, to)
   implicit none
   character(*),              intent(in)    :: pattern, text
   character(:), allocatable, intent(inout) :: res
   integer,      optional,    intent(inout) :: length, from, to

If you want to the matched character string as the return value of a function, consider using regex_f defined in the forgex module.

interface regex_f
   module procedure :: function__regex
end interface regex_f

pure function function__regex(pattern, text) result(res)
   implicit none
   character(*), intent(in)  :: pattern, text
   character(:), allocatable :: res

UTF-8 String matching

UTF-8 string can be matched using regular expression patterns just like ASCII strings. The following example demonstrates matching Chinese characters. In this example, the length variable stores the byte length, and in this case there 10 3-byte characters, so the length is 30.

block
   character(:), allocatable :: pattern, str
   integer :: length

   pattern = "夢.{1,7}胡蝶"
   str = "昔者莊周夢爲胡蝶 栩栩然胡蝶也"

   print *, pattern .in. str            ! T
   call regex(pattern, str, res, length)
   print *, res                         ! 夢爲胡蝶 栩栩然胡蝶
   print *, length                      ! 30 (is 3-byte * 10 characters)

end block

Command Line Interface Tool

Version 3.2 introduces a command line tool that is called forgex-cli and uses the Forgex engine for debugging, testing, and benchmarking regex matches. It performs matching with commands such as the one shown in below, and outputs the results directly to standard output.

% forgex-cli find match lazy-dfa '([a-z]*g+)n?' .match. 'assign'

             pattern: ([a-z]*g+)n?
                text: 'assign'
          parse time:        64.0μs
extract literal time:         6.9μs
         runs engine:         T
    compile nfa time:        47.8μs
 dfa initialize time:         5.4μs
         search time:       704.9μs
     matching result:         T
  memory (estimated):     10324

========== Thompson NFA ===========
state    1: (?, 5)
state    2: <Accepted>
state    3: (n, 2)(?, 2)
state    4: (g, 7)
state    5: (["a"-"f"], 6)(g, 6)(["h"-"m"], 6)(n, 6)(["o"-"z"], 6)(?, 4)
state    6: (?, 5)
state    7: (?, 8)
state    8: (g, 9)(?, 3)
state    9: (?, 8)
=============== DFA ===============
   1 : ["a"-"f"]=>2
   2 : ["o"-"z"]=>2 ["h"-"m"]=>2 g=>3
   3A: n=>4
   4A:
state    1  = ( 1 4 5 )
state    2  = ( 4 5 6 )
state    3A = ( 2 3 4 5 6 7 8 )
state    4A = ( 2 4 5 6 )
===================================

Starting with version 3.5, the command line tools are provided in a separate repository, see the link below:

ShinobuAmasaki/forgex-cli

Notes

To do

The following features are planned to be implemented in the future:

Code Convention

All code contained herein shall be written with a three-space indentation.

Acknowledgements

For the algorithm of the power set construction method and syntax analysis, I referred to Russ Cox's article and Yoshiyuki Kondo's book. The implementation of the priority queue was based on the code written by ue1221. The idea of applying the .in. operator to strings was inspired by kazulagi's one. The command-line interface design of forgex-cli was inspired in part by the package regex-cli of Rust language.

References

  1. Russ Cox "Regular Expression Matching Can Be Simple And Fast", 2007
  2. 近藤嘉雪 (Yoshiyuki Kondo), "定本 Cプログラマのためのアルゴリズムとデータ構造", 1998, SB Creative.
  3. ue1221/fortran-utilities
  4. Haruka Tomobe (kazulagi), https://github.com/kazulagi, his article in Japanese
  5. rust-lang/regex/regex-cli

License

Forgex is as a freely available under the MIT license. See LICENSE.