axman6 opened this issue 7 years ago
Hi @axman6,
You bring up an interesting question. I've been thinking about this and I have some ideas, but I will need a bit of time to express them in more detail.
The TL;DR is:

1) Using Regexps in Duckling sets the barrier to contribution quite low. Duckling benefited a lot from this, first in the Clojure incarnation and now as a Haskell project; most of the rules were contributed.
2) There are some optimizations we can do with Regexps that we cannot with parser combinators. I have a change in progress that compiles one big NFA from the heads of all the rules, which can filter out rules that have no chance of applying. On the test cases I've tried it with, it halves the cost of Duckling. This change essentially means the cost of Duckling no longer grows with each added rule.
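For illustration only, here is a rough sketch of that prefiltering idea (not the actual change; the `Rule` type, the rule list, and the use of regex-tdfa are made up for this example). The real change compiles all rule heads into a single automaton so one pass over the input is enough; this sketch just checks each head pattern as a cheap guard before a rule is considered at all:

```haskell
-- Sketch of prefiltering rules by their head patterns (hypothetical types,
-- not Duckling's). regex-tdfa compiles patterns to automata, so matching is
-- predictable and doesn't backtrack.
import Text.Regex.TDFA ((=~))

data Rule = Rule { ruleName :: String, headPattern :: String }

rules :: [Rule]
rules =
  [ Rule "integer" "[0-9]+"
  , Rule "url"     "[a-zA-Z]+://"
  ]

-- A rule whose head pattern matches nowhere in the input has no chance of
-- applying, so it can be dropped before the full rule machinery runs.
candidateRules :: String -> [Rule]
candidateRules input = [ r | r <- rules, input =~ headPattern r ]

main :: IO ()
main = mapM_ (putStrLn . ruleName) (candidateRules "visit https://example.com")
```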
3) There's room for combinator-like constructs in Duckling, but probably not via a stock combinator library. Using a Free Applicative we could make the rules expressible as:
```haskell
{-# LANGUAGE ApplicativeDo #-}

ruleURL :: Rule
ruleURL = mkRule "url" $ do
  m:_:_protocol:_:domain:_:_:_port:_path:_query:_ <- regex
    "((([a-zA-Z]+)://)?(w{2,3}[0-9]*\\.)?(([\\w_-]+\\.)+[a-z]{2,4})(:(\\d+))?(/[^?\\s#]*)?(\\?[^\\s#]+)?)"
  return $ Token Url $ url m domain
```
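To give a sense of why a Free Applicative (rather than a monadic parser) fits here: the rule description stays a static structure, so the head patterns can still be collected without running anything, which keeps the prefiltering from 2) possible. A rough sketch, with made-up names (`RulePrim`, `RuleDesc`, `regex`) and the free package's Control.Applicative.Free:

```haskell
{-# LANGUAGE GADTs #-}
-- Hypothetical free-applicative rule description; not Duckling's API.
import Control.Applicative.Free (Ap, liftAp, runAp_)

data RulePrim a where
  Regex :: String -> RulePrim [String]   -- a pattern, yielding its capture groups

type RuleDesc = Ap RulePrim

regex :: String -> RuleDesc [String]
regex = liftAp . Regex

-- Because the structure is static, every pattern a rule mentions can be
-- collected up front, e.g. to build the combined automaton.
patterns :: RuleDesc a -> [String]
patterns = runAp_ (\(Regex p) -> [p])

exampleRule :: RuleDesc ([String], [String])
exampleRule = (,) <$> regex "[0-9]+" <*> regex "(st|nd|rd|th)"

main :: IO ()
main = print (patterns exampleRule)   -- ["[0-9]+","(st|nd|rd|th)"]
```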
4) The shortcomings of parsing URLs today are orthogonal to the choice of underlying parsing library. We could make it work with Regexps. Or we can have a Regexp that overapproximates and use parseURI :: String -> Maybe URI to filter out false positives.
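As a rough sketch of that overapproximate-then-validate idea (not Duckling code; the loose pattern and example input are made up), using regex-tdfa together with Network.URI:

```haskell
-- Match anything that vaguely looks like scheme://... with a loose regex,
-- then keep only the candidates that Network.URI actually accepts.
import Network.URI (parseURI)
import Text.Regex.TDFA (getAllTextMatches, (=~))

-- Deliberately loose: a scheme, "://", then anything up to whitespace.
looseUrl :: String
looseUrl = "[a-zA-Z][a-zA-Z0-9+.-]*://[^ \t\n]+"

-- Overapproximate with the regex, then filter out the false positives.
extractUrls :: String -> [String]
extractUrls input =
  [ candidate
  | candidate <- getAllTextMatches (input =~ looseUrl)
  , Just _ <- [parseURI candidate]
  ]

main :: IO ()
main = print (extractUrls "see https://example.com/a?b=1 and not://<bad>")
```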
I hope to elaborate on 2) and 3) more in the future; this is just a sketch.

Anyway, thanks for thinking about this. If you have ideas, we're happy to hear them.
What is an NFA?
NFA = nondeterministic finite automaton (https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton). Compiling regexps down to NFAs gives predictable polynomial performance.
Would you consider supporting both regexes and parser combinators? I take your point for simple cases like URLs and emails. However, this machinery and contribution model could easily encompass, say, street addresses. I find many parallels between Duckling's and libpostal's architectures, but I think it unlikely we could achieve a sufficiently good street address parser using regexes, even with community support.
This might be a non-goal, but it is a worthy one.
@dmvianna I wouldn't discard the idea, though the added value would have to outweigh the complexity cost. You're right about the use of regexes for street addresses; they don't work properly. At this point postal addresses are out of scope, as they require resolution against an external data source.
It seems like libpostal is a good candidate for that, so I wouldn't jump into reinventing the wheel. What are the limitations that could be overcome by Duckling?
Recursion. As far as I understand, libpostal is not a good candidate for parsing free text. For example, say I want to parse street addresses out of patent abstracts (I have some toy code and data here), or other cases I routinely find at work, such as 1-25, 30-40 and 150 Collins St, or even 2 Fictional Rd and 3B Epic Av. libpostal simply won't parse multiple addresses in one string properly. And guess what, even regex can parse common cases for these examples; parser combinators can improve on that. What naïve code won't do is assign St to Street or Saint according to local context. libpostal sorta does that by returning all alternatives. Duckling, as I understand it, does implement a probabilistic parser, so it could potentially solve both problems.
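To make that concrete, here's a toy sketch (my own illustration, with a made-up pattern and regex-tdfa) of pulling the compound street numbers above out of free text with a plain regex; a combinator-based parser could then attach real structure to the pieces:

```haskell
-- Toy extraction of addresses like "1-25, 30-40 and 150 Collins St" with a
-- single POSIX regex; everything here is illustrative, not production code.
import Text.Regex.TDFA (getAllTextMatches, (=~))

-- Number ranges / numbers (optionally lettered), then street name words,
-- then a small set of street suffixes.
addressPattern :: String
addressPattern =
  "([0-9]+[A-Z]?(-[0-9]+[A-Z]?)?((, | and )[0-9]+[A-Z]?(-[0-9]+[A-Z]?)?)*) ([A-Z][a-z]+ )+(St|Rd|Av)"

extractAddresses :: String -> [String]
extractAddresses s = getAllTextMatches (s =~ addressPattern)

main :: IO ()
main = mapM_ putStrLn
  (extractAddresses "Offices at 1-25, 30-40 and 150 Collins St and 2 Fictional Rd")
-- expected output:
-- 1-25, 30-40 and 150 Collins St
-- 2 Fictional Rd
```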
@dmvianna I'm open to looking at a prototype for US postal addresses to see if we can get something useful even without the resolution part.
Many of the regexes have reached the point where a parser combinator library would be a much better option. A prime example is the URL matcher, which could be defined precisely with a parser combinator, while at the moment it's fairly ad hoc and loses a lot of information (the path doesn't work for URLs that contain usernames and passwords, something users might want to be able to match on in order to forbid or warn users who're posting URLs they shouldn't).

(For this specific example, the Network.URI package already provides parseURI :: String -> Maybe URI.)

I don't have an implementation for this yet (nor a preference for a combinator library) because I don't fully understand how Duckling fits together, and wanted to open this issue to start a discussion about it.
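To give a flavour of what I mean without committing to a library, here's a toy sketch using ReadP from base, with a made-up Url type; a real rule would also need the username/password/query parts and Duckling's Token plumbing:

```haskell
-- Illustrative only: a small combinator-based URL parser with ReadP.
import Data.Char (isAlpha, isAlphaNum, isDigit)
import Text.ParserCombinators.ReadP

data Url = Url
  { urlScheme :: String
  , urlHost   :: String
  , urlPort   :: Maybe Int
  , urlPath   :: String
  } deriving Show

url :: ReadP Url
url = do
  scheme <- munch1 isAlpha <* string "://"
  host   <- munch1 (\c -> isAlphaNum c || c `elem` ".-")
  port   <- option Nothing (Just . read <$> (char ':' *> munch1 isDigit))
  path   <- option "" ((:) <$> char '/' <*> munch (/= ' '))
  return (Url scheme host port path)

-- Accept only inputs the parser consumes completely.
parseUrl :: String -> Maybe Url
parseUrl s = case readP_to_S (url <* eof) s of
  [(u, "")] -> Just u
  _         -> Nothing

main :: IO ()
main = print (parseUrl "https://example.com:8080/index.html")
```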