facebook / duckling

Language, engine, and tooling for expressing, testing, and evaluating composable language rules on input strings.

Time interval data causing hang (potential regex dos vulnerability) #338

Open tommilligan opened 5 years ago

tommilligan commented 5 years ago

While running duckling over a large set of data, I ran across a particular example that caused it to hang reproducibly. Trying to extract time from the following text:

from 10.00am on Friday 15th April 2016 until 10.00am on Tuesday 19th April 2016

This example does not cause a crash or traceback; it just hangs indefinitely until the thread is killed. My best guess is

Both clauses (from and until) need to be present; neither is sufficient to cause the hang on its own. Removing trivial parts of the text allows the function call to return:

# empty response (thread killed)
"from 10.00am on Friday 15th April 2016 until 10.00am on Tuesday 19th April 2016"
# very slow
"from 10.00am Friday 15 April 2016 until 10.00am Tuesday 19 April 2016"
# works
"from 10.00am Friday 15 April until 10.00am Tuesday 19 April"

See #339 for a branch reproducing this case.
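
Until the slow path is addressed, a caller can bound the damage by running each parse under a timeout. The following is a minimal sketch using only `base`; `parseWithin` and its `extract` argument are hypothetical names (not part of Duckling), and `words` merely stands in for the real parser call. One caveat: `timeout` can only interrupt code that allocates, which a parser normally does.

```haskell
import Control.Exception (evaluate)
import System.Timeout (timeout)

-- | Run a pure extraction function on one input, giving up after the given
-- number of microseconds. Hypothetical helper, not part of Duckling.
parseWithin :: Int -> (String -> [a]) -> String -> IO (Maybe [a])
parseWithin micros extract input =
  -- 'evaluate' forces the list spine inside the timeout; without it the
  -- lazy result would escape unevaluated and the work would happen later,
  -- outside the time limit.
  timeout micros (evaluate (forceSpine (extract input)))
  where
    forceSpine xs = length xs `seq` xs

main :: IO ()
main = do
  -- 'words' is a stand-in (assumption) for the real parser call.
  result <- parseWithin 1000000 words
    "from 10.00am Friday 15 April until 10.00am Tuesday 19 April"
  case result of
    Nothing -> putStrLn "timed out"
    Just ts -> putStrLn ("finished with " ++ show (length ts) ++ " results")
```

Only the spine of the result list is forced here; if the elements themselves are expensive to compute, they would need to be forced inside the timeout as well.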

mauricedoepke commented 5 years ago

Some more examples to maybe narrow the problem down:

Working:
from 10.00am on Friday 15th until 10.00am on Tuesday 19th April
from 10.00am on Friday 15th April until 10.00am on Tuesday 19th

Not working:
from 10.00am on Friday 15th April until 10.00am on Tuesday 19th April

cristicbz commented 5 years ago

By bisecting the time rules, I traced the problem to the following rules (in Time/EN/Rules.hs):

  , ruleIntervalDash -- much slower (+40s by itself)
  , ruleIntervalFrom -- slower (+2s by itself)

Removing both fixes the issue. Adding only one of them changes the runtime as described in the comments. Adding both roughly sums the delays, i.e. it does terminate, it just takes a very long time.
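
A bisection like this comes down to timing the same input against different rule sets. A rough sketch of such a timing harness, using only `base`, might look as follows; `timeSpine` is a hypothetical helper, and the two entries in `candidates` are placeholders (built from `words`) for parser builds with and without a given rule, not real Duckling calls.

```haskell
import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)

-- | CPU seconds spent forcing the spine of a result list.
-- Hypothetical helper for comparing parser variants.
timeSpine :: [a] -> IO Double
timeSpine xs = do
  start <- getCPUTime
  _ <- evaluate (length xs)                  -- the work happens here
  end <- getCPUTime
  pure (fromIntegral (end - start) / 1e12)   -- picoseconds to seconds

main :: IO ()
main = do
  let input = "from 10.00am on Friday 15th April 2016 until 10.00am on Tuesday 19th April 2016"
      -- Placeholders (assumption): swap in the real parser variants.
      candidates =
        [ ("all rules", words)
        , ("one rule removed", take 3 . words)
        ]
  mapM_
    (\(label, extract) -> do
        secs <- timeSpine (extract input)
        putStrLn (label ++ ": " ++ show secs ++ " s"))
    candidates
```

`getCPUTime` reports CPU time rather than wall-clock time, which is usually what matters when comparing a single-threaded parse across variants.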

patapizza commented 5 years ago

Hey all, thanks for reporting. This is due to the complexity of resolving dates and times while computing segment intersections. The rarity of these inputs doesn't allow us to prioritize more efficient time arithmetic at the moment.

mcobzarenco commented 3 years ago

Having the same issue too

chessai commented 3 years ago

Indeed, this does terminate, but takes a long time:

> debug (makeLocale EN Nothing) "from 10.00am on Friday 15th April 2016 until 10.00am on Tuesday 19th April 2016" [Seal Time]
from <datetime> - <datetime> (interval) (from 10.00am on Friday 15th April 2016 until 10.00am on Tuesday 19th April 2016)
-- regex (from)
-- intersect (10.00am on Friday 15th April 2016)
-- -- intersect (10.00am on Friday)
-- -- -- <time-of-day> am|pm (10.00am)
-- -- -- -- hh:mm (10.00)
-- -- -- -- -- regex (10.00)
-- -- -- -- regex (am)
-- -- -- on <day> (on Friday)
-- -- -- -- regex (on)
-- -- -- -- Friday (Friday)
-- -- -- -- -- regex (Friday)
-- -- <day-of-month>(ordinal) <named-month> year (15th April 2016)
-- -- -- ordinal (digits) (15th)
-- -- -- -- regex (15th)
-- -- -- April (April)
-- -- -- -- regex (April)
-- -- -- regex (2016)
-- regex (until)
-- intersect (10.00am on Tuesday 19th April 2016)
-- -- <time-of-day> am|pm (10.00am)
-- -- -- hh:mm (10.00)
-- -- -- -- regex (10.00)
-- -- -- regex (am)
-- -- on <day> (on Tuesday 19th April 2016)
-- -- -- regex (on)
-- -- -- intersect (Tuesday 19th April 2016)
-- -- -- -- Tuesday (Tuesday)
-- -- -- -- -- regex (Tuesday)
-- -- -- -- <day-of-month>(ordinal) <named-month> year (19th April 2016)
-- -- -- -- -- ordinal (digits) (19th)
-- -- -- -- -- -- regex (19th)
-- -- -- -- -- April (April)
-- -- -- -- -- -- regex (April)
-- -- -- -- -- regex (2016)
[Entity {dim = "time", body = "from 10.00am on Friday 15th April 2016 until 10.00am on Tuesday 19th April 2016", value = RVal Time (TimeValue (IntervalValue (InstantValue {vValue = 2016-04-15 10:00:00 -0200, vGrain = Minute},InstantValue {vValue = 2016-04-19 10:01:00 -0200, vGrain = Minute})) [IntervalValue (InstantValue {vValue = 2016-04-15 10:00:00 -0200, vGrain = Minute},InstantValue {vValue = 2016-04-19 10:01:00 -0200, vGrain = Minute})] Nothing), start = 0, end = 79, latent = False, enode = Node {nodeRange = Range 0 79, token = Token Time TimeData{latent=False, grain=Minute, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = [Node {nodeRange = Range 0 4, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},Node {nodeRange = Range 5 38, token = Token Time TimeData{latent=False, grain=Minute, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = [Node {nodeRange = Range 5 22, token = Token Time TimeData{latent=False, grain=Minute, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = [Node {nodeRange = Range 5 12, token = Token Time TimeData{latent=False, grain=Minute, form=Just (TimeOfDay {hours = Nothing, is12H = False}), direction=Nothing, holiday=Nothing, hasTimezone=False}, children = [Node {nodeRange = Range 5 10, token = Token Time TimeData{latent=False, grain=Minute, form=Just (TimeOfDay {hours = Just 10, is12H = True}), direction=Nothing, holiday=Nothing, hasTimezone=False}, children = [Node {nodeRange = Range 5 10, token = Token RegexMatch (GroupMatch ["10","00"]), children = [], rule = Nothing}], rule = Just "hh:mm"},Node {nodeRange = Range 10 12, token = Token RegexMatch (GroupMatch ["","a","","m"]), children = [], rule = Nothing}], rule = Just "<time-of-day> am|pm"},Node {nodeRange = Range 13 22, token = Token Time TimeData{latent=False, grain=Day, form=Just DayOfWeek, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = [Node 
{nodeRange = Range 13 15, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},Node {nodeRange = Range 16 22, token = Token Time TimeData{latent=False, grain=Day, form=Just DayOfWeek, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = [Node {nodeRange = Range 16 22, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing}], rule = Just "Friday"}], rule = Just "on <day>"}], rule = Just "intersect"},Node {nodeRange = Range 23 38, token = Token Time TimeData{latent=False, grain=Day, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = [Node {nodeRange = Range 23 27, token = Token Ordinal (OrdinalData {value = 15}), children = [Node {nodeRange = Range 23 27, token = Token RegexMatch (GroupMatch ["15","th"]), children = [], rule = Nothing}], rule = Just "ordinal (digits)"},Node {nodeRange = Range 28 33, token = Token Time TimeData{latent=False, grain=Month, form=Just (Month {month = 4}), direction=Nothing, holiday=Nothing, hasTimezone=False}, children = [Node {nodeRange = Range 28 33, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing}], rule = Just "April"},Node {nodeRange = Range 34 38, token = Token RegexMatch (GroupMatch ["2016"]), children = [], rule = Nothing}], rule = Just "<day-of-month>(ordinal) <named-month> year"}], rule = Just "intersect"},Node {nodeRange = Range 39 44, token = Token RegexMatch (GroupMatch ["un",""]), children = [], rule = Nothing},Node {nodeRange = Range 45 79, token = Token Time TimeData{latent=False, grain=Minute, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = [Node {nodeRange = Range 45 52, token = Token Time TimeData{latent=False, grain=Minute, form=Just (TimeOfDay {hours = Nothing, is12H = False}), direction=Nothing, holiday=Nothing, hasTimezone=False}, children = [Node {nodeRange = Range 45 50, token = Token Time TimeData{latent=False, grain=Minute, form=Just (TimeOfDay {hours = Just 10, is12H = 
True}), direction=Nothing, holiday=Nothing, hasTimezone=False}, children = [Node {nodeRange = Range 45 50, token = Token RegexMatch (GroupMatch ["10","00"]), children = [], rule = Nothing}], rule = Just "hh:mm"},Node {nodeRange = Range 50 52, token = Token RegexMatch (GroupMatch ["","a","","m"]), children = [], rule = Nothing}], rule = Just "<time-of-day> am|pm"},Node {nodeRange = Range 53 79, token = Token Time TimeData{latent=False, grain=Day, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = [Node {nodeRange = Range 53 55, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing},Node {nodeRange = Range 56 79, token = Token Time TimeData{latent=False, grain=Day, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = [Node {nodeRange = Range 56 63, token = Token Time TimeData{latent=False, grain=Day, form=Just DayOfWeek, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = [Node {nodeRange = Range 56 63, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing}], rule = Just "Tuesday"},Node {nodeRange = Range 64 79, token = Token Time TimeData{latent=False, grain=Day, form=Nothing, direction=Nothing, holiday=Nothing, hasTimezone=False}, children = [Node {nodeRange = Range 64 68, token = Token Ordinal (OrdinalData {value = 19}), children = [Node {nodeRange = Range 64 68, token = Token RegexMatch (GroupMatch ["19","th"]), children = [], rule = Nothing}], rule = Just "ordinal (digits)"},Node {nodeRange = Range 69 74, token = Token Time TimeData{latent=False, grain=Month, form=Just (Month {month = 4}), direction=Nothing, holiday=Nothing, hasTimezone=False}, children = [Node {nodeRange = Range 69 74, token = Token RegexMatch (GroupMatch []), children = [], rule = Nothing}], rule = Just "April"},Node {nodeRange = Range 75 79, token = Token RegexMatch (GroupMatch ["2016"]), children = [], rule = Nothing}], rule = Just "<day-of-month>(ordinal) <named-month> year"}], rule 
= Just "intersect"}], rule = Just "on <day>"}], rule = Just "intersect"}], rule = Just "from <datetime> - <datetime> (interval)"}}]