facebook / duckling

Language, engine, and tooling for expressing, testing, and evaluating composable language rules on input strings.

Year extraction not parsed correctly #673

Open PrajnyaSatish opened 2 years ago

PrajnyaSatish commented 2 years ago

Is there a way to detect and ignore numerals that have text directly attached to them? For example, for the input "from 10 - 16:00 in 650EK", Duckling parses 650 as a year:

[
   {
      "body" : "from 10 - 16:00 in 650",
      "dim" : "time",
      "end" : 22,
      "latent" : false,
      "start" : 0,
      "value" : {
         "from" : {
            "grain" : "minute",
            "value" : "0650-01-01T00:00:00.000-08:00"
         },
         "to" : {
            "grain" : "minute",
            "value" : "0650-01-01T16:00:00.000-08:00"
         },
         "type" : "interval",
         "values" : [
            {
               "from" : {
                  "grain" : "minute",
                  "value" : "0650-01-01T00:00:00.000-08:00"
               },
               "to" : {
                  "grain" : "minute",
                  "value" : "0650-01-01T16:00:00.000-08:00"
               },
               "type" : "interval"
            },
            {
               "from" : {
                  "grain" : "minute",
                  "value" : "0650-01-01T10:00:00.000-08:00"
               },
               "to" : {
                  "grain" : "minute",
                  "value" : "0650-01-01T16:00:00.000-08:00"
               },
               "type" : "interval"
            },
            {
               "from" : {
                  "grain" : "minute",
                  "value" : "0650-01-01T22:00:00.000-08:00"
               },
               "to" : {
                  "grain" : "minute",
                  "value" : "0650-01-02T16:00:00.000-08:00"
               },
               "type" : "interval"
            }
         ]
      }
   }
]

However, I do not want 650 to be parsed as a year.

stroxler commented 2 years ago

I think that might be hard to do: we want to split the text into separate tokens, because inputs like 25lbs (and even dates, like 2021AD) need the number separated from the attached text.

As a result, though, I'm not sure whether the current Duckling engine can tell the rules layer when a number has text attached with no space in between. But cc @chessai, who knows more about the backend than I do.
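
Until the engine can expose this, one possible workaround is to filter the results on the client side. The sketch below is not part of Duckling: `drop_glued_entities` is a hypothetical helper, and it assumes that `start` and `end` are character offsets into the input string, with `end` exclusive, as the output above suggests (a body of 22 characters with `"end" : 22`). It discards any entity whose span touches a letter or digit with no separator in between.

```python
def drop_glued_entities(text, entities):
    """Keep only entities whose span is not directly adjacent to a letter or digit.

    Hypothetical post-processing helper, not a Duckling API. Assumes each entity
    carries "start" (inclusive) and "end" (exclusive) character offsets into text.
    """
    kept = []
    for entity in entities:
        start, end = entity["start"], entity["end"]
        # A span is "glued" when the character just outside it is alphanumeric.
        glued_before = start > 0 and text[start - 1].isalnum()
        glued_after = end < len(text) and text[end].isalnum()
        if not (glued_before or glued_after):
            kept.append(entity)
    return kept


text = "from 10 - 16:00 in 650EK"
# Trimmed copy of the entity from the report; only the offsets matter here.
entities = [{"body": "from 10 - 16:00 in 650", "dim": "time", "start": 0, "end": 22}]

print(drop_glued_entities(text, entities))  # prints []
```

Note that this drops the whole interval, including the valid "10 - 16:00" part, so a caller would probably need to re-parse the text with the glued token removed. It should not affect cases like 25lbs or 2021AD as long as the returned body covers the attached text.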

PrajnyaSatish commented 2 years ago

Hi @chessai, any comments?