CQCL / lambeq

A high-level Python library for Quantum Natural Language Processing
https://cqcl.github.io/lambeq/
Apache License 2.0

Bobcat parser crashes on English contractions #60

Closed pmarcis closed 1 year ago

pmarcis commented 1 year ago

Hi!

When passing tokenised data containing English contractions, the parser crashes. Passing non-tokenised data also seems wrong, as the parser does not perform tokenisation internally (all punctuation gets attached to the preceding word, and contractions stay attached to the verb).

E.g.:

from lambeq import BobcatParser
bobcat_parser = BobcatParser()
diagram = bobcat_parser.sentence2diagram("Baby didn 't like it")
diagram.draw()

results in:


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
File .../site-packages/lambeq/text2diagram/bobcat_parser.py:382, in BobcatParser.sentences2trees(self, sentences, tokenised, suppress_exceptions, verbose)
    381     result = self.parser(sentence_input)
--> 382     trees.append(self._build_ccgtree(result[0]))
    383 except Exception:
File .../site-packages/lambeq/bobcat/parser.py:258, in ParseResult.__getitem__(self, index)
    256 def __getitem__(self, index: Union[int, slice]) -> Union[ParseTree,
    257                                                          list[ParseTree]]:
--> 258     return self.root[index]

IndexError: list index out of range

During handling of the above exception, another exception occurred:

BobcatParseError                          Traceback (most recent call last)
Cell In[2], line 1
----> 1 diagram = bobcat_parser.sentence2diagram("Baby didn 't like it")
      2 diagram.draw()

File .../site-packages/lambeq/text2diagram/ccg_parser.py:231, in CCGParser.sentence2diagram(self, sentence, tokenised, planar, suppress_exceptions)
    228 if not isinstance(sentence, str):
    229     raise ValueError('`tokenised` set to `False`, but variable '
    230                      '`sentence` does not have type `str`.')
--> 231 return self.sentences2diagrams(
    232                 [sentence],
    233                 planar=planar,
    234                 suppress_exceptions=suppress_exceptions,
    235                 tokenised=tokenised,
    236                 verbose=VerbosityLevel.SUPPRESS.value)[0]

File .../site-packages/lambeq/text2diagram/ccg_parser.py:161, in CCGParser.sentences2diagrams(self, sentences, tokenised, planar, suppress_exceptions, verbose)
    125 def sentences2diagrams(
    126         self,
    127         sentences: SentenceBatchType,
   (...)
    130         suppress_exceptions: bool = False,
    131         verbose: Optional[str] = None) -> list[Optional[Diagram]]:
    132     """Parse multiple sentences into a list of discopy diagrams.
    133 
    134     Parameters
   (...)
    159 
    160     """
--> 161     trees = self.sentences2trees(sentences,
    162                                  suppress_exceptions=suppress_exceptions,
    163                                  tokenised=tokenised,
    164                                  verbose=verbose)
    165     diagrams = []
    166     if verbose is None:

File .../site-packages/lambeq/text2diagram/bobcat_parser.py:387, in BobcatParser.sentences2trees(self, sentences, tokenised, suppress_exceptions, verbose)
    385                 trees.append(None)
    386             else:
--> 387                 raise BobcatParseError(' '.join(sent.words))
    389 for i in empty_indices:
    390     trees.insert(i, None)

BobcatParseError: Bobcat failed to parse "Baby didn 't like it".
dimkart commented 1 year ago

Hi, you can use lambeq's SpacyTokeniser class to tokenise your sentences before feeding them to the parser. From the command-line interface, you can just use the -t option. If you want to provide the sentence already tokenised, be sure to separate the words correctly, i.e. "did" and "n't", as below; otherwise the model will not recognise "didn" as a proper word.

[screenshot: the sentence correctly tokenised as "Baby did n't like it"]

Hope that helps.
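For reference, the contraction splitting described above can be sketched with a small regex-based function. This is only an illustrative approximation of Penn-Treebank-style tokenisation, not lambeq's actual SpacyTokeniser (which delegates to spaCy):

```python
import re

def split_contractions(sentence: str) -> list[str]:
    """Roughly mimic Penn-Treebank-style splitting of English
    contractions: "didn't" -> ["did", "n't"], "it's" -> ["it", "'s"]."""
    tokens = []
    for word in sentence.split():
        # Split the n't negation off the verb: "didn't" -> "did" + "n't"
        m = re.match(r"(.*\w)(n't)$", word, re.IGNORECASE)
        if m:
            tokens.extend([m.group(1), m.group(2)])
            continue
        # Split clitics ('s, 're, 'll, 've, 'd, 'm) off the host word
        m = re.match(r"(.*\w)('(?:s|re|ll|ve|d|m))$", word, re.IGNORECASE)
        if m:
            tokens.extend([m.group(1), m.group(2)])
            continue
        tokens.append(word)
    return tokens

print(split_contractions("Baby didn't like it"))
# -> ['Baby', 'did', "n't", 'like', 'it']
```

The resulting token list can then be passed to the parser with the `tokenised=True` flag (visible in the traceback above), e.g. `bobcat_parser.sentence2diagram(tokens, tokenised=True)`.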

pmarcis commented 1 year ago

Thanks! That solves this problem!