CQCL / lambeq

A high-level Python library for Quantum Natural Language Processing
https://cqcl.github.io/lambeq/
Apache License 2.0

Bobcat fails with extra space tokens #120

Status: Closed (closed by toumix 5 months ago)

toumix commented 9 months ago

This happens only when tokenising, not with raw strings:

from lambeq import BobcatParser, SpacyTokeniser

sentence = "Alice  sleeps"  # note the two consecutive spaces

tokeniser, parser = SpacyTokeniser(), BobcatParser()
tokens = tokeniser.tokenise_sentence(sentence)
tree = parser.sentence2tree(tokens, tokenised=True)  # raises BobcatParseError

I get the following error:

ValueError                                Traceback (most recent call last)
File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/lambeq/text2diagram/bobcat_parser.py:291, in BobcatParser.sentences2trees(self, sentences, tokenised, suppress_exceptions, verbose)
    290 try:
--> 291     sentence_input = self._prepare_sentence(sent, tags)
    292     result = self.parser(sentence_input)

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/lambeq/text2diagram/bobcat_parser.py:214, in BobcatParser._prepare_sentence(sent, tags)
    212 spans = {(start, end): {id: score for id, score in scores}
    213          for start, end, scores in sent.spans}
--> 214 return Sentence(sent.words, sent_tags, spans)

File <string>:6, in __init__(self, words, input_supertags, span_scores)

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/lambeq/bobcat/parser.py:62, in Sentence.__post_init__(self)
     61 if len(self.words) != len(self.input_supertags):
---> 62     raise ValueError(
     63             '`words` must be the same length as `input_supertags`')

ValueError: `words` must be the same length as `input_supertags`

The above exception was the direct cause of the following exception:

BobcatParseError                          Traceback (most recent call last)
Cell In[17], line 1
----> 1 tree = parser.sentence2tree(tokens, tokenised=True)

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/lambeq/text2diagram/ccg_parser.py:108, in CCGParser.sentence2tree(self, sentence, tokenised, suppress_exceptions)
    104         raise ValueError('`tokenised` set to `True`, but variable '
    105                          '`sentence` does not have type '
    106                          '`list[str]`.')
    107     sent: list[str] = [str(token) for token in sentence]
--> 108     return self.sentences2trees(
    109                     [sent],
    110                     suppress_exceptions=suppress_exceptions,
    111                     tokenised=tokenised,
    112                     verbose=VerbosityLevel.SUPPRESS.value)[0]
    113 else:
    114     if not isinstance(sentence, str):

File ~/.pyenv/versions/3.10.9/lib/python3.10/site-packages/lambeq/text2diagram/bobcat_parser.py:298, in BobcatParser.sentences2trees(self, sentences, tokenised, suppress_exceptions, verbose)
    296                 trees.append(None)
    297             else:
--> 298                 raise BobcatParseError(' '.join(sent.words)) from e
    300 for i in empty_indices:
    301     trees.insert(i, None)

BobcatParseError: Bobcat failed to parse 'Alice   sleeps'.

toumix commented 9 months ago

Also fails with just a space:

parser.sentence2tree(tokeniser.tokenise_sentence(' '), tokenised=True)

I get some weird error:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x0 and 2048x968)
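
For affected versions, one possible workaround is to drop whitespace-only tokens before handing the list to the parser, and to skip parsing when nothing is left. The sketch below is not part of the lambeq API: `drop_space_tokens` and `parse_tokens` are hypothetical helper names, and it assumes that the extra `' '` token is what makes `words` and `input_supertags` differ in length, which the traceback suggests but does not prove.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def drop_space_tokens(tokens: Sequence[str]) -> List[str]:
    """Return the tokens with empty and whitespace-only entries removed."""
    return [tok for tok in tokens if tok.strip()]


def parse_tokens(tokens: Sequence[str],
                 parse: Callable[[List[str]], T]) -> Optional[T]:
    """Clean the tokens, then call `parse`; return None if no tokens remain.

    `parse` is any callable that accepts a list of string tokens
    (hypothetical wrapper, chosen so the sketch runs without lambeq).
    """
    cleaned = drop_space_tokens(tokens)
    if not cleaned:
        # Nothing to parse, e.g. the input sentence was just ' '.
        return None
    return parse(cleaned)


# Stand-in parser for illustration: counts the tokens it receives.
print(parse_tokens(["Alice", " ", "sleeps"], len))
print(parse_tokens([" "], len))
```

With the objects from the report, the call would presumably look like `parse_tokens(tokeniser.tokenise_sentence(sentence), lambda t: parser.sentence2tree(t, tokenised=True))`. Returning `None` for an all-whitespace input is a design choice of this sketch, not documented library behaviour.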

dimkart commented 9 months ago

Thanks for spotting this, will be fixed in the next release.

dimkart commented 5 months ago

This is now fixed in version 0.4. The issue will be closed.