Currently, in our sentence-to-circuit pipeline, we feed a tokenised version of the input sentence into Bobcat to generate the CCG parse, and hence the expr (the tokenisation is done by lambeq's SpacyTokeniser).
However, the original sentence is later fed into a spaCy model to generate coreference chains.
If the spaCy model produces a different tokenisation to the lambeq SpacyTokeniser, things will break (e.g. different boxes, mismatched word indices).
We should find some way of ensuring the same tokenisation is used in both places.
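One possible approach (a minimal sketch, not our actual pipeline code): tokenise once with the lambeq SpacyTokeniser, pass those tokens to BobcatParser with `tokenised=True`, and build a pre-tokenised spaCy `Doc` from the same tokens so spaCy's own tokenizer never runs. The coref model name `en_coreference_web_trf` below is an assumption (the spacy-experimental coref pipeline); swap in whichever coref model we actually use.

```python
import spacy
from spacy.tokens import Doc
from lambeq import BobcatParser, SpacyTokeniser

tokeniser = SpacyTokeniser()
parser = BobcatParser()

# Assumed coref pipeline; replace with the model we actually load.
nlp = spacy.load("en_coreference_web_trf")

sentence = "Alice dropped her keys."

# Tokenise exactly once, with the lambeq tokeniser...
tokens = tokeniser.tokenise_sentence(sentence)

# ...reuse those tokens for the CCG parse / expr...
diagram = parser.sentence2diagram(tokens, tokenised=True)

# ...and for coreference: constructing a Doc from pre-split words
# skips spaCy's tokenizer, so the coref model sees the same tokens
# (and hence the same word indices) as Bobcat.
doc = nlp(Doc(nlp.vocab, words=tokens))
```

Since in spaCy v3 calling `nlp` on an existing `Doc` runs only the pipeline components, both sides of the pipeline are guaranteed to share one tokenisation, so word indices in the coreference chains line up with the boxes in the diagram.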