Possible mismatch between tokenisations

Currently, in our sentence-to-circuit pipeline, we feed a tokenised version of the input sentence to Bobcat to generate the CCG parse, and hence the expr. The tokenisation is done by the SpacyTokeniser in lambeq.

However, the original (untokenised) sentence is later fed into a spaCy model to generate coreference chains, and that model tokenises the sentence again on its own.

If the spaCy model produces a different tokenisation from the lambeq SpacyTokeniser, things will break (e.g. different boxes, mismatched word indices between the parse and the coreference chains).

We should find some way of ensuring the same tokenisation is used in both places.
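
As a first step, a check along these lines could at least detect when the two tokenisations disagree. This is only a sketch: `first_token_mismatch` is a hypothetical helper, not something in the repo, and the token lists are hand-written for illustration. In the pipeline the two lists would presumably come from `SpacyTokeniser().tokenise_sentence(sentence)` and `[t.text for t in nlp(sentence)]`, where `nlp` is whatever spaCy model is used for coreference.

```python
from typing import List, Optional, Tuple


def first_token_mismatch(
    parser_tokens: List[str], coref_tokens: List[str]
) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (index, parser_token, coref_token) at the first difference, or None.

    A token is reported as None when one list ends before the other.
    """
    # Compare position by position over the shared prefix.
    for i, (a, b) in enumerate(zip(parser_tokens, coref_tokens)):
        if a != b:
            return (i, a, b)
    # Same prefix but different lengths: the mismatch is at the shorter length.
    if len(parser_tokens) != len(coref_tokens):
        i = min(len(parser_tokens), len(coref_tokens))
        a = parser_tokens[i] if i < len(parser_tokens) else None
        b = coref_tokens[i] if i < len(coref_tokens) else None
        return (i, a, b)
    return None


# Hand-written token lists (assumption: NOT real tokeniser output).
parser_tokens = ["Alice", "does", "n't", "like", "Bob", "."]
coref_tokens = ["Alice", "doesn't", "like", "Bob", "."]

mismatch = first_token_mismatch(parser_tokens, coref_tokens)
print(mismatch)
```

Raising an error when the result is not `None` would make the failure visible at the point where the tokenisations diverge, rather than later as a wrong box or an off-by-one word index. It does not fix the underlying problem, which still needs a single tokenisation shared by both steps.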
