CQCL / text_to_discocirc

Apache License 2.0

Possible mismatch between tokenisations #27

Open JosephNathaniel opened 1 year ago

JosephNathaniel commented 1 year ago

Currently, in our sentence-to-circuit pipeline, we feed a tokenised version of the input sentence to bobcat to generate the CCG parse, and hence the expr. The tokenisation is done by the SpacyTokeniser in lambeq.

However, the original (untokenised) sentence is later fed into a spaCy model to generate coreference chains.

If the spaCy model produces a different tokenisation from the lambeq SpacyTokeniser, the two stages will disagree about the words in the sentence, and things will break (e.g. different boxes, mismatched word indices).

We should find some way of ensuring that the same tokenisation is used in both places.
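
As a stopgap, a check along the following lines could at least make a mismatch fail loudly instead of silently producing wrong indices. This is only a minimal sketch: it assumes both tokenisations are available as plain lists of strings, `first_mismatch` is a hypothetical helper that does not exist in the repo, and the example token lists are made up for illustration rather than taken from actual tokeniser output.

```python
from typing import List, Optional, Tuple

Mismatch = Tuple[int, Optional[str], Optional[str]]


def first_mismatch(parser_tokens: List[str],
                   coref_tokens: List[str]) -> Optional[Mismatch]:
    """Return (index, parser_token, coref_token) at the first position where
    the two tokenisations disagree, or None if they are identical.

    A token is reported as None when one list ends before the other.
    """
    # Compare position by position over the common prefix.
    for i, (a, b) in enumerate(zip(parser_tokens, coref_tokens)):
        if a != b:
            return i, a, b
    # Same prefix but different lengths: one list has extra tokens.
    if len(parser_tokens) != len(coref_tokens):
        i = min(len(parser_tokens), len(coref_tokens))
        a = parser_tokens[i] if i < len(parser_tokens) else None
        b = coref_tokens[i] if i < len(coref_tokens) else None
        return i, a, b
    return None


# Illustrative (made-up) token lists: one splits the contraction, one does not.
parser_tokens = ["Alice", "does", "n't", "like", "Bob", "."]
coref_tokens = ["Alice", "doesn't", "like", "Bob", "."]

mismatch = first_mismatch(parser_tokens, coref_tokens)
if mismatch is not None:
    index, parser_tok, coref_tok = mismatch
    print(f"Tokenisations diverge at index {index}: "
          f"{parser_tok!r} vs {coref_tok!r}")
```

A check like this only detects the problem. Actually fixing it would mean making both stages consume the same token list.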
