CQCL / lambeq

A high-level Python library for Quantum Natural Language Processing
https://docs.quantinuum.com/lambeq/
Apache License 2.0

Feature Request: Default spaCy and UNK tokenization enabled for the diagrams. #23

Closed ACE07-Sev closed 2 years ago

ACE07-Sev commented 2 years ago

Since tokenization is needed in virtually every NLP use case, it would be more efficient to have it enabled by default (set to True). A spaCy tokenizer would also make a good default, since it helps the model generalize: contracted and expanded forms such as "he's" and "he is", "they're" and "they are", or "I'm" and "I am" would be treated as the same by the model. Because these tokenizers are used so widely, having them built in would make the models more efficient and more pleasant to use.
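To illustrate the kind of normalization meant here, below is a minimal standard-library sketch. It is not lambeq's or spaCy's API: the `CONTRACTIONS` table and the `normalise` function are hypothetical names made up for this example, and the mapping is hand-picked (note that "he's" can also mean "he has", which a real tokenizer would need to disambiguate).

```python
# Hypothetical, hand-picked mapping for illustration only.
CONTRACTIONS = {
    "he's": "he is",
    "they're": "they are",
    "i'm": "i am",
}


def normalise(sentence: str) -> list[str]:
    """Lower-case a sentence, expand known contractions, return word tokens."""
    tokens = []
    for word in sentence.lower().split():
        # Drop surrounding punctuation so "they're," still matches the table.
        core = word.strip(".,!?;:")
        expanded = CONTRACTIONS.get(core, core)
        tokens.extend(expanded.split())
    return tokens


if __name__ == "__main__":
    print(normalise("He's here, and they're late."))
    print(normalise("I'm ready") == normalise("I am ready"))
```

With this kind of preprocessing, both spellings of a sentence produce the same token sequence, so they would map to the same diagram.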

Furthermore, tokenizers for non-ASCII letters, numbers, and acronyms (such as idk, tbh, rn, etc.) would be useful additional features, and could be exposed as extra boolean parameters on the parser.
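As a rough sketch of what such boolean switches might look like, the snippet below uses only the standard library. Everything in it is an assumption made for illustration: the function names `preprocess` and `replace_unknown`, the flag names, the `<NUM>` and `<UNK>` placeholders, and the small acronym table are not part of lambeq.

```python
import re

UNK = "<UNK>"  # assumed placeholder for out-of-vocabulary words
NUM = "<NUM>"  # assumed placeholder for numbers

# Hypothetical acronym table for illustration only.
ACRONYMS = {"idk": "i do not know", "tbh": "to be honest", "rn": "right now"}


def preprocess(tokens, *, expand_acronyms=False, mask_numbers=False,
               drop_non_ascii=False):
    """Apply optional token-level clean-up steps, each behind a flag."""
    out = []
    for tok in tokens:
        if expand_acronyms and tok.lower() in ACRONYMS:
            out.extend(ACRONYMS[tok.lower()].split())
        elif mask_numbers and re.fullmatch(r"\d+(\.\d+)?", tok):
            out.append(NUM)
        elif drop_non_ascii and not tok.isascii():
            continue  # skip tokens containing non-ASCII characters
        else:
            out.append(tok)
    return out


def replace_unknown(tokens, vocab):
    """Replace every token that is not in the vocabulary with UNK."""
    return [tok if tok in vocab else UNK for tok in tokens]


if __name__ == "__main__":
    raw = ["idk", "what", "café", "costs", "42"]
    print(preprocess(raw, expand_acronyms=True, mask_numbers=True,
                     drop_non_ascii=True))
    print(replace_unknown(["i", "know", "qubits"], {"i", "know"}))
```

Keeping every step off by default means callers who pass already-clean input see no change in behaviour, which is one way the extra parameters could be added without affecting existing code.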