UniversalDataTool / universal-data-tool

Collaborate & label any type of data, images, text, or documents, in an easy web interface or desktop app.
https://universaldatatool.com
MIT License

Tokenization of two ints separated by a space #508

Open jmn319 opened 3 years ago

jmn319 commented 3 years ago

Full disclosure: I have only spent a handful of hours with the tool, so my apologies if there is an easy fix for this.

I started with a data set where it's very common to see two ints in a row, separated by whitespace (a single space or multiple spaces). In the labeling UI, I noticed that the two ints are grouped together as one token. They are also treated as one token when they are separated by a comma. Screenshots for a full repro are below.

Any thoughts on how I can get these to be separate tokens? I'm hoping there are some simple settings I can change.
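
To make the expected behavior concrete, here is a minimal sketch of the splitting I had in mind. It is not the tool's actual tokenizer (I don't know how that works internally); the `tokenize` function and its regex are my own hypothetical illustration, written in Python only for brevity.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer illustrating the expected output.

    Emits runs of digits, runs of letters, and individual punctuation
    characters as separate tokens. Whitespace of any length is dropped.
    """
    # \d+        : a run of digits (one int)
    # [^\W\d_]+  : a run of letters
    # [^\w\s]|_  : a single punctuation character
    return re.findall(r"\d+|[^\W\d_]+|[^\w\s]|_", text)


print(tokenize("12   34"))  # ['12', '34']
print(tokenize("12,34"))    # ['12', ',', '34']
```

With splitting like this, each int would be selectable on its own in the labeling UI, whether the separator is one space, several spaces, or a comma.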

[Screenshot: udt-token1]

[Screenshot: udt-token2]

[Screenshot: udt-token3]