Clear-Bible / ClearDashboard

The ClearDashboard project

[BUG]: Chinese Tokenizer unable to tokenize text without zwsp formatting #1022

Closed GeoDirk closed 8 months ago

GeoDirk commented 11 months ago

Description:

We need the interlinear to parse Chinese text better. Right now, Dashboard interprets everything between commas as ONE word. Most Chinese words are one or two characters, and there is usually no space between words in Chinese texts.

Additional info by Randall:

There are no spaces between Chinese words. As the user says, most Chinese words are one or two characters. Are the Chinese words being segmented properly when aligning? Unless the text you're using already pre-segments the Chinese for you, I suspect the user is complaining that Chinese words are not segmented. The highlighted words have "Moab" linked to "living [in the] land" and do not align to the equivalent name "Moab." I suspect the alignments are basically all wrong because the Chinese has not been pre-segmented.

Steps To Reproduce:

Use a Chinese Paratext resource and import it with the ZWSP tokenizer. The tokenizer is not doing its job, or we do not have a tokenizer that is applicable to this type of situation.
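
For context on why the ZWSP tokenizer produces one giant token here: it effectively splits on the zero-width space (U+200B) that pre-segmented Chinese texts place between words, so text that was never formatted with ZWSP passes through whole. A minimal standalone sketch of that behavior (illustrative only, not the actual Machine/ClearDashboard code):

```csharp
using System;

class ZwspTokenizerDemo
{
    // Split on the zero-width space (U+200B) used by pre-segmented Chinese texts.
    static string[] TokenizeByZwsp(string text) =>
        text.Split(new[] { '\u200B' }, StringSplitOptions.RemoveEmptyEntries);

    static void Main()
    {
        // Pre-segmented text (ZWSP between words) tokenizes as expected.
        string segmented = "拿俄米\u200B和\u200B路得";
        Console.WriteLine(string.Join(" | ", TokenizeByZwsp(segmented)));
        // Output: 拿俄米 | 和 | 路得

        // Typical resource text with no ZWSP: the whole clause survives as a
        // single token, which is the behavior reported in this bug.
        string unsegmented = "拿俄米和路得";
        Console.WriteLine(string.Join(" | ", TokenizeByZwsp(unsegmented)));
        // Output: 拿俄米和路得
    }
}
```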

Actual Results:

All characters (really words) between the punctuation are treated like one big word:

(screenshot attached to the original issue)

Found in Version:

1.2.0.12

Fixed in Version:

(Clear Dashboard version in which the bug was fixed.)

GeoDirk commented 11 months ago

Additional info from Russell:

If one uses the whitespace tokenizer, it will indeed tokenize (correctly) by whitespace. Another approach is to tokenize by character. I don't recall whether Machine has this, but if not it can be built fairly easily. However, Chinese 'words' (chunks of meaning) commonly come in character pairs, to my knowledge (via my wife). To tokenize such words, one of the following would be needed:

This is not a trivial issue. Even the newer transformer models (mBERT, NLLB) struggle with it, and there are no automated ways I'm aware of to tokenize as Randall describes. However, conceptually, such pairs should be embodied in NLLB's decoder attention layer embeddings, and if someone figured out how to extract them, that could be used to build #1's list. I am not aware of anyone doing this yet, though, even in research.
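
As a side note on the "tokenize by character" fallback mentioned above: if Machine does not already provide one, a standalone sketch could be as small as the following (hypothetical code, not Machine's API, and not the tokenizer that was ultimately adopted):

```csharp
using System.Collections.Generic;
using System.Globalization;

static class CharacterTokenizer
{
    // Emit one token per text element, so every Chinese character
    // (and every punctuation mark) becomes its own token.
    public static IEnumerable<string> Tokenize(string text)
    {
        var elements = StringInfo.GetTextElementEnumerator(text);
        while (elements.MoveNext())
            yield return (string)elements.Current;
    }
}
```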

Perhaps Randall has a list of Chinese character pairs? Then a tokenizer that uses it in a lookup fashion could split the Chinese into sensible chunks of meaning that roughly correspond to the other language's tokens.
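
If such a pair list existed, the lookup tokenizer described here could be a greedy longest-match pass over it, falling back to single characters for anything not in the list. A hedged sketch (the word list and class name below are invented for illustration, not an actual Clear resource):

```csharp
using System;
using System.Collections.Generic;

static class PairLookupTokenizer
{
    // Hypothetical word list; in practice this would come from the
    // character-pair list asked about above.
    static readonly HashSet<string> KnownWords = new HashSet<string>
    {
        "摩押",  // Moab
        "路得",  // Ruth
        "拿俄米" // Naomi
    };

    // Greedy longest match: try the longest known word starting at i,
    // otherwise fall back to emitting the single character.
    public static IEnumerable<string> Tokenize(string text, int maxWordLength = 3)
    {
        int i = 0;
        while (i < text.Length)
        {
            string token = null;
            for (int len = Math.Min(maxWordLength, text.Length - i); len >= 2; len--)
            {
                string candidate = text.Substring(i, len);
                if (KnownWords.Contains(candidate))
                {
                    token = candidate;
                    break;
                }
            }
            if (token == null)
                token = text.Substring(i, 1);
            yield return token;
            i += token.Length;
        }
    }
}
```

With this toy list, Tokenize("拿俄米和路得住在摩押") yields 拿俄米 | 和 | 路得 | 住 | 在 | 摩押, roughly the chunking described; a real implementation would also need to handle punctuation, Latin runs, and surrogate pairs.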

mBERT cannot deal with pairs, but you can see that NLLB does recognize some Chinese word pairs (the last cell in this notebook, from my own experiments): https://colab.research.google.com/drive/14SxTy0p3WVlxiBkBWNEFs0Fdq4eQhAgn?usp=sharing

You can share this notebook with Randall; I'd be happy to experiment on it with him if he needs, but it is pretty experimental.

In this notebook I am ignoring NLLB's decoder and instead using an encoder self-attention layer's embeddings to determine alignments. But I'm guessing as to which layer; there is no documentation from Meta or others on this yet.

themikejr commented 11 months ago

Apologies for the drive-by suggestion, but I happened to see this issue and had something to offer.

I'm under the impression that Andi worked with someone at Stanford to implement a Chinese tokenizer in C#. It used some known algorithm to do so. I would expect that somewhere in our archives we have a copy of it. If not, Andi or Charles should be able to help us find it.

GeoDirk commented 11 months ago

@themikejr Are you talking about this project: https://github.com/gbin-org/CLEAR_2_TOOL_UseZeroWidthSpaces ?

themikejr commented 10 months ago

@GeoDirk I don't think so. The linked repo seems to reassemble tokenized Chinese words using invisible Unicode characters. What I'm thinking of is a real-deal Chinese tokenizer from before Charles's time, written by Andi and friends 4+ years ago. It would be in one of the old dumps from Andi, or you might just ask him about it. I recall he worked with someone from Stanford on it? I can't remember if we had full rights to it.

robertsonbrinker commented 9 months ago

solved in #1075

bpetri5 commented 9 months ago

@GeoDirk Can you give Roman and me access to a Chinese resource in Paratext so we can test?

GeoDirk commented 9 months ago

@bpetri5 @romanpoz From the upper-left hamburger menu in Paratext, click Download/Install Resources. When the dialog window shows up, uncheck Show resources only in languages that match... so that all the resources show up in the list below. Type chinese into the search box below that, and the list filters to resources you can test. The CCB and CSBT are good ones to try out.

When you now tokenize using the Chinese tokenizer, the Chinese text no longer comes in as one big tokenized phrase but as individual words. Some words are one character long; others are a couple of characters together.

bpetri5 commented 9 months ago

This feature has been added and is working for the most part. Any additional issues should be added as new tickets. Currently there are a few related tickets.